129.213.40.229 (oracle3.foldingathome.org) is down

Posted: Tue Nov 03, 2020 9:18 am
by jnv11
129.213.40.229 (oracle3.foldingathome.org) is down. See the log below:

Code:

*********************** Log Started 2020-11-03T09:11:19Z ***********************
09:11:19:FS00:Initialized folding slot 00: cpu:18
09:11:19:WU02:FS00:Connecting to assign1.foldingathome.org:80
09:11:20:WU02:FS00:Assigned to work server 129.213.40.229
09:11:20:WU02:FS00:Requesting new work unit for slot 00: cpu:18 from 129.213.40.229
09:11:20:WU02:FS00:Connecting to 129.213.40.229:8080
09:11:41:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
09:11:41:WU02:FS00:Connecting to 129.213.40.229:80
09:12:02:ERROR:WU02:FS00:Exception: Failed to connect to 129.213.40.229:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
09:12:03:WU02:FS00:Connecting to assign1.foldingathome.org:80
09:12:03:WU02:FS00:Assigned to work server 129.213.40.229
09:12:03:WU02:FS00:Requesting new work unit for slot 00: cpu:18 from 129.213.40.229
09:12:03:WU02:FS00:Connecting to 129.213.40.229:8080
09:12:24:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
09:12:24:WU02:FS00:Connecting to 129.213.40.229:80
09:12:45:ERROR:WU02:FS00:Exception: Failed to connect to 129.213.40.229:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
09:13:03:WU02:FS00:Connecting to assign1.foldingathome.org:80
09:13:03:WU02:FS00:Assigned to work server 129.213.40.229
09:13:03:WU02:FS00:Requesting new work unit for slot 00: cpu:18 from 129.213.40.229
09:13:03:WU02:FS00:Connecting to 129.213.40.229:8080
09:13:24:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
09:13:24:WU02:FS00:Connecting to 129.213.40.229:80
09:13:57:ERROR:WU02:FS00:Exception: 10002: Received short response, expected 512 bytes, got 0
09:14:40:WU02:FS00:Connecting to assign1.foldingathome.org:80
09:14:40:WARNING:WU02:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
09:14:40:WU02:FS00:Connecting to assign2.foldingathome.org:80
09:14:40:WU02:FS00:Assigned to work server 129.213.40.229
09:14:40:WU02:FS00:Requesting new work unit for slot 00: cpu:18 from 129.213.40.229
09:14:40:WU02:FS00:Connecting to 129.213.40.229:8080

Opening its website (http://oracle3.foldingathome.org/) from the server status page times out. Pings fail as well, but I do not know whether a firewall is discarding them or whether there is some other cause.
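
Since a failed ping cannot tell a filtered host from a dead one, a direct TCP connection attempt on the ports the client uses gives a somewhat clearer picture. The following is a minimal sketch, not part of FAHClient; the function name `check_port` and its return labels are made up for illustration, and only the ports (8080, then 80) come from the log above.

```python
import socket


def check_port(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify one TCP connection attempt (hypothetical helper).

    Returns 'open', 'refused', 'timeout', or 'error'.
    """
    try:
        # create_connection performs the full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout.
        return "timeout"
    except OSError:
        # Anything else, e.g. no route to host or a DNS failure.
        return "error"


# Example calls mirroring the two attempts in the log (not run here):
#   check_port("129.213.40.229", 8080)
#   check_port("129.213.40.229", 80)
```

A result of `refused` would mean the machine is reachable but the service is not listening. A `timeout` on both ports is consistent with either a host that is down or a firewall silently dropping packets, so this check narrows things down without fully settling the question.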

Re: 129.213.40.229 (oracle3.foldingathome.org) is down

Posted: Tue Nov 03, 2020 4:00 pm
by Joe_H
I have sent a notification to the people responsible for that server.

Re: 129.213.40.229 (oracle3.foldingathome.org) is down

Posted: Fri Nov 06, 2020 2:38 am
by JimF
It is still down.

Code:

01:20:21:WU00:FS00:0xa7:*********************** Log Started 2020-11-06T01:20:20Z ***********************
01:20:21:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
01:20:21:WU00:FS00:0xa7:       Type: 0xa7
01:20:21:WU00:FS00:0xa7:       Core: Gromacs
01:20:21:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 7454 -checkpoint 15 -np 8
01:20:21:WU00:FS00:0xa7:************************************ CBang *************************************
01:20:21:WU00:FS00:0xa7:       Date: Nov 27 2019
01:20:21:WU00:FS00:0xa7:       Time: 11:26:54
01:20:21:WU00:FS00:0xa7:   Revision: d25803215b59272441049dfa05a0a9bf7a6e3c48
01:20:21:WU00:FS00:0xa7:     Branch: master
01:20:21:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
01:20:21:WU00:FS00:0xa7:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
01:20:21:WU00:FS00:0xa7:             -fno-pie -fPIC
01:20:21:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
01:20:21:WU00:FS00:0xa7:       Bits: 64
01:20:21:WU00:FS00:0xa7:       Mode: Release
01:20:21:WU00:FS00:0xa7:************************************ System ************************************
01:20:21:WU00:FS00:0xa7:        CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
01:20:21:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
01:20:21:WU00:FS00:0xa7:       CPUs: 12
01:20:21:WU00:FS00:0xa7:     Memory: 15.57GiB
01:20:21:WU00:FS00:0xa7:Free Memory: 7.91GiB
01:20:21:WU00:FS00:0xa7:    Threads: POSIX_THREADS
01:20:21:WU00:FS00:0xa7: OS Version: 5.4
01:20:21:WU00:FS00:0xa7:Has Battery: false
01:20:21:WU00:FS00:0xa7: On Battery: false
01:20:21:WU00:FS00:0xa7: UTC Offset: -5
01:20:21:WU00:FS00:0xa7:        PID: 7458
01:20:21:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
01:20:21:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
01:20:21:WU00:FS00:0xa7:    Version: 0.0.19
01:20:21:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
01:20:21:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
01:20:21:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
01:20:21:WU00:FS00:0xa7:       Date: Nov 26 2019
01:20:21:WU00:FS00:0xa7:       Time: 00:41:42
01:20:21:WU00:FS00:0xa7:   Revision: d5b5c747532224f986b7cd02c968ed9a20c16d6e
01:20:21:WU00:FS00:0xa7:     Branch: master
01:20:21:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
01:20:21:WU00:FS00:0xa7:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
01:20:21:WU00:FS00:0xa7:             -fno-pie
01:20:21:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
01:20:21:WU00:FS00:0xa7:       Bits: 64
01:20:21:WU00:FS00:0xa7:       Mode: Release
01:20:21:WU00:FS00:0xa7:************************************ Build *************************************
01:20:21:WU00:FS00:0xa7:       SIMD: avx_256
01:20:21:WU00:FS00:0xa7:********************************************************************************
01:20:21:WU00:FS00:0xa7:Project: 16925 (Run 9, Clone 8643, Gen 1)
01:20:21:WU00:FS00:0xa7:Unit: 0x0000000181d528e500000000000921c3
01:20:21:WU00:FS00:0xa7:Digital signatures verified
01:20:21:WU00:FS00:0xa7:Calling: mdrun -s frame1.tpr -o frame1.trr -cpi state.cpt -cpt 15 -nt 8
01:20:21:WU00:FS00:0xa7:Steps: first=500000 total=500000
01:20:22:WU00:FS00:0xa7:Completed 348622 out of 500000 steps (69%)
01:20:58:WU00:FS00:0xa7:Completed 350000 out of 500000 steps (70%)
01:23:14:WU00:FS00:0xa7:Completed 355000 out of 500000 steps (71%)
01:25:32:WU00:FS00:0xa7:Completed 360000 out of 500000 steps (72%)
01:27:47:WU00:FS00:0xa7:Completed 365000 out of 500000 steps (73%)
01:30:04:WU00:FS00:0xa7:Completed 370000 out of 500000 steps (74%)
01:32:20:WU00:FS00:0xa7:Completed 375000 out of 500000 steps (75%)
01:34:34:WU00:FS00:0xa7:Completed 380000 out of 500000 steps (76%)
01:36:49:WU00:FS00:0xa7:Completed 385000 out of 500000 steps (77%)
01:39:07:WU00:FS00:0xa7:Completed 390000 out of 500000 steps (78%)
01:41:22:WU00:FS00:0xa7:Completed 395000 out of 500000 steps (79%)
01:43:37:WU00:FS00:0xa7:Completed 400000 out of 500000 steps (80%)
01:45:53:WU00:FS00:0xa7:Completed 405000 out of 500000 steps (81%)
01:48:08:WU00:FS00:0xa7:Completed 410000 out of 500000 steps (82%)
01:50:27:WU00:FS00:0xa7:Completed 415000 out of 500000 steps (83%)
01:52:45:WU00:FS00:0xa7:Completed 420000 out of 500000 steps (84%)
01:55:00:WU00:FS00:0xa7:Completed 425000 out of 500000 steps (85%)
01:57:15:WU00:FS00:0xa7:Completed 430000 out of 500000 steps (86%)
01:59:30:WU00:FS00:0xa7:Completed 435000 out of 500000 steps (87%)
02:01:44:WU00:FS00:0xa7:Completed 440000 out of 500000 steps (88%)
02:03:59:WU00:FS00:0xa7:Completed 445000 out of 500000 steps (89%)
02:06:20:WU00:FS00:0xa7:Completed 450000 out of 500000 steps (90%)
02:08:38:WU00:FS00:0xa7:Completed 455000 out of 500000 steps (91%)
02:10:53:WU00:FS00:0xa7:Completed 460000 out of 500000 steps (92%)
02:13:10:WU00:FS00:0xa7:Completed 465000 out of 500000 steps (93%)
02:15:26:WU00:FS00:0xa7:Completed 470000 out of 500000 steps (94%)
02:17:43:WU00:FS00:0xa7:Completed 475000 out of 500000 steps (95%)
02:20:00:WU00:FS00:0xa7:Completed 480000 out of 500000 steps (96%)
02:22:17:WU00:FS00:0xa7:Completed 485000 out of 500000 steps (97%)
02:24:35:WU00:FS00:0xa7:Completed 490000 out of 500000 steps (98%)
02:26:52:WU00:FS00:0xa7:Completed 495000 out of 500000 steps (99%)
02:26:53:WU01:FS00:Connecting to 65.254.110.245:80
02:26:54:WU01:FS00:Assigned to work server 129.213.40.229
02:26:54:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:10 from 129.213.40.229
02:26:54:WU01:FS00:Connecting to 129.213.40.229:8080
02:29:04:WARNING:WU01:FS00:WorkServer connection failed on port 8080 trying 80
02:29:04:WU01:FS00:Connecting to 129.213.40.229:80
02:29:07:WU00:FS00:0xa7:Completed 500000 out of 500000 steps (100%)
02:29:09:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
02:29:09:WU00:FS00:0xa7:Saving result file frame1.trr
02:29:09:WU00:FS00:0xa7:Saving result file md.log
02:29:09:WU00:FS00:0xa7:Saving result file science.log
02:29:09:WU00:FS00:0xa7:Saving result file traj_comp.xtc
02:29:09:WU00:FS00:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
02:29:09:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
02:29:09:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:16925 run:9 clone:8643 gen:1 core:0xa7 unit:0x0000000181d528e500000000000921c3
02:29:09:WU00:FS00:Uploading 6.16MiB to 129.213.40.229
02:29:09:WU00:FS00:Connecting to 129.213.40.229:8080
02:31:15:ERROR:WU01:FS00:Exception: Failed to connect to 129.213.40.229:80: Connection timed out
02:31:15:WU01:FS00:Connecting to 65.254.110.245:80
02:31:16:WU01:FS00:Assigned to work server 129.213.40.229
02:31:16:WU01:FS00:Requesting new work unit for slot 00: READY cpu:10 from 129.213.40.229
02:31:16:WU01:FS00:Connecting to 129.213.40.229:8080
02:31:19:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
02:31:19:WU00:FS00:Connecting to 129.213.40.229:80
02:33:09:FS00:Finishing
02:33:26:WARNING:WU01:FS00:WorkServer connection failed on port 8080 trying 80
02:33:26:WU01:FS00:Connecting to 129.213.40.229:80
02:33:31:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 129.213.40.229:80: Connection timed out
02:33:31:WU00:FS00:Trying to send results to collection server
02:33:31:WU00:FS00:Uploading 6.16MiB to 129.213.157.105
02:33:31:WU00:FS00:Connecting to 129.213.157.105:8080
02:33:34:WU00:FS00:Upload complete
02:33:34:WU00:FS00:Server responded WORK_ACK (400)
02:33:34:WU00:FS00:Final credit estimate, 12492.00 points
02:33:34:WU00:FS00:Cleaning up
02:35:26:FS00:Paused
02:35:37:ERROR:WU01:FS00:Exception: Failed to connect to 129.213.40.229:80: Connection timed out

Re: 129.213.40.229 (oracle3.foldingathome.org) is down

Posted: Fri Nov 06, 2020 2:52 am
by Joe_H
JimF wrote: It is still down.
Actually, it is up and has been for a day. The server is slow to respond to connections, and they are looking into that. Some WUs are being uploaded successfully; it may just take time and a few retries to get a connection going.
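
To illustrate the "time and a few retries" point in general terms, here is a small sketch of retrying a failing connection with increasing waits. It is not FAHClient's actual retry logic, which this thread does not describe; the name `retry_with_backoff`, its parameters, and the delay values are assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `attempts` is used up.

    The wait doubles after each failure (1s, 2s, 4s, ...) up to max_delay.
    Only OSError (which includes connection timeouts) triggers a retry.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[OSError] = None
    for attempt in range(attempts):
        try:
            return action()
        except OSError as err:
            last_error = err
            # No point in sleeping after the final failed attempt.
            if attempt < attempts - 1:
                sleep(min(max_delay, base_delay * 2 ** attempt))
    assert last_error is not None
    raise last_error
```

Spacing the attempts out, rather than reconnecting immediately, also avoids adding load to a server that is already slow to answer.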

Re: 129.213.40.229 (oracle3.foldingathome.org) is down

Posted: Sun Nov 08, 2020 6:05 am
by bruce
When a server has been down, some WUs may have been accepted by a Collection Server in lieu of the Work Server. Other WUs may have been refused and retained by FAHClient in everybody's local storage. When the server comes back online, there can be a backlog of those WUs waiting on everybody's machines, and this can saturate the WS's bandwidth. Connections have to be limited so that the server's bandwidth is used effectively without being exceeded until the backlog has been processed.

That might explain the slow response ... or it might be something else.
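
The idea of limiting connections while a backlog drains can be sketched with a counting semaphore. This is only an illustration of the general technique, not how the Work Server is implemented; `process_backlog`, the simulated upload, and the numbers used are all hypothetical.

```python
import threading
import time


def process_backlog(uploads: int, max_connections: int) -> int:
    """Run `uploads` simulated uploads, never more than `max_connections`
    at once, and return the peak number handled concurrently."""
    if max_connections < 1:
        raise ValueError("max_connections must be at least 1")
    slots = threading.BoundedSemaphore(max_connections)
    lock = threading.Lock()
    active = 0
    peak = 0

    def handle_upload() -> None:
        nonlocal active, peak
        # Clients beyond the limit wait here, which they would
        # experience as a slow or delayed connection.
        with slots:
            with lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)  # stand-in for receiving one result
            with lock:
                active -= 1

    threads = [threading.Thread(target=handle_upload) for _ in range(uploads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

Calling `process_backlog(20, 4)` handles all 20 simulated uploads while at most 4 are in progress at any moment; the other clients simply wait their turn.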