Re: 140.163.4.200

Posted: Fri Sep 11, 2020 3:22 pm
by JohnChodera
Folks: The new work server (pllwskifah1.mskcc.org) ended up in a weird state in which it was not receiving WUs even though the WS appeared to be running normally. We've restarted it, and it's now receiving the backlog of results.

Please let us know if you notice this happening again! We'll also try to keep a close eye on it and try to figure out what went wrong here.

Apologies for this. It might be caused by the new big NFS storage we mounted on the WS to attempt to avoid out-of-space issues.

~ John Chodera // MSKCC

Re: 140.163.4.200

Posted: Fri Sep 11, 2020 3:38 pm
by rickoic
My backlog is slowly disappearing. I had 7 and now it's down to 3, so progress is being made. Thanks a lot for the fix.

Re: 140.163.4.200

Posted: Fri Sep 11, 2020 3:44 pm
by mgetz
JohnChodera wrote: Please let us know if you notice this happening again! We'll also try to keep a close eye on it and try to figure out what went wrong here.
~ John Chodera // MSKCC
Can we keep it at zero weight through the weekend unless someone is going to actively keep an eye on it? I'd rather not have my GPUs idled for two days if possible (the science must compute!).

Re: 140.163.4.200

Posted: Fri Sep 11, 2020 3:48 pm
by rickoic
Spoke too soon. This just happened a few minutes ago.

Edit: this problem resolved itself a few minutes later. Just slow.

Code:

15:40:05:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:40:05:WU04:FS01:Assigned to work server 140.163.4.200
15:40:05:WU04:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:40:05:WU04:FS01:Connecting to 140.163.4.200:8080
15:40:26:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:40:26:WU04:FS01:Connecting to 140.163.4.200:80
15:40:48:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:40:48:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:40:48:WU04:FS01:Assigned to work server 140.163.4.200
15:40:48:WU04:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:40:48:WU04:FS01:Connecting to 140.163.4.200:8080
15:41:09:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:41:09:WU04:FS01:Connecting to 140.163.4.200:80
15:41:31:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:41:47:WU02:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
15:41:47:WU02:FS01:0x22:Average performance: 83.8835 ns/day
15:41:48:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:41:48:WU04:FS01:Assigned to work server 140.163.4.200
15:41:48:WU04:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:41:48:WU04:FS01:Connecting to 140.163.4.200:8080
15:41:54:WU02:FS01:0x22:Saving result file ..\logfile_01.txt
15:41:54:WU02:FS01:0x22:Saving result file checkpointState.xml.bz2
15:41:55:WU02:FS01:0x22:Saving result file globals.csv
15:41:55:WU02:FS01:0x22:Saving result file positions.xtc
15:41:55:WU02:FS01:0x22:Saving result file science.log
15:41:55:WU02:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
15:41:56:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
15:41:56:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:13426 run:1456 clone:20 gen:4 core:0x22 unit:0x0000000812bc7d9a5f57207fe28d1881
15:41:56:WU02:FS01:Uploading 5.70MiB to 18.188.125.154
15:41:56:WU02:FS01:Connecting to 18.188.125.154:8080
15:42:02:WU02:FS01:Upload 55.94%
15:42:07:WU02:FS01:Upload complete
15:42:07:WU02:FS01:Server responded WORK_ACK (400)
15:42:07:WU02:FS01:Final credit estimate, 176071.00 points
15:42:07:WU02:FS01:Cleaning up
15:42:09:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:42:09:WU04:FS01:Connecting to 140.163.4.200:80
15:42:31:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:43:25:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:43:26:WU04:FS01:Assigned to work server 140.163.4.200
15:43:26:WU04:FS01:Requesting new work unit for slot 01: READY gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:43:26:WU04:FS01:Connecting to 140.163.4.200:8080
15:43:47:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:43:47:WU04:FS01:Connecting to 140.163.4.200:80
15:44:08:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:46:02:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:46:02:WU04:FS01:Assigned to work server 140.163.4.200
15:46:03:WU04:FS01:Requesting new work unit for slot 01: READY gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:46:03:WU04:FS01:Connecting to 140.163.4.200:8080
15:46:24:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:46:24:WU04:FS01:Connecting to 140.163.4.200:80
15:46:45:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
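For anyone who wants to summarize a log like the one above before posting it, here is a minimal Python sketch that counts the failed connection attempts per server and port. The regular expression is written against the ERROR lines shown in this post only; the function name `count_failures` and the abbreviated sample lines are illustrative assumptions, not part of FAHClient.

```python
import re
from collections import Counter

# Matches ERROR lines of the form seen in the log above, e.g.
# "15:40:48:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: ..."
FAILED = re.compile(
    r"^(\d\d:\d\d:\d\d):ERROR:WU\d+:FS\d+:Exception: "
    r"Failed to connect to ([\d.]+):(\d+)"
)


def count_failures(lines):
    """Return a Counter mapping 'ip:port' to the number of failed connects."""
    failures = Counter()
    for line in lines:
        match = FAILED.match(line)
        if match:
            _timestamp, host, port = match.groups()
            failures[f"{host}:{port}"] += 1
    return failures


# Abbreviated copies of lines from the log above (the long Windows error
# text is cut short here; the prefix is what the pattern looks at).
sample = [
    "15:40:26:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80",
    "15:40:48:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed",
    "15:41:31:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed",
    "15:42:07:WU02:FS01:Upload complete",
]

if __name__ == "__main__":
    for target, count in count_failures(sample).items():
        print(target, count)
```

Note that only the ERROR lines are counted. The WARNING lines about port 8080 do not name the server, so they are left out rather than guessed at.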

Re: 140.163.4.200

Posted: Fri Sep 11, 2020 4:14 pm
by JohnChodera
Looks like the server ended up not accepting 80/8080 again. We're going to keep it on weight 0 for a while to monitor.

~ John Chodera // MSKCC
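As a rough illustration of what "weight 0" means for donors, here is a small Python sketch. It assumes the assignment server picks a work server at random in proportion to its weight, which this thread does not spell out, so treat it as a mental model rather than the actual assignment logic. The names `pick_server`, `ws-a` and `ws-b` are hypothetical.

```python
import random


def pick_server(weights, rng=random):
    """Pick a server name at random, proportionally to its weight.

    Servers with weight 0 are never picked. Returns None if no server
    has a positive weight. This is an assumed model, not FAH code.
    """
    eligible = {name: w for name, w in weights.items() if w > 0}
    if not eligible:
        return None
    names = list(eligible)
    return rng.choices(names, weights=[eligible[n] for n in names], k=1)[0]


# Hypothetical weights: the server from this thread is parked at 0.
weights = {"140.163.4.200": 0, "ws-a": 1, "ws-b": 3}

if __name__ == "__main__":
    print(pick_server(weights))
```

Under this model, a server at weight 0 hands out no new work but can still receive results for WUs it has already assigned.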

Re: 140.163.4.200

Posted: Fri Sep 11, 2020 4:23 pm
by mgetz
JohnChodera wrote: Looks like the server ended up not accepting 80/8080 again. We're going to keep it on weight 0 for a while to monitor.

~ John Chodera // MSKCC
I have two WUs from it right now:
13436 (22, 5, 2)
13433 (63, 0, 2) completed successfully with no retries 157.664 ns/day

I'll report back when they finish on whether or not they upload.

Re: 140.163.4.200

Posted: Fri Sep 11, 2020 8:15 pm
by LazyDev
My two work units have since been uploaded. Thanks for fixing this.

Re: 140.163.4.200

Posted: Fri Sep 11, 2020 8:58 pm
by mgetz
project:13436 run:22 clone:5 gen:2 core:0x22 did upload... but it took forever. Something is seriously messed up with that server.

Re: 140.163.4.200

Posted: Sat Sep 12, 2020 2:53 am
by JohnChodera
Update: it looks like the issue is with an underperforming NFS mount. We're investigating.

Thanks for your patience!

~ John Chodera // MSKCC

Re: 140.163.4.200

Posted: Mon Dec 28, 2020 3:50 pm
by hhherby
I'm noticing that this server has a very slow connection that keeps timing out.

Re: 140.163.4.200

Posted: Sun Jan 03, 2021 7:46 pm
by hhherby
Can anyone even ping this server?

Pinging 140.163.4.200 with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.

Re: 140.163.4.200

Posted: Sun Jan 03, 2021 8:07 pm
by Joe_H
The server is behind the MSKCC firewall, which blocks pings. If you want to check whether the server is up, just enter its IP address into a browser window.
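If you would rather script that check than use a browser, a plain TCP connection attempt does the job, since it does not depend on ping. The Python sketch below tries port 8080 and then port 80, the same order the client log earlier in this thread shows. The function name `first_open_port`, the 5 second timeout and the injectable `connect` parameter are assumptions made for this example.

```python
import socket


def first_open_port(host, ports=(8080, 80), timeout=5.0,
                    connect=socket.create_connection):
    """Return the first port on host that accepts a TCP connection.

    Returns None if every attempt fails or times out. The connect
    argument exists only so the function can be tested offline.
    """
    for port in ports:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts.
            continue
        conn.close()
        return port
    return None


if __name__ == "__main__":
    port = first_open_port("140.163.4.200")
    if port is None:
        print("No TCP connection on 8080 or 80")
    else:
        print("Server accepted a connection on port", port)
```

A successful connection only shows that something is listening on that port. It says nothing about whether the server is accepting uploads at a reasonable speed.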

Re: 140.163.4.200

Posted: Mon Jan 04, 2021 7:17 pm
by TristanChen
Going to vent here a bit. The collection server (140.163.4.210) tied to this work server has been barely functional for half of December and is still 90% dead today.

I've got no fewer than 20 completed work units, some of them days old with 100+ retries, still waiting for the damned server to fix itself.

Can't admins at least set up some sort of redirect?! If 30% of my daily output is just going to be flushed down the drain anyway, then I might as well be running Nicehash...

Re: 140.163.4.200

Posted: Mon Jan 04, 2021 8:38 pm
by Neil-B
These problems still happen a bit, but things are better than they were from April to June last year. It is worth posting here, because the core team can pass the message on to the people who look after each affected server. Over weekends and holidays, issues can be more noticeable, and some of the servers are in different time zones, where getting a response can be trickier.

Re: 140.163.4.200

Posted: Tue Jan 05, 2021 2:48 am
by PantherX
FYI, the CS 140.163.4.210 has an uptime of about 1 hour, so it was recently rebooted. I am aware that work is being done on it to improve certain aspects.

BTW, redirection will not work with the current setup. The WU will try to reach either the WS or the CS (if one is defined), and both addresses are fixed at the moment the client downloads the WU. There's no way to dynamically update that information on the WU end.
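To make the point about redirection concrete, here is a small Python sketch of the idea. It is an illustrative model only, not the client's actual data structures: the `WorkUnit` class, its field names and the WS-before-CS order are assumptions for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: the addresses cannot change after download
class WorkUnit:
    work_server: str
    collection_server: Optional[str] = None


def upload_targets(wu: WorkUnit) -> List[str]:
    """Return the only servers this WU can ever upload to.

    Assumed order: the WS first, then the CS if one was defined.
    """
    targets = [wu.work_server]
    if wu.collection_server is not None:
        targets.append(wu.collection_server)
    return targets


if __name__ == "__main__":
    wu = WorkUnit("140.163.4.200", "140.163.4.210")
    print(upload_targets(wu))
```

Because the list is built only from what was stored with the WU at download time, a redirect set up on the server side later has nothing to hook into.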