Re: Send Errors - 155.247.164.213 & .214

Posted: Fri Apr 17, 2020 2:12 pm
by astronomyrat
Does it make sense to configure my firewall to block requests from the client to 155.247.164.213, so that I no longer receive WUs from this server, since its upload server (155.247.166.219) seems to be chronically overloaded? Or would that just leave the client waiting for a reply that never comes?

You will agree that it doesn't make sense to waste electricity processing WUs that will be trashed in the end because the upload server's bandwidth has been overloaded for days. As a side effect, blocking it would reduce the number of requests to this server, allowing other clients to connect more easily.

Re: Send Errors - 155.247.164.213 & .214

Posted: Fri Apr 17, 2020 4:23 pm
by TheWolf
Still no luck.

Code:

16:19:09:WU03:FS00:0xa7:Calling: mdrun -s frame16.tpr -o frame16.trr -cpi state.cpt -cpt 3 -nt 4
16:19:09:WU03:FS00:0xa7:Steps: first=0 total=250000
16:19:14:WU03:FS00:0xa7:Completed 170018 out of 250000 steps (68%)
16:19:30:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
16:19:30:WU02:FS00:Connecting to 155.247.164.213:80
16:19:30:WARNING:WU01:FS00:WorkServer connection failed on port 8080 trying 80
16:19:30:WU01:FS00:Connecting to 155.247.164.213:80
16:19:51:WARNING:WU02:FS00:Exception: Failed to send results to work server: Failed to connect to 155.247.164.213:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
16:19:51:WU02:FS00:Trying to send results to collection server
16:19:51:WARNING:WU01:FS00:Exception: Failed to send results to work server: Failed to connect to 155.247.164.213:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
16:19:51:WU02:FS00:Uploading 8.31MiB to 155.247.166.219
16:19:51:WU01:FS00:Trying to send results to collection server
16:19:51:WU02:FS00:Connecting to 155.247.166.219:8080
16:19:51:WU01:FS00:Uploading 8.33MiB to 155.247.166.219
16:19:51:WU01:FS00:Connecting to 155.247.166.219:8080
16:20:10:WU02:FS00:Upload 0.75%
16:20:10:ERROR:WU02:FS00:Exception: Transfer failed
16:20:10:WU01:FS00:Upload 0.75%
16:20:10:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14638 run:272 clone:1 gen:3 core:0xa7 unit:0x000000039bf7a4d55e8b1f06003b13dc
16:20:10:ERROR:WU01:FS00:Exception: Transfer failed
16:20:10:WU02:FS00:Uploading 8.31MiB to 155.247.164.213
16:20:10:WU02:FS00:Connecting to 155.247.164.213:8080
16:20:11:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:14643 run:403 clone:1 gen:0 core:0xa7 unit:0x000000009bf7a4d55e8b9f634ceb52d4
16:20:11:WU01:FS00:Uploading 8.33MiB to 155.247.164.213
16:20:11:WU01:FS00:Connecting to 155.247.164.213:8080
16:20:31:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
16:20:31:WU02:FS00:Connecting to 155.247.164.213:80
16:20:32:WARNING:WU01:FS00:WorkServer connection failed on port 8080 trying 80
16:20:32:WU01:FS00:Connecting to 155.247.164.213:80
16:20:53:WARNING:WU01:FS00:Exception: Failed to send results to work server: Failed to connect to 155.247.164.213:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
16:20:53:WU01:FS00:Trying to send results to collection server
16:20:53:WU01:FS00:Uploading 8.33MiB to 155.247.166.219
16:20:53:WU01:FS00:Connecting to 155.247.166.219:8080
16:21:48:WU02:FS00:Upload 0.75%
16:21:48:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
16:21:48:WU02:FS00:Trying to send results to collection server
16:21:48:WU02:FS00:Uploading 8.31MiB to 155.247.166.219
16:21:48:WU02:FS00:Connecting to 155.247.166.219:8080
16:21:54:WU01:FS00:Upload 0.75%
16:21:54:ERROR:WU01:FS00:Exception: Transfer failed
16:21:54:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:14643 run:403 clone:1 gen:0 core:0xa7 unit:0x000000009bf7a4d55e8b9f634ceb52d4
16:21:54:WU01:FS00:Uploading 8.33MiB to 155.247.164.213
16:21:54:WU01:FS00:Connecting to 155.247.164.213:8080

Re: Send Errors - 155.247.164.213 & .214

Posted: Fri Apr 17, 2020 4:48 pm
by TheWolf
The oldest WU times out today, the other tomorrow. Both are 500-point WUs. A waste of my time and money. My power bill is high enough as it is, and at this point it looks like I'm using extra power to do this work for no good reason. As you know, computers use a great deal more power when they're under load 24/7, 365 days a year.

Re: Send Errors - 155.247.164.213 & .214

Posted: Sat Apr 18, 2020 1:44 am
by TheWolf
In 24 more minutes, once the WU I am working on now has completed and uploaded, I will delete the slot and someone else can waste their time with these two failed WUs.

Re: Send Errors - 155.247.164.213 & .214

Posted: Sat Apr 18, 2020 1:00 pm
by Neil-B
I understand how you feel (slow/delayed uploads are no one's happy place) - and of course the choice is yours … but the actual power usage once the WU has been completed and is just waiting to upload is minimal tbh … the WU still has scientific value beyond the Timeout, and it is quite likely that your system might be the first to return the WU even after this point, thereby allowing the science to continue earlier.

Yes, it is frustrating that uploads are not getting through immediately on some servers at the moment - you may not believe it, but it is even more so for the researchers whose work is being held up by these issues - but the technical teams are working to improve this as quickly as they can.

You are perfectly right to choose what you want to do - although dumping the WUs after completing them means the time/money/power has then truly been wasted :(

Re: Send Errors - 155.247.164.213 & .214

Posted: Sat Apr 18, 2020 1:43 pm
by Arnold0
Hi, I have a WU that won't upload, but it is not on 213 & 214 but on 213 & 166.219. It starts uploading a tiny bit, then fails every time at 1.51%. When I open these IPs in a web browser it takes a long time, but the Work Server web page appears, so these servers are not down, but they won't accept uploads. It's a CPU WU that's now only worth 500 points, and it expires in two days, on the 20th.

Code:

*********************** Log Started 2020-04-18T13:09:43Z ***********************
13:09:44:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14638 run:343 clone:1 gen:0 core:0xa7 unit:0x000000009bf7a4d55e8b1f025355b47f
13:09:44:WU02:FS00:Uploading 8.26MiB to 155.247.164.213
13:09:44:WU02:FS00:Connecting to 155.247.164.213:8080
13:10:05:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
13:10:05:WU02:FS00:Connecting to 155.247.164.213:80
13:10:06:WU02:FS00:Upload 0.76%
13:14:38:WU02:FS00:Upload 1.51%
13:14:38:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
13:14:38:WU02:FS00:Trying to send results to collection server
13:14:38:WU02:FS00:Uploading 8.26MiB to 155.247.166.219
13:14:38:WU02:FS00:Connecting to 155.247.166.219:8080
13:18:48:WU02:FS00:Upload 1.51%
13:18:48:ERROR:WU02:FS00:Exception: Transfer failed
13:18:48:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14638 run:343 clone:1 gen:0 core:0xa7 unit:0x000000009bf7a4d55e8b1f025355b47f
13:18:48:WU02:FS00:Uploading 8.26MiB to 155.247.164.213
13:18:48:WU02:FS00:Connecting to 155.247.164.213:8080
13:19:08:WU02:FS00:Upload 1.51%
13:19:08:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
13:19:08:WU02:FS00:Trying to send results to collection server
13:19:08:WU02:FS00:Uploading 8.26MiB to 155.247.166.219
13:19:08:WU02:FS00:Connecting to 155.247.166.219:8080
13:19:29:WU02:FS00:Upload 1.51%
13:19:29:ERROR:WU02:FS00:Exception: Transfer failed
13:19:48:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14638 run:343 clone:1 gen:0 core:0xa7 unit:0x000000009bf7a4d55e8b1f025355b47f
13:19:48:WU02:FS00:Uploading 8.26MiB to 155.247.164.213
13:19:48:WU02:FS00:Connecting to 155.247.164.213:8080
13:20:10:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
13:20:10:WU02:FS00:Connecting to 155.247.164.213:80
13:20:13:WU02:FS00:Upload 0.76%
13:24:39:WU02:FS00:Upload 1.51%
13:24:39:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
13:24:39:WU02:FS00:Trying to send results to collection server
13:24:39:WU02:FS00:Uploading 8.26MiB to 155.247.166.219
13:24:39:WU02:FS00:Connecting to 155.247.166.219:8080
13:25:15:WU02:FS00:Upload 1.51%
13:25:15:ERROR:WU02:FS00:Exception: Transfer failed
13:25:15:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14638 run:343 clone:1 gen:0 core:0xa7 unit:0x000000009bf7a4d55e8b1f025355b47f
13:25:15:WU02:FS00:Uploading 8.26MiB to 155.247.164.213
13:25:15:WU02:FS00:Connecting to 155.247.164.213:8080
13:27:01:WU02:FS00:Upload 1.51%
13:27:01:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
13:27:01:WU02:FS00:Trying to send results to collection server
13:27:01:WU02:FS00:Uploading 8.26MiB to 155.247.166.219
13:27:01:WU02:FS00:Connecting to 155.247.166.219:8080
13:27:21:WU02:FS00:Upload 1.51%
13:27:21:ERROR:WU02:FS00:Exception: Transfer failed
13:27:52:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14638 run:343 clone:1 gen:0 core:0xa7 unit:0x000000009bf7a4d55e8b1f025355b47f
13:27:52:WU02:FS00:Uploading 8.26MiB to 155.247.164.213
13:27:52:WU02:FS00:Connecting to 155.247.164.213:8080
13:28:14:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
13:28:14:WU02:FS00:Connecting to 155.247.164.213:80
13:28:35:WARNING:WU02:FS00:Exception: Failed to send results to work server: Failed to connect to 155.247.164.213:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
13:28:35:WU02:FS00:Trying to send results to collection server
13:28:35:WU02:FS00:Uploading 8.26MiB to 155.247.166.219
13:28:35:WU02:FS00:Connecting to 155.247.166.219:8080
13:32:45:WU02:FS00:Upload 1.51%
13:32:45:ERROR:WU02:FS00:Exception: Transfer failed
13:32:46:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14638 run:343 clone:1 gen:0 core:0xa7 unit:0x000000009bf7a4d55e8b1f025355b47f
13:32:46:WU02:FS00:Uploading 8.26MiB to 155.247.164.213
13:32:46:WU02:FS00:Connecting to 155.247.164.213:8080
13:33:35:WU02:FS00:Upload 1.51%
13:33:35:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
13:33:35:WU02:FS00:Trying to send results to collection server
13:33:35:WU02:FS00:Uploading 8.26MiB to 155.247.166.219
13:33:35:WU02:FS00:Connecting to 155.247.166.219:8080
13:34:09:WU02:FS00:Upload 1.51%
13:34:09:ERROR:WU02:FS00:Exception: Transfer failed
13:39:37:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14638 run:343 clone:1 gen:0 core:0xa7 unit:0x000000009bf7a4d55e8b1f025355b47f
13:39:37:WU02:FS00:Uploading 8.26MiB to 155.247.164.213
13:39:37:WU02:FS00:Connecting to 155.247.164.213:8080
13:39:58:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
13:39:58:WU02:FS00:Connecting to 155.247.164.213:80
13:40:19:WARNING:WU02:FS00:Exception: Failed to send results to work server: Failed to connect to 155.247.164.213:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
13:40:19:WU02:FS00:Trying to send results to collection server
13:40:19:WU02:FS00:Uploading 8.26MiB to 155.247.166.219
13:40:19:WU02:FS00:Connecting to 155.247.166.219:8080
13:40:42:WU02:FS00:Upload 1.51%
13:40:42:ERROR:WU02:FS00:Exception: Transfer failed
Mod Edit: Change Quote Tags To Code Tags - PantherX
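
For anyone wanting to summarise a log like the one above rather than read it line by line, here is a minimal Python sketch. It assumes the line layout seen in these posts (`HH:MM:SS:WUnn:FSnn:Connecting to IP:port` and `HH:MM:SS:WARNING|ERROR:WUnn:FSnn:Exception: ...`). The function name `count_failures` and the idea of attributing each exception to the last server that WU connected to are illustrative assumptions, not something the client itself does.

```python
import re
from collections import Counter

# Matches lines such as "13:09:44:WU02:FS00:Connecting to 155.247.164.213:8080"
CONNECT_RE = re.compile(r"^\d\d:\d\d:\d\d:(WU\d+):FS\d+:Connecting to ([\d.]+):(\d+)$")
# Matches WARNING/ERROR lines that report an exception for a work unit
FAIL_RE = re.compile(r"^\d\d:\d\d:\d\d:(?:WARNING|ERROR):(WU\d+):FS\d+:Exception: (.*)$")


def count_failures(lines):
    """Count exceptions per server IP.

    Each exception is attributed to the server the same WU most recently
    connected to (an assumption made for this sketch).
    """
    last_server = {}      # work unit id -> most recent server IP
    failures = Counter()  # server IP -> number of exceptions seen
    for line in lines:
        line = line.strip()
        m = CONNECT_RE.match(line)
        if m:
            last_server[m.group(1)] = m.group(2)
            continue
        m = FAIL_RE.match(line)
        if m and m.group(1) in last_server:
            failures[last_server[m.group(1)]] += 1
    return failures


# A few lines copied from the log above, used as sample input.
sample = """\
13:09:44:WU02:FS00:Connecting to 155.247.164.213:8080
13:10:05:WARNING:WU02:FS00:WorkServer connection failed on port 8080 trying 80
13:10:05:WU02:FS00:Connecting to 155.247.164.213:80
13:14:38:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
13:14:38:WU02:FS00:Connecting to 155.247.166.219:8080
13:18:48:ERROR:WU02:FS00:Exception: Transfer failed
""".splitlines()

result = count_failures(sample)
print(dict(result))
```

The "connection failed on port 8080 trying 80" warning is deliberately not counted, since it is a port fallback rather than a failed upload attempt.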

Re: Send Errors - 155.247.164.213 & .214

Posted: Sat Apr 18, 2020 1:49 pm
by Neil-B
There are reports that some WUs get through - but there may be a big backlog fighting for bandwidth :(

If the WU does reach the Expiration date then unfortunately your patience and efforts (and resources) on this WU will have been for nought :( - but not for want of everyone trying … If you got this WU first and someone else got it after Timeout, they may well be stuck in the same queue - by giving it a chance to upload up until the Expiration date you have maximised the chances of it being accepted.

You should find that once a WU passes Expiration it will automatically be deleted.

Re: Send Errors - 155.247.164.213 & .214

Posted: Sat Apr 18, 2020 3:57 pm
by phrenq
Hello,

I also have 2 WUs stuck uploading to 155.247.164.213 / 155.247.166.219, one for 4 days, the other for more than one day.

Code:

11:34:04:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14640 run:1657 clone:1 gen:8 core:0xa7 unit:0x000000089bf7a4d55e8b9f706b354299
11:34:04:WU02:FS00:Uploading 8.31MiB to 155.247.164.213
11:34:04:WU02:FS00:Connecting to 155.247.164.213:8080
11:50:35:WU02:FS00:Upload 0.75%
11:50:35:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
11:50:35:WU02:FS00:Trying to send results to collection server
11:50:35:WU02:FS00:Uploading 8.31MiB to 155.247.166.219
11:50:35:WU02:FS00:Connecting to 155.247.166.219:8080
12:06:07:WU02:FS00:Upload 0.75%
12:06:07:ERROR:WU02:FS00:Exception: Transfer failed

13:09:21:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14367 run:1155 clone:3 gen:21 core:0xa7 unit:0x000000189bf7a4d55e84b0fa151cf4c8
13:09:21:WU00:FS00:Uploading 6.45MiB to 155.247.164.213
13:09:21:WU00:FS00:Connecting to 155.247.164.213:8080
13:25:30:WU00:FS00:Upload 0.97%
13:25:31:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
13:25:31:WU00:FS00:Trying to send results to collection server
13:25:31:WU00:FS00:Uploading 6.45MiB to 155.247.166.219
13:25:31:WU00:FS00:Connecting to 155.247.166.219:8080
13:41:56:WU00:FS00:Upload 0.97%
13:41:56:ERROR:WU00:FS00:Exception: Transfer failed

Where can I find information about expiration? And is there a way to clean them up if they continue to fail?

Re: Send Errors - 155.247.164.213 & .214

Posted: Sat Apr 18, 2020 5:25 pm
by davidcoton
Expiration is given in the right-hand pane of the Status tab of FAHControl. Once they expire, they will be deleted by the client. But let's hope they upload before then.
Try a Pause then Fold on the relevant slot; that will reset the retry delay. Just don't do it too often...

Re: Send Errors - 155.247.164.213 & .214

Posted: Sat Apr 18, 2020 5:33 pm
by Arnold0
I updated my client and, though I don't know if it is related, the WU has now uploaded successfully. I actually have another one that finished folding today and won't upload, but it is on a different server: it uploads, then gets a PLEASE_WAIT (464) error.

Re: Send Errors - 155.247.164.213 & .214

Posted: Thu Apr 23, 2020 12:29 pm
by MM54
I've got two completed WUs that have been trying to upload since yesterday afternoon to 155.247.164.213 and 155.247.166.219 (same IPs on both WUs) but these servers aren't showing as being down in the status. Any ideas on this or is it just more of things getting overloaded?

Code:

12:06:46:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:14640 run:842 clone:1 gen:13 core:0xa7 unit:0x0000000f9bf7a4d55e8b9f9dc43a4470
12:06:46:WU00:FS00:Uploading 8.31MiB to 155.247.164.213
12:06:46:WU00:FS00:Connecting to 155.247.164.213:8080
12:11:27:WU00:FS00:Upload 1.50%
12:11:27:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
12:11:27:WU00:FS00:Trying to send results to collection server
12:11:27:WU00:FS00:Uploading 8.31MiB to 155.247.166.219
12:11:27:WU00:FS00:Connecting to 155.247.166.219:8080
12:17:36:WU00:FS00:Upload 1.50%
12:17:36:ERROR:WU00:FS00:Exception: Transfer failed

Code:

12:16:49:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:14365 run:97 clone:0 gen:37 core:0xa7 unit:0x0000002b9bf7a4d55e81239506b80678
12:16:49:WU01:FS00:Uploading 6.49MiB to 155.247.164.213
12:16:49:WU01:FS00:Connecting to 155.247.164.213:8080
12:17:08:WU01:FS00:Upload 1.93%
12:17:08:WARNING:WU01:FS00:Exception: Failed to send results to work server: Transfer failed
12:17:08:WU01:FS00:Trying to send results to collection server
12:17:08:WU01:FS00:Uploading 6.49MiB to 155.247.166.219
12:17:08:WU01:FS00:Connecting to 155.247.166.219:8080
12:20:53:WU01:FS00:Upload 1.93%
12:20:54:ERROR:WU01:FS00:Exception: Transfer failed
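
To check reachability independently of the status page, a minimal Python sketch along these lines might help. It only tests whether a TCP connection can be opened on the two ports that appear in the logs (8080 and 80). As noted earlier in the thread, a server can answer and still fail to accept uploads, so a successful connect does not mean an upload will go through. The helper names `can_connect` and `check_servers` and the 5-second timeout are assumptions for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False


def check_servers(hosts, ports=(8080, 80), timeout=5.0):
    """Map each (host, port) pair to whether a TCP connection could be opened."""
    return {(h, p): can_connect(h, p, timeout) for h in hosts for p in ports}


# Example call (performs real network connections, so it is left commented out):
# print(check_servers(["155.247.164.213", "155.247.166.219"]))
```

Running this repeatedly adds load to servers that are already struggling, so an occasional check is kinder than a loop.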

Re: Send Errors - 155.247.164.213 & .214

Posted: Thu Apr 23, 2020 1:29 pm
by Neil-B
Thanks for the report ... These servers are fairly loaded and have been overloaded on and off over the last week or so ... If they are suffering again someone will no doubt look into it to check there is no bigger issue, but for the moment all you can do is let your client keep retrying until it gets the uploads through.

Re: Send Errors - 155.247.164.213 & .214

Posted: Sun Apr 26, 2020 4:28 am
by intrepidpursuit
I am having a similar problem. For me it is WU 16435 uploading to 3.21.157.11. My server has downloaded this 3 times to my M2000 GPU, and each time it takes 17 hours to complete and then is never uploaded. My other GPU has uploaded WUs to this same server during the same time period, and I did manage to get my M2000 to process another WU by removing it from and re-adding it to the configuration; that other WU uploaded fine. Now it has downloaded 16435 again and I'm going to spend another 17 hours processing it for it to go nowhere.

I'm not trying to complain, I'm donating GPU time that would not otherwise be used. But if this GPU and the power it eats is just continuously wasted then I might as well take it offline. Some ability to reject WUs like this that fail repeatedly would be very nice.

Re: Send Errors - 155.247.164.213 & .214

Posted: Sun Apr 26, 2020 8:12 am
by Neil-B
If the team knew which servers were going to have issues then they would sort them in advance … As frustrating as it is to the folders, this type of issue is more frustrating to the scientists as it holds up the science … These types of delays are simply due to the rapid growth of FAH to meet the demand of a significantly increased folding community … this increase is still stressing/overloading servers and it takes time to resolve such issues - and it is now the weekend, and lockdowns are still in force, so we just need to be patient … People will be trying to get this fixed as quickly as they can.

Yes, it is frustrating … but I'm not sure rejecting WUs that fail to upload could actually work … how would the system know in advance which one is going to have comms/overload issues? … You have posted to a thread about one pair of servers concerning delays on a different server - it just shows these things are happening across the infrastructure … some of the servers have had more issues than others, but over the last month most have had some form of interruption and downtime … It is improving, even though it might not seem that way - there are still issues, but the throughput behind those issues has scaled up by a massive amount.

Given time things will stabilise and this type of thread will become a thing of the past … If this is just the server in question being overloaded your WUs may upload as/when the server gets a free slot and your client catches it … but if it is an issue with the server then for now I fear it will be at least Monday before this is resolved.