Re: 128.252.203.10 problem or WU?

Posted: Wed May 06, 2020 5:24 am
by level6
Ah, excellent info, thanks, PantherX! That is great news indeed.

I definitely have more lurking to do to understand these details better.

Posted: Wed May 06, 2020 5:31 am
by PantherX
Oussebon wrote:...Anything to be done?
Apart from reporting it in the Forum, where it is then raised with the researcher, not much :( You can simply leave the client running and hope that the completed WU is uploaded before its expiration date. Beyond that, there's not much one can do unless the researchers ask for something specific.

Posted: Wed May 06, 2020 12:18 pm
by Oussebon
GDF wrote:This is only anecdotal: it worked for me, but it might have been complete coincidence. I paused the slot with the problem, waited for the server to reboot (which you can see on the serverstats page by watching the uptime roll back to zero), then restarted the slot. The upload went right through.
Thanks for the tip. The server was just restarted, but sadly no joy.

Fails at 0.23% as above.

Previously, it would occasionally start sending, get as far as 0.23%, and fail; after that, the server would "actively refuse" the connection on every subsequent attempt.

Now, it starts uploading every time it is meant to. It just always fails at 0.23%.

Same messages as in the previously posted logs (Transfer failed).
PantherX wrote:
Oussebon wrote:...Anything to be done?
Apart from reporting it in the Forum, where it is then raised with the researcher, not much :( You can simply leave the client running and hope that the completed WU is uploaded before its expiration date. Beyond that, there's not much one can do unless the researchers ask for something specific.
Thanks, and sorry, I somehow missed your post at first! As things have changed a little (first hurdle overcome, second hurdle still in the way), I hope the update helps them narrow it down. It would be a shame to waste WUs, after all.

Further edit: Scratch the above; it's back to the "actively refused" error message, as in my last post.

Posted: Wed May 06, 2020 3:53 pm
by GDF
PantherX wrote:I do understand your POV, and it negatively impacts everyone involved, both the researchers and the donors. However, considering that there are multiple labs involved (https://foldingathome.org/about/the-fol ... onsortium/) across the globe, in various countries with various lockdown policies, it would take a bit of time even on a "good" day. In a pandemic situation it is a lot harder, but no one has given up; instead, they have doubled down and are working to improve various aspects to ensure that it is fixed. Sometimes labs have to involve their internal IT department, which can add further delay if the problem is a university infrastructure limitation such as internet or electricity.
Thanks for the welcome, and I understand the problem intimately, as remote server management is something I do as part of my day job. Right now I'm having to get explicit permission to be in the building where my hardware is located, so I'm grateful that nearly all of the process can be managed through the net. I'm also glad I don't have to hear about problems through word of mouth in a public forum!

I appreciate all the effort that is going into this and I'm happy to be able to play a tiny part.

Posted: Wed May 06, 2020 4:59 pm
by kalamai2
Failing for me as well, for at least a few hours now (log freshly started after a restart attempt). It does seem that redundancy/scaling on the work collection servers would be helpful.

*********************** Log Started 2020-05-06T16:37:53Z ***********************
16:37:53:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11760 run:0 clone:6289 gen:29 core:0x22 unit:0x0000003180fccb0a5e6f0a8922c6cfcd
16:37:53:WU00:FS01:Uploading 23.06MiB to 128.252.203.10
16:37:53:WU00:FS01:Connecting to 128.252.203.10:8080
16:38:15:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
16:38:15:WU00:FS01:Connecting to 128.252.203.10:80
16:38:36:WARNING:WU00:FS01:Exception: Failed to send results to work server: Failed to connect to 128.252.203.10:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
16:38:37:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11760 run:0 clone:6289 gen:29 core:0x22 unit:0x0000003180fccb0a5e6f0a8922c6cfcd
16:38:37:WU00:FS01:Uploading 23.06MiB to 128.252.203.10
16:38:37:WU00:FS01:Connecting to 128.252.203.10:8080
16:38:58:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
16:38:58:WU00:FS01:Connecting to 128.252.203.10:80
16:39:19:WARNING:WU00:FS01:Exception: Failed to send results to work server: Failed to connect to 128.252.203.10:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
16:39:37:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11760 run:0 clone:6289 gen:29 core:0x22 unit:0x0000003180fccb0a5e6f0a8922c6cfcd
16:39:37:WU00:FS01:Uploading 23.06MiB to 128.252.203.10
16:39:37:WU00:FS01:Connecting to 128.252.203.10:8080
16:41:06:WU00:FS01:Upload 0.54%
16:41:06:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
16:41:14:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11760 run:0 clone:6289 gen:29 core:0x22 unit:0x0000003180fccb0a5e6f0a8922c6cfcd
16:41:14:WU00:FS01:Uploading 23.06MiB to 128.252.203.10
16:41:14:WU00:FS01:Connecting to 128.252.203.10:8080
16:41:30:WU00:FS01:Upload 0.27%
16:42:38:WU00:FS01:Upload 0.54%
16:42:38:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
16:43:52:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11760 run:0 clone:6289 gen:29 core:0x22 unit:0x0000003180fccb0a5e6f0a8922c6cfcd
16:43:52:WU00:FS01:Uploading 23.06MiB to 128.252.203.10
16:43:52:WU00:FS01:Connecting to 128.252.203.10:8080
16:45:29:WU00:FS01:Upload 0.54%
16:45:29:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
16:45:30:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11760 run:0 clone:6289 gen:29 core:0x22 unit:0x0000003180fccb0a5e6f0a8922c6cfcd
16:45:30:WU00:FS01:Uploading 23.06MiB to 128.252.203.10
16:45:30:WU00:FS01:Connecting to 128.252.203.10:8080
16:45:49:WU00:FS01:Upload 0.54%
16:45:49:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
16:47:07:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11760 run:0 clone:6289 gen:29 core:0x22 unit:0x0000003180fccb0a5e6f0a8922c6cfcd
16:47:07:WU00:FS01:Uploading 23.06MiB to 128.252.203.10
16:47:07:WU00:FS01:Connecting to 128.252.203.10:8080
16:47:13:WU00:FS01:Upload 20.33%
16:47:19:WU00:FS01:Upload 84.83%
16:47:54:WARNING:WU00:FS01:Exception: Failed to send results to work server: 10002: Received short response, expected 512 bytes, got 0
16:49:44:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11760 run:0 clone:6289 gen:29 core:0x22 unit:0x0000003180fccb0a5e6f0a8922c6cfcd
16:49:44:WU00:FS01:Uploading 23.06MiB to 128.252.203.10
16:49:44:WU00:FS01:Connecting to 128.252.203.10:8080
16:53:01:WU00:FS01:Upload 1.63%
16:53:01:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
16:53:58:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11760 run:0 clone:6289 gen:29 core:0x22 unit:0x0000003180fccb0a5e6f0a8922c6cfcd
16:53:59:WU00:FS01:Uploading 23.06MiB to 128.252.203.10
16:53:59:WU00:FS01:Connecting to 128.252.203.10:8080
16:54:01:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
16:54:01:WU00:FS01:Connecting to 128.252.203.10:80
16:54:04:WARNING:WU00:FS01:Exception: Failed to send results to work server: Failed to connect to 128.252.203.10:80: No connection could be made because the target machine actively refused it.

Posted: Wed May 06, 2020 7:05 pm
by kalamai2
This finally uploaded for me. I noticed a fresh restart in the server stats and had also just updated my client to version .13. I imagine it was the server restart that got it working, but who knows :)

Thanks.

Mike

Posted: Sat May 09, 2020 9:50 am
by Jeanne de Flandre
Hi,

Uploads to 128.252.203.10 for Project 11760 have kept failing since 2020/05/06 21:43.

The estimated credit has already dropped to the base credit. I don't care about my 'lost credit', but I do want the server to receive the result.

*********************** Log Started 2020-05-06T17:20:44Z ***********************
...
21:43:03:WU00:FS01:Uploading 23.09MiB to 128.252.203.10
21:43:03:WU00:FS01:Connecting to 128.252.203.10:8080
21:43:18:WU00:FS01:Upload 0.27%
21:43:42:WU00:FS01:Upload 0.54%
21:43:43:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
21:43:43:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11760 run:0 clone:1179 gen:25 core:0x22 unit:0x0000003080fccb0a5e6d7cd5382edcbf
...
*********************** Log Started 2020-05-09T02:19:53Z ***********************
...
05:54:57:WU00:FS01:Uploading 23.09MiB to 128.252.203.10
05:54:57:WU00:FS01:Connecting to 128.252.203.10:8080
05:55:12:WU00:FS01:Upload 0.54%
05:55:12:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed

Posted: Sat May 09, 2020 11:09 am
by PantherX
Welcome to the F@H Forum, Jeanne de Flandre,

Please note that the server 128.252.203.10 has an uptime of ~15 minutes, which means it was recently restarted. Hopefully your completed WU will be accepted soon :)

Posted: Sat May 09, 2020 11:35 am
by Jeanne de Flandre
Thanks for your reply, PantherX.
Each time I check https://apps.foldingathome.org/serverstats, 128.252.203.10 *always* shows quite a short uptime, and now it says "a few seconds". So I guess it keeps rebooting.

Posted: Sat May 09, 2020 8:00 pm
by PantherX
Thanks for that, I have informed the researcher so let's see what happens :)

Posted: Sat May 09, 2020 10:30 pm
by bruce
anandhanju wrote:Thanks for your reports. The necessary folks have been notified and they will be looking into this.
Since the start of the COVID surge, the demand for new assignments has regularly exceeded the available bandwidth on FAH's servers. As fast as FAH could add more servers, the demand increased even more. Code was added to the Assignment Server to limit this excess to something on the order of what can actually be useful.

From my observations, there's nothing really limiting the bandwidth of the WUs being returned. When a FAHClient decides it's time to upload a result, it proceeds without any knowledge of how much inbound bandwidth is available. I frequently see very slow upload speeds, which IMHO indicates the inbound path is (probably) saturated. I can't think of a good way to manage that bandwidth other than to let an increasing percentage of upload transactions fail and redirect them to a Collection Server.

Rebooting the server, of course, terminates the active uploads, and then it takes some time for the backlog of clients to retry. That's not a very good system, but as I said, I can't think of a better solution.

Comments anyone?
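bruce's idea of letting an increasing share of uploads fail and redirecting them to a Collection Server is a form of load shedding. The Python below is a minimal sketch of that idea, not FAH's actual server code: the function names, the 80% utilization threshold, and the linear ramp are all assumptions chosen for illustration, and the sketch takes the link utilization as a given input rather than measuring it.

```python
import random


def redirect_probability(utilization: float, threshold: float = 0.8) -> float:
    """Hypothetical policy: share of new uploads to turn away from the WS.

    Nothing is shed while utilization is at or below `threshold` (an assumed
    value); above it the share rises linearly, reaching 1.0 at saturation.
    """
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    u = min(max(utilization, 0.0), 1.0)  # clamp to [0, 1]
    if u <= threshold:
        return 0.0
    return (u - threshold) / (1.0 - threshold)


def route_upload(utilization: float, rng: random.Random) -> str:
    """Accept the upload on the Work Server ("WS") or redirect it ("CS")."""
    return "CS" if rng.random() < redirect_probability(utilization) else "WS"


rng = random.Random(42)  # fixed seed so the demo is repeatable
routed = [route_upload(0.9, rng) for _ in range(10_000)]
print(routed.count("CS") / len(routed))  # close to 0.5 at 90% load
```

With these assumed defaults, a link at 90% utilization would redirect roughly half of new uploads, while anything at or below 80% is accepted normally. Refusing uploads up front, rather than letting them stall at a fraction of a percent, would at least free the client to try the CS quickly.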

Posted: Sat May 09, 2020 10:45 pm
by PantherX
Please note that the server (128.252.203.10) has been poked, and it seems to be stable now. Let's see if your WUs now upload without issues :)
bruce wrote:...I can't think of a good way to manage that bandwidth other than to let an increasing percentage of upload transactions fail and redirect them to a Collection Server...
Would it be possible to alternate the return target when WUs are assigned, so that, say, WU A returns to the WS and WU B returns to the CS? In other words, alternate the primary and secondary return servers. That way, the initial load on the WS would be "halved", but the CS would go from being a backup to being a production server. Plus, the links between the WS and CS would then be used continuously during production rather than only for backup.
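The alternating-return suggestion could look something like the minimal Python sketch below. It assumes, hypothetically, that each assignment carries a sequential index; `return_order` and the "WS"/"CS" labels are illustrative names, not part of FAHClient or the FAH server software.

```python
from collections import Counter
from typing import Tuple


def return_order(wu_index: int, ws: str, cs: str) -> Tuple[str, str]:
    """Hypothetical policy: (primary, fallback) upload targets for one WU.

    Even-indexed WUs try the Work Server first and fall back to the
    Collection Server; odd-indexed WUs do the reverse.
    """
    return (ws, cs) if wu_index % 2 == 0 else (cs, ws)


# Over many assignments, each server is primary for half of the WUs.
primaries = Counter(return_order(i, "WS", "CS")[0] for i in range(1000))
print(primaries["WS"], primaries["CS"])  # 500 500
```

The trade-off mentioned above shows up directly: the CS now takes half of the first-attempt uploads instead of only the overflow, so it would have to be provisioned as a production server.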

Posted: Sun May 10, 2020 6:00 am
by Jeanne de Flandre
Thanks for your intervention. It has finally uploaded :)

******************************* Date: 2020-05-09 *******************************
23:54:57:WU00:FS01:Uploading 23.09MiB to 128.252.203.10
23:54:57:WU00:FS01:Connecting to 128.252.203.10:8080
23:55:03:WU00:FS01:Upload 2.98%
...
23:58:15:WU00:FS01:Upload 98.51%
23:58:19:WU00:FS01:Upload complete
23:58:19:WU00:FS01:Server responded WORK_ACK (400)
23:58:19:WU00:FS01:Final credit estimate, 12884.00 points
23:58:19:WU00:FS01:Cleaning up

Posted: Sun May 10, 2020 6:32 am
by PantherX
That's great to hear! Thanks for the confirmation :)