WUs not being sent to work server 3.*
WUs not being sent to work server 3.*
Hi team,
I'm getting this repeatedly:
09:55:14:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:16435 run:2837 clone:0 gen:1 core:0x22 unit:0x0000000203854c135e9a4ef707e6e5bb
09:55:14:WU00:FS01:Uploading 133.91MiB to 3.133.76.19
09:55:14:WU00:FS01:Connecting to 3.133.76.19:8080
09:55:14:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
09:55:14:WU00:FS01:Trying to send results to collection server
09:55:14:WU00:FS01:Uploading 133.91MiB to 3.21.157.11
09:55:14:WU00:FS01:Connecting to 3.21.157.11:8080
09:55:14:ERROR:WU00:FS01:Exception: Transfer failed
This means both the work server and the collection server are down. The client then enters its backoff procedure, but even after several attempts the upload still fails.
My internet connection is fine; case in point, the client downloaded more work for another project that uses different work and collection servers.
This WU was for project 16435 and took about 8 hours to run on my Radeon RX 570. I hope this can be resolved.
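As an aside, the retry behaviour in the log looks roughly like the sketch below - a minimal Python illustration, not the actual client code, and the backoff constants are my own assumptions:

import random
import time

def try_upload(host: str, port: int = 8080) -> bool:
    """Stand-in for the real HTTP upload; here it just fails at random."""
    print(f"Connecting to {host}:{port}")
    return random.random() < 0.2  # pretend the server is mostly unreachable

def send_results(work_server: str, collection_server: str) -> bool:
    # Work server first, then the collection server as a fallback,
    # mirroring the WS -> CS order in the log above.
    return any(try_upload(host) for host in (work_server, collection_server))

def send_with_backoff(ws: str, cs: str, attempts: int = 6) -> bool:
    delay = 60.0  # assumed initial delay; the real client's constants differ
    for _ in range(attempts):
        if send_results(ws, cs):
            return True
        print(f"Transfer failed, next attempt in {delay:.0f} s")
        time.sleep(delay)
        delay = min(delay * 2, 6 * 3600)  # exponential backoff, capped
    return False  # the real client eventually deletes the WU at expiration

send_with_backoff("3.133.76.19", "3.21.157.11")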
Greetings,
Harm
Last edited by hnapel on Thu Apr 30, 2020 10:12 am, edited 1 time in total.
Re: WUs not being sent to work server
According to the server stats page, the work server is aws1.foldingathome.org (we will need to wake Joseph from his slumber) and the collection server is aws2.foldingathome.org.
Re: WUs not being sent to work server
Many posts/threads on this … a known issue which the team is working to resolve as quickly as possible … your client will keep trying to upload until it hopefully succeeds, or until expiration is reached, at which point the client will delete the WU. That would be unfortunate, but sometimes these issues happen and take longer than anyone would like to resolve.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
Re: WUs not being sent to work server
Now getting this:
10:26:58:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
10:26:58:WU00:FS01:Trying to send results to collection server
10:26:58:WU00:FS01:Uploading 133.91MiB to 3.21.157.11
10:26:58:WU00:FS01:Connecting to 3.21.157.11:8080
10:27:19:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
10:27:19:WU00:FS01:Connecting to 3.21.157.11:80
10:27:19:WU00:FS01:Upload 0.05%
10:27:29:WU01:FS01:0x22:Completed 650000 out of 1000000 steps (65%)
10:28:31:WU00:FS01:Upload 0.14%
10:28:31:ERROR:WU00:FS01:Exception: Transfer failed
The infrastructure obviously sucks. It's also lame that it is IP-based; you should load-balance using DNS.
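To illustrate what I mean: with DNS-based balancing, one hostname can resolve to several upload targets instead of the client being pinned to a single hard-coded IP. A quick sketch using only the Python standard library (with the aws1 hostname mentioned above):

import socket

def upload_targets(hostname: str, port: int = 8080) -> list[str]:
    # Collect every A/AAAA record behind the hostname; a client could
    # try each address in turn rather than depending on one fixed IP.
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

print(upload_targets("aws1.foldingathome.org"))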
Re: WUs not being sent to work server
hnapel wrote: Now getting this: … The infrastructure obviously sucks. It's also lame that it is IP-based; you should load-balance using DNS.
DNS can be spoofed; an IP cannot. A single IP could also be dozens of actual servers behind a load balancer. Networks can work in many ways.
single 1070
Re: WUs not being sent to work server
What is obvious to one person maybe isn't to another … I see a massively evolved infrastructure coordinating the largest distributed compute resource ever, having some teething issues with newly incorporated elements of the infrastructure hosting newly developed styles of projects … but I accept that might be obviously sucking from some perspectives.
Work is ongoing to resolve the non-trivial issues causing this, and with luck the infrastructure will cease to suck soon.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
Re: WUs not being sent to work server
Still, the infrastructure is fragmented: there are the two big assignment servers that process lots of requests but only tiny loads, and then each project gets one work server and one collection server. It's not centralised, so you cannot quickly relocate. You can say that in theory even IP-based addressing can be load-balanced, but this project does not do that, so if the two IPs the WU knows about cannot respond, the WU risks timing out and the work being lost. That sucks.
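Schematically, as I understand it, every WU carries exactly two hard-coded return addresses and nothing else - a toy illustration, with the addresses taken from my own log:

wu = {
    "project": 16435,
    "work_server": "3.133.76.19",        # the server that issued the WU
    "collection_server": "3.21.157.11",  # the only fallback
}

def can_return(reachable: set[str]) -> bool:
    # The result can only be returned if at least one of the two
    # hard-coded addresses responds; there is no third option.
    return bool({wu["work_server"], wu["collection_server"]} & reachable)

print(can_return(set()))             # False -> the WU eventually expires
print(can_return({"3.21.157.11"}))   # True  -> buffered at the CS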
Re: WUs not being sent to work server
hnapel wrote: Still, the infrastructure is fragmented: there are the two big assignment servers … so if the two IPs the WU knows about cannot respond, the WU risks timing out and the work being lost. That sucks.
Can you send me the infrastructure diagram that you have of how it's all set up? I'd love to see that.
single 1070
Re: WUs not being sent to work server
It's pretty obvious from this page: https://apps.foldingathome.org/serverstats. Also, I don't need replies from fanboys for whom no criticism is acceptable; I need someone to look at those servers. Maybe we have to wait until America wakes up. Call Jeff Bezos himself if you like, to get those servers going.
Re: WUs not being sent to work server
Criticise all you like, but you have no idea how it is configured or why it is configured like that. Be constructive and offer some help, maybe.
Saying it's "lame"? And you "need someone to look at those servers"?
This is a project where the number of volunteers increased by a factor of twenty, but there was no budget increase to go with that. Some additional servers have been donated, but the number of IT staff looking after this hasn't changed, and the number of scientists generating the work hasn't changed. This isn't a 24/7 commercial organisation.
single 1070
Re: WUs not being sent to work server
If you rail in an unreasonable manner about something you patently know next to nothing about, then you have to expect a certain amount of push-back … It isn't about people being fanboys, or not accepting criticism, but rather about them trying to help you understand that the simple answer you think you see isn't actually as relevant (at this time) as you think it is, because you don't understand the complexity of what you are trying to discuss.
The scale and scope of this compute resource is not trivial, and the people who look after the coordination infrastructure do the best with what they have … Yes, there are issues at the moment … Yes, they are impacting many people and the progress of science … Yes, people are working damn hard to try and get them resolved … No, no one is happy with lost science, least of all the researchers … No, this isn't a situation that anyone would want to be in, with ongoing problems proving extremely difficult to resolve … and No, saying stuff sucks isn't going to speed this up and get it resolved any quicker - but if you think it helps then feel free to state your opinions, just as others might feel free to state contradictory ones.
The infrastructure is fragmented - all around the globe - because that is what is available and that is where the science happens - and much of it isn't under centralised "FAH" control but hosted on kit that belongs to the academic institutions and organisations that support FAH … but actually, is it not reasonable to expect that a worldwide distributed computing resource is coordinated by a worldwide infrastructure?
Oh yes, if you have Jeff's phone number do call him … but you may find that technical teams from a number of the large providers are already engaged in resolving this, and so his team may actually already be helping.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
Re: WUs not being sent to work server
It's clear this project scaled too fast and the meagre number of backend servers can't cope with the load. I understand redesigning this thing takes effort, but my points are valid: the design is one work server and one collection server, addressed by IP, and if they both fail you're toast. Your first-line support does actually nothing to fix my issue.
Re: WUs not being sent to work server
Actually it is worse than that … the way the science works (at the moment) requires the WU to return to the specific WS it was issued from, so that it can be processed and the next generation created - tbh the CSs just act as an interim buffer for returning WUs and actually slow down the overall science - they were originally intended only to cover WS outages, not to be a regular part of the workflow.
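In effect the flow is something like this - a toy sketch of my reading of it, not actual FAH server code:

from collections import defaultdict

class WorkServer:
    def __init__(self, address: str) -> None:
        self.address = address
        self.completed: list[str] = []

    def accept(self, unit_id: str) -> None:
        # Only the issuing WS can take the result forward: this is
        # where the next generation of the trajectory gets created.
        self.completed.append(unit_id)

class CollectionServer:
    def __init__(self) -> None:
        # Results parked here, keyed by the WS that issued them.
        self.buffer: dict[str, list[str]] = defaultdict(list)

    def accept(self, origin_ws: str, unit_id: str) -> None:
        self.buffer[origin_ws].append(unit_id)  # hold until the WS is back

    def drain(self, ws: WorkServer) -> None:
        # Forward buffered results once the owning WS is reachable again.
        for unit_id in self.buffer.pop(ws.address, []):
            ws.accept(unit_id)

cs = CollectionServer()
cs.accept("3.133.76.19", "project:16435 run:2837 clone:0 gen:1")
ws = WorkServer("3.133.76.19")
cs.drain(ws)
print(ws.completed)  # the WS finally gets the result and can extend the run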
The project has scaled too quickly in many respects … but mostly this scaling has actually been surprisingly effective … and there are suspicions that it is not simply the scaling causing these problems, but the nature of a number of the new servers that have been quickly instantiated, along with the style of a number of the latest projects.
I'm not making excuses, but the driver for this rapid expansion, the new style of kit and the latest types of projects is Covid-19 … Whilst growing this fast has unearthed a few "niggles" (OK, bloody great problems), the overall throughput has actually been increased by a margin that far, far outweighs the specific losses from the issues encountered.
That is of little comfort to everyone desperately trying to help, contribute and deliver research while the current problems are confounding that … and I do truly understand your frustration … Yes, this whole situation sucks, as do the various situations that exist with these project/server/large-WU/documentation/installation issues - you are absolutely right … It will get resolved, maybe not before some people run out of patience and give up on FAH, but it will be as soon as the team can possibly make it.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
Re: WUs not being sent to work server 3.*
Thanks, I honestly hope it will be resolved soon, for the greater cause of fighting the disease!
Re: WUs not being sent to work server 3.*
Add me to the list:
17:18:07:WU04:FS02:Sending unit results: id:04 state:SEND error:FAULTY project:16435 run:3581 clone:0 gen:4 core:0x22 unit:0x0000000403854c135e9a4ef6eb50b95d
17:18:07:WU04:FS02:Uploading 168.82MiB to 3.133.76.19
17:18:07:WU04:FS02:Connecting to 3.133.76.19:8080
17:18:08:WARNING:WU04:FS02:Exception: Failed to send results to work server: Transfer failed
17:18:08:WU04:FS02:Trying to send results to collection server
17:18:08:WU04:FS02:Uploading 168.82MiB to 3.21.157.11
17:18:08:WU04:FS02:Connecting to 3.21.157.11:8080
17:18:08:ERROR:WU04:FS02:Exception: Transfer failed
Attempts: 13
Next attempt 5 Hours 00 Mins
Timeout: 2020-04-30T21:07:59Z
So this: in 11 hours 16 mins the work unit is going down the drain, because the next upload attempt is scheduled well after the timeout.
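The arithmetic is easy to check; assuming the status above was read at about 17:18 UTC (the timestamp on the log lines):

from datetime import datetime, timedelta, timezone

now = datetime(2020, 4, 30, 17, 18, tzinfo=timezone.utc)  # assumed from the log stamps
next_attempt = now + timedelta(hours=5)
timeout = datetime(2020, 4, 30, 21, 7, 59, tzinfo=timezone.utc)

print(next_attempt > timeout)  # True: the retry lands past the timeout
print(timeout - now)           # ~3:50 left versus a 5-hour wait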
Edit: Just checked. Both servers are listed as operational, status OK. ??
BTW, in a video Linus states that you require at least 60 terabytes of RAID 5 storage and a 1-gigabit internet connection in order to set up a server to help with the workload. Maybe you should consider supporting work servers with less storage?
I would like to help, but as a pensioner there is NO way I can afford 90 TB of HDD storage.
ChrisD5710
Last edited by ChrisD5710 on Thu Apr 30, 2020 6:11 pm, edited 3 times in total.