Work Unit stuck in "Send" for multiple days

Posted: Tue Nov 24, 2020 12:52 am
by joncrane
Hey guys, I did a quick search and nothing came up, though there was apparently a problem in April where the owner of a server kinda dropped the ball (viewtopic.php?f=108&t=34768&p=329819&hilit=Waiting+on+Send+Results#p329470). In my case, a slot is perpetually in "Send" mode, and the log messages say:

Code:

00:46:32:WU03:FS00:Sending unit results: id:03 state:SEND error:NO_ERROR project:14257 run:0 clone:4022 gen:181 core:0xa7 unit:0x000000d0cedfaa920000000000000fb6
00:46:32:WU03:FS00:Uploading 2.86MiB to 206.223.170.146
00:46:32:WU03:FS00:Connecting to 206.223.170.146:8080
00:46:35:WARNING:WU03:FS00:WorkServer connection failed on port 8080 trying 80
00:46:35:WU03:FS00:Connecting to 206.223.170.146:80
00:46:37:WARNING:WU03:FS00:Exception: Failed to send results to work server: Failed to connect to 206.223.170.146:80: No connection could be made because the target machine actively refused it.
What's going on? Is there anything I can do to rectify this problem?

Re: Work Unit stuck in "Send" for multiple days

Posted: Tue Nov 24, 2020 1:50 am
by bruce
No, there's nothing you can do. fah1.innovatr.ca is down and there is no CS defined.

I don't have any useful information about what's going on there.

Re: Work Unit stuck in "Send" for multiple days

Posted: Tue Nov 24, 2020 5:21 am
by psaam0001
Just stay calm, and keep folding.

BTW: As of 12:15 PM ET/0715 GMT, we are a little over 92% of the way towards completing Sprint #5. :D

Paul

Re: Work Unit stuck in "Send" for multiple days

Posted: Tue Nov 24, 2020 4:14 pm
by joncrane
bruce wrote: No, there's nothing you can do. fah1.innovatr.ca is down and there is no CS defined.

I don't have any useful information about what's going on there.
Thank you, bruce.

Re: Work Unit stuck in "Send" for multiple days

Posted: Tue Jan 05, 2021 4:43 pm
by jucohen
Hey, just catching up on this one.

I am the sysadmin here at Cisco Systems who administers this data centre/hardware on behalf of the F@H team. We had electrical power maintenance in the building, and for safety reasons we had to power the system off for a few days. Due to COVID restrictions, we then had to wait to get it back on.

Sorry about that. We keep this box up as close to 100% of the time as we can, but with the performance it has demonstrated throughout its life, it seems to hand out work units super quick.

Re: Work Unit stuck in "Send" for multiple days

Posted: Wed Jan 06, 2021 7:59 am
by bruce
Was this planned maintenance or emergency repairs? It would have been nice to have received notice here and/or on FAH's social sites.

Even with short notice, the project owner could have configured Collection Server(s) in an alternate domain, and some results could have been returned sooner.

Re: Work Unit stuck in "Send" for multiple days

Posted: Thu Jan 21, 2021 7:32 pm
by jucohen
We do notify the FAH team when we need to do some work. Sometimes it just makes sense to leave things how they are, rather than start reconfiguring things or trying to move TBs of data around.

Re: Work Unit stuck in "Send" for multiple days

Posted: Thu Jan 21, 2021 9:48 pm
by bruce
No data would need to be moved ... and I'd certainly have recommended against that anyway. Collection Servers should be established whenever possible, and long enough before the outage for that information to be included with all new assignments.

The function of a Collection Server is to accept completed WUs on behalf of the primary Work Server and forward them to said server when it comes back up. At this moment, that server has a CS configured as 128.252.203.10. I do not know when that configuration was established, but all WUs distributed after it was established should be automatically managed by the FAH system.

If the WU was distributed before the CS was established, we can expect that the WU will retry until the Work Server comes back up as you've described elsewhere.
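
To make the order of operations concrete, here is a rough Python sketch of the behaviour described above: try the Work Server on port 8080, then on port 80 (the order shown in the log earlier in this thread), then the Collection Server if the WU was assigned with one, and otherwise leave the WU in SEND for a later retry. This is only an illustration, not the FAH client's actual code. The function names are made up, and trying the CS on the same two ports is an assumption.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical names throughout; this is NOT the FAH client's code.
# Port order taken from the log in the first post (8080, then 80).
# Assumption: the Collection Server is tried on the same ports.
UPLOAD_PORTS = (8080, 80)


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


def pick_upload_target(
    work_server: str,
    collection_server: Optional[str],
    probe: Callable[[str, int], bool] = can_connect,
    ports: Sequence[int] = UPLOAD_PORTS,
) -> Optional[Tuple[str, int]]:
    """Try the Work Server first, then the Collection Server if one is defined.

    Returns the first reachable (host, port), or None, meaning the WU
    stays in SEND and the upload has to be retried later.
    """
    candidates = [work_server]
    if collection_server is not None:
        candidates.append(collection_server)
    for host in candidates:
        for port in ports:
            if probe(host, port):
                return (host, port)
    return None
```

With no CS defined, as was the case for the WU in the first post, a Work Server outage leaves nothing else to try, so `pick_upload_target("206.223.170.146", None)` returns `None` for as long as the server is down. The `probe` parameter is there only so the logic can be exercised with a stand-in function instead of real network calls.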