Work Unit stuck in "Send" for multiple days

Moderators: Site Moderators, FAHC Science Team

joncrane
Posts: 10
Joined: Mon Oct 19, 2020 6:00 pm

Work Unit stuck in "Send" for multiple days

Post by joncrane »

Hey guys, I did a quick search and nothing came up, though there was apparently a problem in April where the owner of a server kinda dropped the ball (viewtopic.php?f=108&t=34768&p=329819&hilit=Waiting+on+Send+Results#p329470). In my case, a slot is perpetually stuck in "Send" mode, and the log messages say:

00:46:32:WU03:FS00:Sending unit results: id:03 state:SEND error:NO_ERROR project:14257 run:0 clone:4022 gen:181 core:0xa7 unit:0x000000d0cedfaa920000000000000fb6
00:46:32:WU03:FS00:Uploading 2.86MiB to 206.223.170.146
00:46:32:WU03:FS00:Connecting to 206.223.170.146:8080
00:46:35:WARNING:WU03:FS00:WorkServer connection failed on port 8080 trying 80
00:46:35:WU03:FS00:Connecting to 206.223.170.146:80
00:46:37:WARNING:WU03:FS00:Exception: Failed to send results to work server: Failed to connect to 206.223.170.146:80: No connection could be made because the target machine actively refused it.
What's going on? Is there anything I can do to rectify this problem?
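For anyone who wants to pull the relevant details out of a log excerpt like the one above, here is a minimal Python sketch. The regular expressions are only inferred from the six lines quoted in this post, not from any documented FAHClient log format, and `summarize` is a hypothetical helper name.

```python
import re

# Pattern inferred from the quoted lines: HH:MM:SS, an optional WARNING/ERROR
# tag, then the work-unit (WUnn) and folding-slot (FSnn) identifiers.
# This is an assumption, not a documented log format.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?:(?P<level>WARNING|ERROR):)?"
    r"WU(?P<wu>\d+):FS(?P<slot>\d+):(?P<msg>.*)$"
)
CONNECT_RE = re.compile(r"Connecting to (?P<host>[\d.]+):(?P<port>\d+)")


def summarize(log_text):
    """Return (attempted endpoints, warning messages) from a log excerpt."""
    endpoints, warnings = [], []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a WU log line
        c = CONNECT_RE.search(m["msg"])
        if c:
            endpoints.append((c["host"], int(c["port"])))
        if m["level"] == "WARNING":
            warnings.append(m["msg"])
    return endpoints, warnings
```

Run on the excerpt above, this lists the two connection attempts (port 8080, then port 80) and the two warnings, which makes it easy to see which server and ports the client keeps retrying.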
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Work Unit stuck in "Send" for multiple days

Post by bruce »

No, there's nothing you can do. fah1.innovatr.ca is down and there is no CS defined.

I don't have any useful information about what's going on there.
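If you want to confirm from your own machine that the server is refusing connections (rather than something local blocking them), a rough Python sketch like the following can help. It only uses the standard `socket` module; `probe` and `check_work_server` are hypothetical helper names, and the port order (8080, then 80) is simply what the log above shows the client trying.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify a single TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"      # the connection was actively refused
    except socket.timeout:
        return "timeout"      # no answer at all within the timeout
    except OSError:
        return "unreachable"  # DNS failure, no route, and similar errors


def check_work_server(host, ports=(8080, 80)):
    """Probe the ports in the order the client log shows them being tried."""
    return {port: probe(host, port) for port in ports}
```

Calling `check_work_server("206.223.170.146")` returns one status per port. If every port comes back "refused" or "timeout", the problem is on the server side and waiting is the only option.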
psaam0001
Posts: 378
Joined: Mon May 18, 2020 2:02 am
Location: Ruckersville, Virginia, USA

Re: Work Unit stuck in "Send" for multiple days

Post by psaam0001 »

Just stay calm, and keep folding.

BTW: We are as of 12:15 PM ET/0715 GMT a little bit over 92% of the way towards completing Sprint #5. :D

Paul
joncrane
Posts: 10
Joined: Mon Oct 19, 2020 6:00 pm

Re: Work Unit stuck in "Send" for multiple days

Post by joncrane »

bruce wrote:No, there's nothing you can do. fah1.innovatr.ca is down and there is no CS defined.

I don't have any useful information about what's going on there.
Thank you, bruce.
jucohen
Posts: 4
Joined: Wed Mar 04, 2020 1:33 pm

Re: Work Unit stuck in "Send" for multiple days

Post by jucohen »

Hey, just catching up on this one.

I am the sysadmin here at Cisco Systems who administers this data centre/hardware on behalf of the F@H team. We had electrical power maintenance in the building, and for safety reasons we had to power the system off for a few days. Due to COVID restrictions, we then had to wait to get it back on.

Sorry about that. We keep this box up as close to 100% of the time as we can, but with the performance it has demonstrated throughout its life, it seems to hand out work units super quick.
Justin Cohen
FAH1.INNOVATR.CA Hardware Owner/Admin
Innovation Architect
Toronto Innovation Lab
Cisco Systems
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Work Unit stuck in "Send" for multiple days

Post by bruce »

Was this planned maintenance or emergency repairs? It would have been nice to have received notice here and/or on FAH's social sites.

Even with short notice, the project owner could have configured Collection Server(s) in an alternate domain, and some results could have been returned sooner.
jucohen
Posts: 4
Joined: Wed Mar 04, 2020 1:33 pm

Re: Work Unit stuck in "Send" for multiple days

Post by jucohen »

We do notify the FAH team when we need to do some work. Sometimes it just makes sense to leave things as they are, rather than start reconfiguring things or trying to move TBs of data around.
Justin Cohen
FAH1.INNOVATR.CA Hardware Owner/Admin
Innovation Architect
Toronto Innovation Lab
Cisco Systems
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Work Unit stuck in "Send" for multiple days

Post by bruce »

No data would need to be moved ... and I'd certainly have recommended against that anyway. Collection Servers should be established whenever possible, and long enough before the outage for that information to be included with all new assignments.

The function of a Collection Server is to accept completed WUs on behalf of the primary Work Server and forward them to said server when it comes back up. At this moment, that server has a CS configured as 128.252.203.10. I do not know when that configuration was established, but all WUs distributed after it was should be automatically managed by the FAH system.

If the WU was distributed before the CS was established, we can expect that the WU will retry until the Work Server comes back up as you've described elsewhere.
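To illustrate the behavior described above, here is a minimal Python sketch of the fallback order. It is not the actual FAHClient code: `upload_results` and the `send` callback are hypothetical, and the assumption that the Work Server is always tried before any Collection Server is mine.

```python
from typing import Callable, Optional, Sequence


def upload_results(
    work_server: str,
    collection_servers: Sequence[str],
    send: Callable[[str], bool],
) -> Optional[str]:
    """Try the Work Server first, then each Collection Server in order.

    `send` is a stand-in for the real upload and returns True on success.
    Returns the address that accepted the results, or None if every
    attempt failed and the caller should retry later.
    """
    for address in (work_server, *collection_servers):
        if send(address):
            return address
    return None
```

A WU assigned before the CS was configured would carry an empty `collection_servers` list, so the function keeps returning None until the Work Server itself is back. A WU assigned afterwards would carry 128.252.203.10 and could be returned there while the Work Server is down.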