
Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 3:24 am
by PantherX
durval wrote:...My client is reporting "Connection timed out", so (with all due respect) I beg to differ: it's not some "logical" problem like not assigning new WUs -- on the contrary, it means the IP address assigned to the server is not answering. It could be because the server has crashed (the OS, not the F@H software), it has been powered down, or even (sorry) that its LAN cable has indeed been unplugged. Otherwise we would be seeing other errors like "Connection reset"...
I guess we might be describing the issue from two perspectives. Mine describes what you see, while yours goes into more technical detail.
durval wrote:...
00:26:50:WU04:FS01:Upload complete
00:26:50:WU04:FS01:Server responded WORK_QUIT (404)
00:26:50:WARNING:WU04:FS01:Server did not like results, dumping
Does that mean that it was all in vain? :cry:
Unfortunately, it means that the WU didn't pass the validation test that the Server ran. If a WU fails the validation test, the result is discarded and you don't get any credit.

Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 3:30 am
by PantherX
VAcharonD1 wrote:I'm still timing out when trying to connect. My WU is going to time out soon, too.
Please note that if the completed WU is successfully uploaded before the Timeout date, it will get the bonus credits.
If the completed WU is successfully uploaded after the Timeout date but before the Expiration date, it will get base credits.
Once the WU reaches the Expiration date, it will automatically be deleted from the client.
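
For anyone who wants the three cases above spelled out, here is a minimal sketch in Python. It is not taken from the F@H client or server: the function name, the enum, and the example dates are all hypothetical, and how an upload landing exactly on a deadline is treated is my assumption.

```python
from datetime import datetime
from enum import Enum


class Credit(Enum):
    BONUS = "bonus credits"          # uploaded before the Timeout date
    BASE = "base credits"            # uploaded after Timeout, before Expiration
    NONE = "WU deleted, no credit"   # Expiration date reached


def credit_for_upload(uploaded_at: datetime,
                      timeout: datetime,
                      expiration: datetime) -> Credit:
    """Classify an upload against the WU's Timeout and Expiration dates.

    Hypothetical helper: boundary handling (strict '<') is an assumption.
    """
    if timeout > expiration:
        raise ValueError("Timeout date must not be later than Expiration date")
    if uploaded_at < timeout:
        return Credit.BONUS
    if uploaded_at < expiration:
        return Credit.BASE
    return Credit.NONE


# Made-up dates, for illustration only.
timeout = datetime(2020, 4, 17, 0, 0)
expiration = datetime(2020, 4, 20, 0, 0)
print(credit_for_upload(datetime(2020, 4, 16, 12, 0), timeout, expiration).value)
print(credit_for_upload(datetime(2020, 4, 18, 12, 0), timeout, expiration).value)
print(credit_for_upload(datetime(2020, 4, 21, 12, 0), timeout, expiration).value)
```

So a WU stuck behind an unreachable server only loses its bonus once the Timeout date passes, and is lost entirely only at the Expiration date.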

Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 4:50 am
by VAcharonD1
The estimated credit has dropped by over 20,000 points. I know points are mainly for internet bragging rights, but insofar as they represent the value of one's contribution, the value of this one is dropping rapidly. I hope the project finds a way to route around single points of failure like this.

Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 5:08 am
by Sn0wy23
At 6:00 GMT the server is still showing as Down and my WUs have no place to go, just as others are reporting.
Hope that the server can be sorted out so that the WUs can be uploaded and we don't lose valuable work.

I have had a few "machine did not like the results" over the last 3 days, but still getting and sending WUs when they are available and the server will accept my uploads.
A quick pause of the Slots in the clients happily resets the timer, so the time between retries is not huge.

Typed with thumbs and with only half an hour of sleep in the last 4 days, so excuse the typos and grammar.
Glad it isn't just myself with issues!

Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 8:27 am
by PantherX
Good news is that I have just received confirmation that the issue should be resolved for 13.82.98.119 so hopefully, your completed WU will be accepted soon. We appreciate your patience during this :)

Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 8:43 am
by Epsilon_Process
PantherX wrote:Good news is that I have just received confirmation that the issue should be resolved for 13.82.98.119 so hopefully, your completed WU will be accepted soon. We appreciate your patience during this :)
Yes, looks good now. I restarted my clients, and all my backlogged WUs were accepted and points were logged. Looks like the server version was updated in the process, to 9.6.7. Thanks!

Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 1:50 pm
by durval
PantherX wrote:Good news is that I have just received confirmation that the issue should be resolved for 13.82.98.119 so hopefully, your completed WU will be accepted soon. We appreciate your patience during this :)
Thanks for the feedback, @PantherX.
Epsilon_Process wrote:Yes, looks good now. I restarted my clients, and all my backlogged WUs were accepted and points were logged. Looks like the server version was updated in the process, to 9.6.7. Thanks!
I second that: my other clients were able to upload their pending WUs and log points (albeit severely reduced by losing most of the QRB due to the wait). But all seems normal now.
Sn0wy23 wrote:I have had a few "machine did not like the results" over the last 3 days, but still getting and sending WUs when they are available and the server will accept my uploads.
That sucks :? I just lost that one WU recently (the other one I lost so far was late last week).
Quick pause of the Slots in the clients happily resets the timer so the time between retries is not huge.
Interesting. I've been restarting the client (and sometimes losing a few % on the other slot) in order to avoid hours-long retries, as recommended in a sticky post somewhere here in the Forum. Nice to hear that just pausing/unpausing the slots also works; I will try that from now on. Thanks for the info, @Sn0wy23.
PantherX wrote:I guess we might be describing the issue from two perspectives. Mine describes what you see, while yours goes into more technical detail.
Yep, I'm a sysadmin by trade, so I tend to see things 'right down to the bare metal'. From an application perspective, what you said is of course correct. BTW, if F@H ever needs sysadmins to take care of those servers, please count me in as a volunteer.
PantherX wrote:Unfortunately, it means that the WU didn't pass the validation test that the Server ran. If a WU fails the validation test, the result is discarded and you don't get any credit.
:e( This is really disappointing, it was like 5 hours of GPU time and 130K points straight down the toilet :(

@PantherX, why does that happen? Like the other WU I lost last week, this particular client runs on an HP Professional machine, sporting a Xeon with ECC RAM and a Quadro P5000 GPU, and on Linux (no crappy Windows here), so I'm pretty sure the fault did not happen at my end...

Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 4:28 pm
by FoldingStorm
PantherX wrote:Good news is that I have just received confirmation that the issue should be resolved for 13.82.98.119 so hopefully, your completed WU will be accepted soon. We appreciate your patience during this :)
Too late unfortunately, at least it only took like 19 hours.
It continued to fail to upload until about 9:59 UTC today, at which point it appears to have just dropped the WU entirely.
Looks like I wasted all that energy.

Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 4:37 pm
by Jan
It's an issue that shouldn't pop up again once the infrastructure has completely adapted to the new scale of users. I feel for your work though. :(

Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 10:52 pm
by PantherX
durval wrote:...@PantherX, why does that happen? Like the other WU I lost last week, this particular client runs on an HP Professional machine, sporting a Xeon with ECC RAM and a Quadro P5000 GPU, and on Linux (no crappy Windows here), so I'm pretty sure the fault did not happen at my end...
In this particular case (where you successfully completed the WU but the Server rejected it), I can think of two reasons (there might be more):
1) Corruption of data while transferring to the Server. If the Server receives invalid data, it will simply discard it. There's no retry.
2) Very rarely, a WU folds fine on one set of hardware, but on another set it picks up just enough corruption that it doesn't fail on the system folding it yet still fails the validation test on the Server.

Re: Stuck sending results 13.82.98.119

Posted: Thu Apr 16, 2020 11:52 pm
by durval
@PantherX, thanks for the excellent, detailed response.
PantherX wrote:
durval wrote:...@PantherX, why does that happen? Like the other WU I lost last week, this particular client runs on an HP Professional machine, sporting a Xeon with ECC RAM and a Quadro P5000 GPU, and on Linux (no crappy Windows here), so I'm pretty sure the fault did not happen at my end...
In this particular case (where you successfully completed the WU but the Server rejected it), I can think of two reasons (there might be more):
1) Corruption of data while transferring to the Server. If the Server receives invalid data, it will simply discard it. There's no retry.
2) Very rarely, a WU folds fine on one set of hardware, but on another set it picks up just enough corruption that it doesn't fail on the system folding it yet still fails the validation test on the Server.
In my experience, reason #1 is very improbable (the transmission is done over TCP, in which every packet has an individual checksum which is verified and causes retransmission of the packet if corrupted, before being sent "up" from the OS to the application).
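
To illustrate the checksum point, here is a rough Python sketch of the 16-bit one's-complement "Internet checksum" (RFC 1071) that TCP carries in each segment. It is a simplification: the real TCP checksum also covers a pseudo-header with the IP addresses, which is left out here, and the function names are mine. The sample bytes are the ones used in the RFC 1071 example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # add each 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def looks_intact(data_with_checksum: bytes) -> bool:
    """Receiver-side check: data plus its checksum must sum to zero."""
    return internet_checksum(data_with_checksum) == 0


payload = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
checksum = internet_checksum(payload)
segment = payload + checksum.to_bytes(2, "big")

print(hex(checksum))             # 0x220d
print(looks_intact(segment))     # True

# Flip a single bit "in transit": the check fails, so the receiving
# stack would drop the segment and the sender would retransmit it.
damaged = bytes([segment[0] ^ 0x01]) + segment[1:]
print(looks_intact(damaged))     # False
```

A damaged segment like this never reaches the application, which is why I consider corruption in transit unlikely to explain a rejected result.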

So I guess we are seeing a case of #2 here.

I do not think it is so rare: over only 241 WUs I've processed so far, it has already happened twice.

If I can be of assistance in finding and resolving the root cause of this, please don't hesitate to contact me.

Thanks again,
-- Durval