completed task wont send

Posted: Tue May 19, 2020 9:16 am
by Badsinger
Hi folks. I have a completed task that won't send. The upload gets to 1.84%, which is oddly about the progress the WU had reached when I hit Finish in FAHControl, and then fails. I'll paste some of the log.

08:31:36:WU00:FS01:0x22:Completed 1980000 out of 2000000 steps (99%)
08:35:12:WU00:FS01:0x22:Completed 2000000 out of 2000000 steps (100%)
08:35:16:WU00:FS01:0x22:Saving result file ..\logfile_01.txt
08:35:16:WU00:FS01:0x22:Saving result file checkpointState.xml
08:35:19:WU00:FS01:0x22:Saving result file checkpt.crc
08:35:19:WU00:FS01:0x22:Saving result file positions.xtc
08:35:21:WU00:FS01:0x22:Saving result file science.log
08:35:21:WU00:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
08:35:21:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
08:35:21:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:14438 run:0 clone:29 gen:19 core:0x22 unit:0x0000001a03854c135e9fc39c6a25c27f
08:35:21:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:35:21:WU00:FS01:Connecting to 3.133.76.19:8080
08:35:27:WU00:FS01:Upload 1.36%
08:35:28:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
08:35:28:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:14438 run:0 clone:29 gen:19 core:0x22 unit:0x0000001a03854c135e9fc39c6a25c27f
08:35:28:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:35:28:WU00:FS01:Connecting to 3.133.76.19:8080
08:35:34:WU00:FS01:Upload 1.84%
08:35:35:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
08:35:43:FS01:Paused
08:35:48:Removing old file 'configs/config-20200513-023020.xml'
08:35:48:Saving configuration to config.xml
08:35:48:<config>
08:35:48:  <!-- Folding Slot Configuration -->
08:35:48:  <max-packet-size v='10'/>
08:35:48:
08:35:48:  <!-- Network -->
08:35:48:  <proxy v=':8080'/>
08:35:48:
08:35:48:  <!-- Slot Control -->
08:35:48:  <pause-on-start v='true'/>
08:35:48:
08:35:48:  <!-- User Information -->
08:35:48:  <passkey v='*****'/>
08:35:48:  <team v='76140'/>
08:35:48:  <user v='Badsinger'/>
08:35:48:
08:35:48:  <!-- Work Unit Control -->
08:35:48:  <next-unit-percentage v='100'/>
08:35:48:
08:35:48:  <!-- Folding Slots -->
08:35:48:  <slot id='1' type='GPU'>
08:35:48:    <paused v='true'/>
08:35:48:  </slot>
08:35:48:</config>
08:36:29:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:14438 run:0 clone:29 gen:19 core:0x22 unit:0x0000001a03854c135e9fc39c6a25c27f
08:36:29:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:36:29:WU00:FS01:Connecting to 3.133.76.19:8080
08:36:35:WU00:FS01:Upload 1.84%
08:36:39:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
08:38:06:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:14438 run:0 clone:29 gen:19 core:0x22 unit:0x0000001a03854c135e9fc39c6a25c27f
08:38:06:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:38:06:WU00:FS01:Connecting to 3.133.76.19:8080
08:38:12:WU00:FS01:Upload 1.84%
08:38:13:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
08:40:43:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:14438 run:0 clone:29 gen:19 core:0x22 unit:0x0000001a03854c135e9fc39c6a25c27f
08:40:43:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:40:43:WU00:FS01:Connecting to 3.133.76.19:8080
08:40:49:WU00:FS01:Upload 1.84%
08:40:50:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
08:44:57:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:14438 run:0 clone:29 gen:19 core:0x22 unit:0x0000001a03854c135e9fc39c6a25c27f
08:44:57:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:44:57:WU00:FS01:Connecting to 3.133.76.19:8080
08:45:03:WU00:FS01:Upload 1.52%
08:45:05:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
The client is 7.6.13 on Windows 7. The GPU is an Nvidia 1660 Ti. This last bit is the first part of the log:

02:38:08:WU00:FS01:0x22:Reading tar file core.xml
02:38:08:WU00:FS01:0x22:Reading tar file integrator.xml
02:38:08:WU00:FS01:0x22:Reading tar file state.xml
02:38:08:WU00:FS01:0x22:Reading tar file system.xml
02:38:08:WU00:FS01:0x22:Digital signatures verified
02:38:08:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
02:38:08:WU00:FS01:0x22:Version 0.0.5
02:38:23:WU00:FS01:0x22:Completed 0 out of 2000000 steps (0%)
02:38:23:WU00:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
02:41:54:WU00:FS01:0x22:Completed 20000 out of 2000000 steps (1%)
02:43:23:FS01:Finishing
02:45:28:WU00:FS01:0x22:Completed 40000 out of 2000000 steps (2%)

Thanks for the help.

Mod Edit: Added Code Tags - PantherX

Re: completed task wont send

Posted: Tue May 19, 2020 9:20 am
by PantherX
Welcome to the F@H Forum Badsinger,

Please note that Server 3.133.76.19 is under load so it might take a bit of time for the upload to happen. The best thing to do is to leave your client folding and it will automatically attempt to upload :)
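
To see how the client spaces out its automatic retries, the upload attempts in the log from the first post can be pulled out with a short script. This is only a sketch: the regular expression assumes lines shaped like the ones pasted above, the function names are made up for illustration, and it is not an official F@H tool.

```python
import re
from datetime import datetime

# Upload-related lines copied from the log in the first post.
LOG = """\
08:35:21:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:35:27:WU00:FS01:Upload 1.36%
08:35:28:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
08:35:28:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:35:34:WU00:FS01:Upload 1.84%
08:35:35:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
08:36:29:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:36:35:WU00:FS01:Upload 1.84%
08:36:39:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
08:38:06:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:38:12:WU00:FS01:Upload 1.84%
08:38:13:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
08:40:43:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:40:49:WU00:FS01:Upload 1.84%
08:40:50:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
08:44:57:WU00:FS01:Uploading 78.05MiB to 3.133.76.19
08:45:03:WU00:FS01:Upload 1.52%
08:45:05:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
"""

# Matches the start of an upload attempt, e.g. "08:35:21:WU00:FS01:Uploading ...".
ATTEMPT = re.compile(r"^(\d\d:\d\d:\d\d):WU\d+:FS\d+:Uploading ")


def attempt_times(log: str) -> list:
    """Return the timestamp of every 'Uploading' line in the log text."""
    times = []
    for line in log.splitlines():
        match = ATTEMPT.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    return times


def retry_gaps(log: str) -> list:
    """Seconds between consecutive upload attempts (ignores midnight wrap)."""
    times = attempt_times(log)
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


print(retry_gaps(LOG))  # [7, 61, 97, 157, 254]
```

The gaps grow with each failure, so the client is backing off between attempts rather than hammering the server, which fits the advice to just leave it running.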

Re: completed task wont send

Posted: Tue May 19, 2020 8:59 pm
by G3WGV
PantherX wrote:Please note that Server 3.133.76.19 is under load so it might take a bit of time for the upload to happen.
Very noticeably so: my client just took 17 minutes to upload 64Mb to 3.133.76.19 with an uplink speed of 20Mbit/s. Is there no load sharing mechanism, or are all the servers struggling? I seem to hit 3.133.76.19 every time. It would be very interesting to get some idea of the FAH server architecture, to better understand what is going on.
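
As a rough check on those figures, the arithmetic can be sketched as below. It assumes the 64 in the post means megabytes (a guess, since the client log reports sizes in MiB) and ignores protocol overhead; the function names are illustrative only.

```python
def effective_mbit_per_s(size_mbytes: float, seconds: float) -> float:
    """Average throughput actually achieved, in Mbit/s."""
    return size_mbytes * 8 / seconds


def ideal_seconds(size_mbytes: float, link_mbit_per_s: float) -> float:
    """Transfer time if the whole uplink were available, in seconds."""
    return size_mbytes * 8 / link_mbit_per_s


# Assumption: 64 megabytes uploaded in 17 minutes over a 20 Mbit/s uplink.
observed = effective_mbit_per_s(64, 17 * 60)
ideal = ideal_seconds(64, 20)
print(f"observed ~{observed:.2f} Mbit/s, ideal transfer ~{ideal:.1f} s")
```

Under that assumption the upload averaged about 0.50 Mbit/s, against roughly 25.6 seconds at full line rate, which points at the server end rather than the uplink.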

Re: completed task wont send

Posted: Tue May 19, 2020 9:04 pm
by Neil-B
As I understand it, each project dispenses its WUs from one Work Server (WS), and the results need to get back to that same WS (sometimes via a Collection Server, or CS). So in a way a CS can balance the load, but the results still have to get back to the WS before the next WU can be issued.

The Fireside Dev chat that was recorded gives a fair overview of the infrastructure and some of the challenges ... https://stanford.zoom.us/rec/play/7pV-d ... 6462356000

Re: completed task wont send

Posted: Tue May 19, 2020 10:25 pm
by G3WGV
Thanks Neil. I'm guessing that fireside chat is fairly recent but can't seem to find a date stamp. I didn't know they happened so I'll work my way through this one and keep an eye out for others.

By inspection I can see that WUs are indeed uploaded to the issuing WS (or its CS, which seems to happen a lot), but the client starts requesting the next WU when the previous one is at 99% complete, so there does not appear to be any concept of waiting for completion before issuing a new WU. When things get bogged down I might be halfway through the next WU assignment before the previous one has finished uploading!

Re: completed task wont send

Posted: Wed May 20, 2020 2:54 am
by bruce
You can set NEXT-UNIT-PERCENTAGE to 100 in the advanced control application. The default of 99 was established when there were lots of WUs that needed work and Donors complained that their machine was idle (briefly) between WUs.

As for it getting bogged down, I don't know if this will help. That depends mostly on how busy the server is (or the CS, if one is operable).
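
For reference, the setting shows up in the config dump in the first post as <next-unit-percentage v='100'/>. A minimal sketch of changing that element in a copy of the XML follows. It assumes the element and attribute names seen in that dump, works on a sample string rather than the real config.xml, and uses a made-up function name; the advanced control application remains the supported way to change it.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the config dump in the first post, not a real config.xml.
SAMPLE = "<config><next-unit-percentage v='99'/><team v='76140'/></config>"


def set_next_unit_percentage(config_xml: str, percent: int) -> str:
    """Return the config XML with next-unit-percentage set to `percent`."""
    if not 0 <= percent <= 100:
        # Basic sanity check only; the client may enforce a narrower range.
        raise ValueError("percent must be between 0 and 100")
    root = ET.fromstring(config_xml)
    node = root.find("next-unit-percentage")
    if node is None:
        # Element absent: add it under the root <config> element.
        node = ET.SubElement(root, "next-unit-percentage")
    node.set("v", str(percent))
    return ET.tostring(root, encoding="unicode")


updated = set_next_unit_percentage(SAMPLE, 100)
print(updated)
```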

Re: completed task wont send

Posted: Wed May 20, 2020 4:36 am
by PantherX
G3WGV wrote:...I'm guessing that fireside chat is fairly recent but can't seem to find a date stamp. I didn't know they happened so I'll work my way through this one and keep an eye out for others...
It was on Thursday 9 April 2020: viewtopic.php?f=16&t=34136

Re: completed task wont send

Posted: Wed May 20, 2020 1:46 pm
by NBR
bruce wrote:You can set NEXT-UNIT-PERCENTAGE to 100 in the advanced control application. The default of 99 was established when there were lots of WUs that needed work and Donors complained that their machine was idle (briefly) between WUs.

As for it getting bogged down, I don't know if this will help. That depends mostly on how busy the server is (or the CS, if one is operable).
I have changed mine to 100; I think it is worth it because of the bonus points.

Re: completed task wont send

Posted: Wed May 20, 2020 2:25 pm
by Neil-B
I tend to use the 99% default. For the most part my TPFs are less than 60 seconds, so there is not much loss of points or delay to the science. It also allows for a couple of failed assignment attempts while still getting a new WU in time for the client to start on it immediately, so the CPUs are kept under load and don't cool down, which minimises heat cycling. But I have the luxury of fairly high core-count CPU slots.
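
The trade-off can be put into numbers with a small sketch. It assumes one frame equals 1% of a WU (the log in the first post reports progress in 1% steps), so the lead time before the current WU finishes is simply the remaining frames multiplied by the TPF. The function name is illustrative.

```python
def lead_time_seconds(tpf_seconds: float, next_unit_percentage: int) -> float:
    """Time between requesting the next WU and finishing the current one.

    Assumes one frame is 1% of the WU, so the remaining frames are
    (100 - next_unit_percentage).
    """
    if not 0 <= next_unit_percentage <= 100:
        raise ValueError("next_unit_percentage must be between 0 and 100")
    return (100 - next_unit_percentage) * tpf_seconds


# A 60-second TPF with the default of 99 leaves about a minute of overlap.
print(lead_time_seconds(60, 99))
# In the first post's log the last 1% ran from 08:31:36 to 08:35:12, i.e. 216 s.
print(lead_time_seconds(216, 99))
# At 100 the next WU is only requested once the current one is done.
print(lead_time_seconds(60, 100))
```

With a short TPF the overlap at 99% is small, while on a slot with a long TPF the next WU's deadline clock starts running well before the slot is free.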