Work Unit Upload Failure

Posted: Thu Mar 19, 2020 8:38 pm
by tau2pi4u
A work unit the CPU on one of my machines completed has been failing to upload. I waited to see if it'd fix itself eventually, but it's been over a day. Other work units have successfully completed and been uploaded since this one finished.

I'm on version 7.5.1 and this machine has an i7 2600K and a GTX 1060 6GB.

The work unit information is as follows:

PRCG 11758 (0, 2069, 0)
Slot ID: 1
Work ID: 02
Status: Send
Progress: 100%
FahCore 0x22
Waiting on: Send results
Attempts: 29
Assigned 2020-03-17T17:12:45Z
Timeout 2020-03-18T17:12:45Z
Expiration 2020-03-25T22:00:44Z
Work Server: 155.247.164.213
Collection Server: 155.247.164.214

This should be the relevant part of the log. The full log is too long to post.

21:14:21:WU00:FS00:0xa7:*********************** Log Started 2020-03-17T21:14:21Z ***********************
21:14:21:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
21:14:21:WU00:FS00:0xa7:       Type: 0xa7
21:14:21:WU00:FS00:0xa7:       Core: Gromacs
21:14:21:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 14416 -checkpoint 15 -np
21:14:21:WU00:FS00:0xa7:             7
21:14:21:WU00:FS00:0xa7:************************************ CBang *************************************
21:14:21:WU00:FS00:0xa7:       Date: Oct 26 2019
21:14:21:WU00:FS00:0xa7:       Time: 01:38:25
21:14:21:WU00:FS00:0xa7:   Revision: c46a1a011a24143739ac7218c5a435f66777f62f
21:14:21:WU00:FS00:0xa7:     Branch: master
21:14:21:WU00:FS00:0xa7:   Compiler: Visual C++ 2008
21:14:21:WU00:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
21:14:21:WU00:FS00:0xa7:   Platform: win32 10
21:14:21:WU00:FS00:0xa7:       Bits: 64
21:14:21:WU00:FS00:0xa7:       Mode: Release
21:14:21:WU00:FS00:0xa7:************************************ System ************************************
21:14:21:WU00:FS00:0xa7:        CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
21:14:21:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
21:14:21:WU00:FS00:0xa7:       CPUs: 8
21:14:21:WU00:FS00:0xa7:     Memory: 15.97GiB
21:14:21:WU00:FS00:0xa7:Free Memory: 10.17GiB
21:14:21:WU00:FS00:0xa7:    Threads: WINDOWS_THREADS
21:14:21:WU00:FS00:0xa7: OS Version: 6.2
21:14:21:WU00:FS00:0xa7:Has Battery: false
21:14:21:WU00:FS00:0xa7: On Battery: false
21:14:21:WU00:FS00:0xa7: UTC Offset: 0
21:14:21:WU00:FS00:0xa7:        PID: 13208
21:14:21:WU00:FS00:0xa7:        CWD: C:\Users\Will\AppData\Roaming\FAHClient\work
21:14:21:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
21:14:21:WU00:FS00:0xa7:    Version: 0.0.18
21:14:21:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:14:21:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
21:14:21:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
21:14:21:WU00:FS00:0xa7:       Date: Oct 26 2019
21:14:21:WU00:FS00:0xa7:       Time: 01:52:30
21:14:21:WU00:FS00:0xa7:   Revision: c1e3513b1bc0c16013668f2173ee969e5995b38e
21:14:21:WU00:FS00:0xa7:     Branch: master
21:14:21:WU00:FS00:0xa7:   Compiler: Visual C++ 2008
21:14:21:WU00:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
21:14:21:WU00:FS00:0xa7:   Platform: win32 10
21:14:21:WU00:FS00:0xa7:       Bits: 64
21:14:21:WU00:FS00:0xa7:       Mode: Release
21:14:21:WU00:FS00:0xa7:************************************ Build *************************************
21:14:21:WU00:FS00:0xa7:       SIMD: avx_256
21:14:21:WU00:FS00:0xa7:********************************************************************************
21:14:21:WU00:FS00:0xa7:Project: 14328 (Run 4, Clone 5506, Gen 6)
21:14:21:WU00:FS00:0xa7:Unit: 0x000000079bf7a4d65e6d0fdd6b722c0a
21:14:21:WU00:FS00:0xa7:Reading tar file core.xml
21:14:21:WU00:FS00:0xa7:Reading tar file frame6.tpr
21:14:21:WU00:FS00:0xa7:Digital signatures verified
21:14:21:WU00:FS00:0xa7:Reducing thread count from 7 to 6 to avoid domain decomposition by a prime number > 3
21:14:21:WU00:FS00:0xa7:Calling: mdrun -s frame6.tpr -o frame6.trr -cpt 15 -nt 6
21:14:22:WU00:FS00:0xa7:Steps: first=1500000 total=250000
[deleted to reduce length]
22:46:44:WU02:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
22:46:55:WU02:FS01:0x22:Saving result file ..\logfile_01.txt
22:46:55:WU02:FS01:0x22:Saving result file checkpointState.xml
22:46:56:WU02:FS01:0x22:Saving result file checkpt.crc
22:46:56:WU02:FS01:0x22:Saving result file positions.xtc
22:46:56:WU02:FS01:0x22:Saving result file science.log
22:46:56:WU02:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
22:46:57:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
22:46:57:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:11758 run:0 clone:2069 gen:0 core:0x22 unit:0x000000029bf7a4d55e6d7714cf5c1f2e
22:46:57:WU02:FS01:Uploading 55.24MiB to 155.247.164.213
22:46:57:WU02:FS01:Connecting to 155.247.164.213:8080
22:46:58:WARNING:WU02:FS01:Exception: Failed to send results to work server: Transfer failed
22:46:58:WU02:FS01:Trying to send results to collection server
22:46:58:WU02:FS01:Uploading 55.24MiB to 155.247.164.214
22:46:58:WU02:FS01:Connecting to 155.247.164.214:8080
22:46:58:ERROR:WU02:FS01:Exception: Transfer failed
22:46:58:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:11758 run:0 clone:2069 gen:0 core:0x22 unit:0x000000029bf7a4d55e6d7714cf5c1f2e
22:46:58:WU02:FS01:Uploading 55.24MiB to 155.247.164.213
22:46:58:WU02:FS01:Connecting to 155.247.164.213:8080
22:46:59:WARNING:WU02:FS01:Exception: Failed to send results to work server: Transfer failed

It's been continually failing in the same way. This is from today (2020-03-19):

20:03:54:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:11758 run:0 clone:2069 gen:0 core:0x22 unit:0x000000029bf7a4d55e6d7714cf5c1f2e
20:03:54:WU02:FS01:Uploading 55.24MiB to 155.247.164.213
20:03:54:WU02:FS01:Connecting to 155.247.164.213:8080
20:03:55:WARNING:WU02:FS01:Exception: Failed to send results to work server: Transfer failed
20:03:55:WU02:FS01:Trying to send results to collection server
20:03:55:WU02:FS01:Uploading 55.24MiB to 155.247.164.214
20:03:55:WU02:FS01:Connecting to 155.247.164.214:8080
20:03:55:ERROR:WU02:FS01:Exception: Transfer failed

If you do need the full log, I have it saved, but I'd need to split it over multiple posts because it's ~3.5x the character limit.
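
For anyone who wants to check how often this is happening without reading a whole log, here is a rough Python sketch that counts "Transfer failed" exceptions per work unit and server. It is not an official FAHClient tool: the function name `count_failed_transfers` and both regular expressions are my own assumptions, based only on the line formats visible in the excerpts above, so other client versions may need different patterns.

```python
import re
from collections import Counter

# Assumed format, taken from the excerpts above, e.g.
#   20:03:54:WU02:FS01:Connecting to 155.247.164.213:8080
CONNECT_RE = re.compile(
    r"^\d\d:\d\d:\d\d:(WU\d+):FS\d+:Connecting to ([\d.]+):\d+$"
)
# Assumed format for failures, e.g.
#   20:03:55:ERROR:WU02:FS01:Exception: Transfer failed
FAIL_RE = re.compile(
    r"^\d\d:\d\d:\d\d:(?:WARNING|ERROR):(WU\d+):FS\d+:Exception: .*Transfer failed$"
)


def count_failed_transfers(lines):
    """Count 'Transfer failed' exceptions per (work unit, server address).

    A failure is attributed to the server that the same work unit most
    recently logged a 'Connecting to' line for.
    """
    failures = Counter()
    last_server = {}  # work unit -> address it last tried to connect to
    for raw in lines:
        line = raw.strip()
        connect = CONNECT_RE.match(line)
        if connect:
            last_server[connect.group(1)] = connect.group(2)
            continue
        failed = FAIL_RE.match(line)
        if failed:
            wu = failed.group(1)
            failures[(wu, last_server.get(wu, "unknown"))] += 1
    return failures


if __name__ == "__main__":
    # Lines copied from the 2020-03-19 excerpt above.
    sample = [
        "20:03:54:WU02:FS01:Uploading 55.24MiB to 155.247.164.213",
        "20:03:54:WU02:FS01:Connecting to 155.247.164.213:8080",
        "20:03:55:WARNING:WU02:FS01:Exception: Failed to send results to work server: Transfer failed",
        "20:03:55:WU02:FS01:Trying to send results to collection server",
        "20:03:55:WU02:FS01:Uploading 55.24MiB to 155.247.164.214",
        "20:03:55:WU02:FS01:Connecting to 155.247.164.214:8080",
        "20:03:55:ERROR:WU02:FS01:Exception: Transfer failed",
    ]
    for (wu, server), n in sorted(count_failed_transfers(sample).items()):
        print(f"{wu} -> {server}: {n} failed transfer(s)")
```

To run it over a saved log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.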

Re: Work Unit Upload Failure

Posted: Thu Mar 19, 2020 9:14 pm
by Jesse_V
Yeah, I'm guessing that's due to the flood of new users and the high demand on the servers at the moment. The developers and research teams are currently focused on getting the work servers back up and capable of meeting demand. I expect that this will help sort out these issues with uploading workunits.

Re: Work Unit Upload Failure

Posted: Thu Mar 19, 2020 9:33 pm
by Joe_H
There is also a known issue with these servers and several projects hosted there. They are aware of it and looking into it.

Re: Work Unit Upload Failure

Posted: Fri Mar 20, 2020 2:12 pm
by tau2pi4u
Thanks for the help! If it's already known then I'll just leave the machine running, which I was planning on doing anyway. Good job by all the devs/researchers to scale up and deal with all of this - it's gotta be a lot of work.