I'm new to Folding@home, so I apologize if the solution here is obvious/trivial.
For roughly the past 15 hours, I have had 2 FAULTY WUs that are failing to upload.
All NO_ERROR WUs continue to upload just fine (with the occasional hiccup, of course, due to the unusually high volume). I have uploaded about 20 NO_ERROR WUs today, so only these 2 FAULTY WUs seem to be affected.
I would like to be able to somehow drop these FAULTY WUs (with or without credit) so that somebody else can start working on them. Right now, I appear to just be holding them (until they time out, I suppose).
FAULTY WUs that won't upload:
11758 (0, 3765, 0)
11759 (0, 10513, 1)
I have checked the WU Status for each of the units above, and it looks like nothing has made it back to the collection servers.
A Project search for 11758 says Project Unspecified.
A Project search for 11759 leads to a valid covid project.
In most instances, the upload fails immediately:
Code:
18:37:33:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:11758 run:0 clone:3765 gen:0 core:0x22 unit:0x000000009bf7a4d55e6d771ae2d300da
18:37:33:WU00:FS00:Uploading 53.95MiB to 155.247.164.213
18:37:33:WU00:FS00:Connecting to 155.247.164.213:8080
18:37:33:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
18:37:33:WU00:FS00:Trying to send results to collection server
18:37:33:WU00:FS00:Uploading 53.95MiB to 155.247.164.214
18:37:33:WU00:FS00:Connecting to 155.247.164.214:8080
18:37:33:ERROR:WU00:FS00:Exception: Transfer failed
Code:
18:17:11:WARNING:WU02:FS01:Exception: Failed to send results to work server: Failed to connect to 128.252.203.10:80: Connection timed out
18:17:11:WU02:FS01:Trying to send results to collection server
18:17:11:WU02:FS01:Uploading 154.15MiB to 155.247.166.220
18:17:11:WU02:FS01:Connecting to 155.247.166.220:8080
18:17:13:ERROR:WU02:FS01:Exception: Transfer failed
18:17:13:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:11759 run:0 clone:10513 gen:1 core:0x22 unit:0x0000000180fccb0a5e6eb0329c88c5ba
18:17:13:WU02:FS01:Uploading 154.15MiB to 128.252.203.10
18:17:13:WU02:FS01:Connecting to 128.252.203.10:8080
18:17:52:WU02:FS01:Upload 0.04%
18:18:34:WU02:FS01:Upload 0.20%
18:18:35:WARNING:WU02:FS01:Exception: Failed to send results to work server: Transfer failed
18:18:35:WU02:FS01:Trying to send results to collection server
18:18:35:WU02:FS01:Uploading 154.15MiB to 155.247.166.220
18:18:35:WU02:FS01:Connecting to 155.247.166.220:8080
18:18:35:ERROR:WU02:FS01:Exception: Transfer failed
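In case it helps anyone triage a longer log, here is a rough Python sketch for listing which units keep being re-sent with a FAULTY status. It only assumes the "Sending unit results" line format shown in my logs above; the function name `find_stuck_units` and the regex are my own and are not part of FAHClient.

```python
import re

# Matches the "Sending unit results" lines as they appear in the logs above.
# The line layout is assumed from those samples, not from any official spec.
SEND_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WU(?P<wu>\d+):FS(?P<fs>\d+):"
    r"Sending unit results: id:\d+ state:(?P<state>\S+) error:(?P<error>\S+) "
    r"project:(?P<project>\d+) run:(?P<run>\d+) "
    r"clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)


def find_stuck_units(log_text, error="FAULTY"):
    """Return the set of (project, run, clone, gen) tuples sent with `error`."""
    units = set()
    for line in log_text.splitlines():
        match = SEND_RE.match(line.strip())
        if match and match.group("error") == error:
            units.add(
                tuple(int(match.group(k)) for k in ("project", "run", "clone", "gen"))
            )
    return units


if __name__ == "__main__":
    sample = (
        "18:37:33:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY "
        "project:11758 run:0 clone:3765 gen:0 core:0x22 "
        "unit:0x000000009bf7a4d55e6d771ae2d300da\n"
        "18:37:33:WU00:FS00:Uploading 53.95MiB to 155.247.164.213"
    )
    print(find_stuck_units(sample))
```

Run against the full client log, this should collapse the repeated retry attempts into one entry per stuck unit.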
I have paused/restarted folding and restarted the FAHClient service on this node a few times, but to no avail.
Any advice would be greatly appreciated. Is the best course of action to just hold these failed WUs until they time out (or hopefully upload)?
Thanks in advance!
---------------------------------------
Additional logs regarding actual computation failure follow, in case anyone is interested:
Failure Log for 11758 (0, 3765, 0)
Code:
Project: 11758 (Run 0, Clone 3765, Gen 0)
Unit: 0x000000009bf7a4d55e6d771ae2d300da
Digital signatures verified
Folding@home GPU Core22 Folding@home Core
Version 0.0.2
Found a checkpoint file
Completed 650000 out of 1000000 steps (65%)
Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
Completed 660000 out of 1000000 steps (66%)
Completed 670000 out of 1000000 steps (67%)
Completed 680000 out of 1000000 steps (68%)
Completed 690000 out of 1000000 steps (69%)
Completed 700000 out of 1000000 steps (70%)
Completed 710000 out of 1000000 steps (71%)
Completed 720000 out of 1000000 steps (72%)
Completed 730000 out of 1000000 steps (73%)
Completed 740000 out of 1000000 steps (74%)
Completed 750000 out of 1000000 steps (75%)
Completed 760000 out of 1000000 steps (76%)
Completed 770000 out of 1000000 steps (77%)
Completed 780000 out of 1000000 steps (78%)
Completed 790000 out of 1000000 steps (79%)
Completed 800000 out of 1000000 steps (80%)
Completed 810000 out of 1000000 steps (81%)
Completed 820000 out of 1000000 steps (82%)
Completed 830000 out of 1000000 steps (83%)
Completed 840000 out of 1000000 steps (84%)
Completed 850000 out of 1000000 steps (85%)
Completed 860000 out of 1000000 steps (86%)
Completed 870000 out of 1000000 steps (87%)
Completed 880000 out of 1000000 steps (88%)
Completed 890000 out of 1000000 steps (89%)
Completed 900000 out of 1000000 steps (90%)
Completed 910000 out of 1000000 steps (91%)
Completed 920000 out of 1000000 steps (92%)
Completed 930000 out of 1000000 steps (93%)
Completed 940000 out of 1000000 steps (94%)
Completed 950000 out of 1000000 steps (95%)
Caught signal SIGABRT(6) on PID 13851
WARNING:Unexpected exit from science code
Saving result file ../logfile_01.txt
Saving result file checkpointState.xml
Saving result file checkpt.crc
Saving result file positions.xtc
Saving result file science.log
Folding@home Core Shutdown: BAD_WORK_UNIT
Failure Log for 11759 (0, 10513, 1)
Code:
Unit: 0x0000000180fccb0a5e6eb0329c88c5ba
Reading tar file core.xml
Reading tar file integrator.xml
Reading tar file state.xml
Reading tar file system.xml
Digital signatures verified
Folding@home GPU Core22 Folding@home Core
Version 0.0.2
Completed 0 out of 1000000 steps (0%)
Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
Completed 10000 out of 1000000 steps (1%)
Completed 20000 out of 1000000 steps (2%)
Completed 30000 out of 1000000 steps (3%)
Completed 40000 out of 1000000 steps (4%)
Completed 50000 out of 1000000 steps (5%)
Completed 60000 out of 1000000 steps (6%)
Completed 70000 out of 1000000 steps (7%)
Completed 80000 out of 1000000 steps (8%)
Completed 90000 out of 1000000 steps (9%)
Completed 100000 out of 1000000 steps (10%)
Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
Following exception occured: Force RMSE error of 5.23885 with threshold of 5
Completed 60000 out of 1000000 steps (6%)
Completed 70000 out of 1000000 steps (7%)
Completed 80000 out of 1000000 steps (8%)
Completed 90000 out of 1000000 steps (9%)
Completed 100000 out of 1000000 steps (10%)
Completed 110000 out of 1000000 steps (11%)
Completed 120000 out of 1000000 steps (12%)
Completed 130000 out of 1000000 steps (13%)
Completed 140000 out of 1000000 steps (14%)
Completed 150000 out of 1000000 steps (15%)
Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
Following exception occured: Force RMSE error of 5.29797 with threshold of 5
Completed 110000 out of 1000000 steps (11%)
Completed 120000 out of 1000000 steps (12%)
Completed 130000 out of 1000000 steps (13%)
Completed 140000 out of 1000000 steps (14%)
Completed 150000 out of 1000000 steps (15%)
Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
Following exception occured: Force RMSE error of 5.31428 with threshold of 5
ERROR:114: Max Retries Reached
Saving result file ../logfile_01.txt
Saving result file badstate-0.xml
Saving result file badstate-1.xml
Saving result file badstate-2.xml
Saving result file checkpointState.xml
Saving result file checkpt.crc
Saving result file positions.xtc
Saving result file science.log
Folding@home Core Shutdown: BAD_WORK_UNIT
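For anyone comparing their own failures against this one, a small Python sketch that pulls the "Force RMSE error" values out of a core log like the one for 11759 above. It assumes only the message wording shown in that log; `force_rmse_events` is a name I made up for illustration.

```python
import re

# Matches lines such as:
#   "Following exception occured: Force RMSE error of 5.23885 with threshold of 5"
# The wording is taken from the log above and may differ in other core versions.
RMSE_RE = re.compile(
    r"Force RMSE error of (?P<err>\d+(?:\.\d+)?) "
    r"with threshold of (?P<thr>\d+(?:\.\d+)?)"
)


def force_rmse_events(core_log):
    """Return a list of (error, threshold) float pairs, in log order."""
    return [
        (float(m.group("err")), float(m.group("thr")))
        for m in RMSE_RE.finditer(core_log)
    ]


if __name__ == "__main__":
    sample = (
        "Bad State detected... attempting to resume from last good checkpoint.\n"
        "Following exception occured: Force RMSE error of 5.23885 with threshold of 5\n"
    )
    for err, thr in force_rmse_events(sample):
        # Report how far each bad state exceeded the threshold.
        print(f"error={err} threshold={thr} excess={err - thr:.5f}")
```

In my 11759 log, all three bad states exceed the threshold of 5 by only a small margin (5.24 to 5.31), which is the kind of pattern this makes easy to spot.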