Question about AMD GPU and checkpoints
Posted: Wed Apr 29, 2020 12:22 pm
Has anyone recovered from an AMD GPU checkpoint AND successfully submitted the result?
I can pause or exit FAH and then restart it. The log file shows the GPU falling back to the checkpoint, and the WU then continues to 100%. However, the core fails at the very end and the result is not sent:
Code:
20:15:12:WU00:FS00:0x22:Completed 2000000 out of 2000000 steps (100%)
20:15:14:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
20:15:14:WU00:FS00:0x22:Saving result file checkpointState.xml
20:15:14:WU00:FS00:0x22:Saving result file checkpt.crc
20:15:14:WU00:FS00:0x22:Saving result file positions.xtc
20:15:14:WU00:FS00:0x22:Saving result file science.log
20:15:14:WU00:FS00:0x22:Folding@home Core Shutdown: FINISHED_UNIT
20:15:16:WARNING:WU00:FS00:FahCore returned an unknown error code which probably indicates that it crashed
20:15:16:WARNING:WU00:FS00:FahCore returned: UNKNOWN_ENUM (-1073740940 = 0xc0000374)
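For anyone comparing their own logs: the negative number in that last line is just the signed 32-bit view of the hex value next to it. Below is a minimal Python sketch for pulling the return code out of a "FahCore returned:" line and showing both forms. The helper names (`to_unsigned32`, `parse_return`) are my own and not part of FAH, and the regex only assumes the line format seen in the logs in this post. As far as I know, 0xC0000374 is the Windows NTSTATUS value STATUS_HEAP_CORRUPTION, assuming the client is reporting the raw process exit code.

```python
import re

# Matches lines like: "FahCore returned: UNKNOWN_ENUM (-1073740940 = 0xc0000374)"
# The format is taken from the log excerpts in this post (an assumption for other versions).
RETURN_RE = re.compile(r"FahCore returned: (\w+) \((-?\d+) = 0x([0-9a-fA-F]+)\)")


def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def parse_return(line):
    """Return (name, decimal_code, hex_code) for a 'FahCore returned:' line, else None."""
    match = RETURN_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3), 16)


if __name__ == "__main__":
    failed = ("20:15:16:WARNING:WU00:FS00:FahCore returned: "
              "UNKNOWN_ENUM (-1073740940 = 0xc0000374)")
    name, dec, hx = parse_return(failed)
    # Print the name, the signed code, and its unsigned hex form side by side.
    print(name, dec, hex(to_unsigned32(dec)))
```

A line ending in FINISHED_UNIT (100 = 0x64) is the normal case; anything else at the end of a WU is worth searching for by its hex value.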
If I never pause or stop, so that no checkpoint is used, things work fine.
Here is a successful end:
Code:
01:03:48:WU00:FS00:0x22:Completed 990000 out of 1000000 steps (99%)
01:03:48:WU01:FS00:Connecting to 65.254.110.245:80
01:03:48:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:03:48:WU01:FS00:Connecting to 18.218.241.186:80
01:03:49:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:03:49:WU01:FS00:Connecting to 65.254.110.245:80
01:03:49:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:03:49:WU01:FS00:Connecting to 18.218.241.186:80
01:03:49:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:03:49:ERROR:WU01:FS00:Exception: Could not get an assignment
01:03:49:WU01:FS00:Connecting to 65.254.110.245:80
01:03:50:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:03:50:WU01:FS00:Connecting to 18.218.241.186:80
01:03:50:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:03:50:WU01:FS00:Connecting to 65.254.110.245:80
01:03:50:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:03:50:WU01:FS00:Connecting to 18.218.241.186:80
01:03:50:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:03:50:ERROR:WU01:FS00:Exception: Could not get an assignment
01:04:49:WU01:FS00:Connecting to 65.254.110.245:80
01:04:50:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:04:50:WU01:FS00:Connecting to 18.218.241.186:80
01:04:50:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:04:50:WU01:FS00:Connecting to 65.254.110.245:80
01:04:50:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:04:50:WU01:FS00:Connecting to 18.218.241.186:80
01:04:50:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:04:50:ERROR:WU01:FS00:Exception: Could not get an assignment
01:05:40:WU00:FS00:0x22:Completed 1000000 out of 1000000 steps (100%)
01:05:43:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
01:05:43:WU00:FS00:0x22:Saving result file checkpointState.xml
01:05:44:WU00:FS00:0x22:Saving result file checkpt.crc
01:05:44:WU00:FS00:0x22:Saving result file positions.xtc
01:05:44:WU00:FS00:0x22:Saving result file science.log
01:05:44:WU00:FS00:0x22:Folding@home Core Shutdown: FINISHED_UNIT
01:05:44:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
01:05:44:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:11762 run:0 clone:8302 gen:23 core:0x22 unit:0x0000003180fccb0a5e7113ccda4722a1
01:05:44:WU00:FS00:Uploading 33.02MiB to 128.252.203.10
01:05:44:WU00:FS00:Connecting to 128.252.203.10:8080
01:06:00:WU00:FS00:Upload 0.19%
01:06:06:WU00:FS00:Upload 0.38%
01:06:27:WU01:FS00:Connecting to 65.254.110.245:80
01:06:27:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:06:27:WU01:FS00:Connecting to 18.218.241.186:80
01:06:27:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:06:27:WU01:FS00:Connecting to 65.254.110.245:80
01:06:27:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
01:06:27:WU01:FS00:Connecting to 18.218.241.186:80
01:06:28:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:06:28:ERROR:WU01:FS00:Exception: Could not get an assignment
01:07:04:WU00:FS00:Upload 0.95%
01:07:10:WU00:FS00:Upload 11.74%
01:07:16:WU00:FS00:Upload 24.80%
01:07:22:WU00:FS00:Upload 36.91%
01:07:28:WU00:FS00:Upload 49.21%
01:07:34:WU00:FS00:Upload 61.51%
01:07:40:WU00:FS00:Upload 74.01%
01:07:46:WU00:FS00:Upload 86.69%
01:07:52:WU00:FS00:Upload 96.53%
01:07:54:WU00:FS00:Upload complete
01:07:54:WU00:FS00:Server responded WORK_ACK (400)
01:07:54:WU00:FS00:Final credit estimate, 51365.00 points
01:07:54:WU00:FS00:Cleaning up
Some of these jobs are almost 8 hours long. It is frustrating when a WU looks like it has completed but the result is never sent.
So I ask again, does anyone have a log file for an AMD GPU WU that shows recovery from checkpoint AND successfully sending result and getting credit?
Thanks for any information