Perhaps another valuable data point:
Since August 3rd, my RX580 hasn't managed to return ANY assigned WU successfully.
Before that date, it had been processing project 13421 exclusively for days, usually successfully, but with occasional streaks of several faulty WUs in a row, with errors like:
'Force RMSE error of 22.695 with threshold of 5'
'Potential energy error of 46.9965, threshold of 10'
'An exception occurred at step 224142: Particle coordinate is nan'
'NaNs detected in forces. 0 0'
'Discrepancy: Forces are blowing up! 0 0'
Then on August 3rd, coincidentally(?) when project 16600 joined the mix, not a single RX580 WU completed on that rig anymore. The error messages are the same mix as above, but now for every single WU (all of which are from either project 13421 or 16600):
Code:
******************************* Date: 2020-08-03 *******************************
22:09:54:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:7887 clone:21 gen:0 core:0x22 unit:0x0000000112bc7d9a5f26fb4f3a86697f
22:11:20:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:13421 run:8140 clone:12 gen:0 core:0x22 unit:0x0000000012bc7d9a5f284d8be4253e2e
22:11:21:WU00:FS01:Final credit estimate, 12664.00 points
23:07:18:WU02:FS01:0x22:ERROR:Potential energy error of 12.5142, threshold of 10
23:07:18:WU02:FS01:0x22:ERROR:Reference Potential Energy: -56187.4 | Given Potential Energy: -56200
23:07:18:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:7887 clone:21 gen:0 core:0x22 unit:0x0000000112bc7d9a5f26fb4f3a86697f
23:07:19:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13421 run:4773 clone:26 gen:0 core:0x22 unit:0x0000000312bc7d9a5f20bd44a8a60c11
23:07:28:WU00:FS01:0x22:ERROR:Potential energy error of 46.9965, threshold of 10
23:07:28:WU00:FS01:0x22:ERROR:Reference Potential Energy: -57187 | Given Potential Energy: -57234
23:07:28:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:4773 clone:26 gen:0 core:0x22 unit:0x0000000312bc7d9a5f20bd44a8a60c11
23:07:57:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:391 gen:243 core:0x22 unit:0x0000010f8f59f36f5ec36912d651e428
******************************* Date: 2020-08-03 *******************************
23:48:35:WU02:FS01:0x22:An exception occurred at step 84335: Particle coordinate is nan
23:48:35:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
23:50:34:WU02:FS01:0x22:An exception occurred at step 76303: Particle coordinate is nan
23:50:34:WU02:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
23:50:42:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:16600 run:0 clone:391 gen:243 core:0x22 unit:0x0000010f8f59f36f5ec36912d651e428
23:50:42:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13421 run:2387 clone:27 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4e3412434b79
23:50:57:WU00:FS01:0x22:ERROR:NaNs detected in forces. 0 0
23:50:58:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:2387 clone:27 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4e3412434b79
23:51:25:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1826 gen:16 core:0x22 unit:0x000000108f59f36f5ec3691023278959
00:01:39:WU03:FS01:0x22:An exception occurred at step 18573: Particle coordinate is nan
00:01:39:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
01:16:44:WU03:FS01:0x22:An exception occurred at step 159635: Particle coordinate is nan
01:16:44:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
01:52:38:WU03:FS01:0x22:An exception occurred at step 224142: Particle coordinate is nan
01:52:38:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
02:02:57:WU03:FS01:0x22:An exception occurred at step 220126: Particle coordinate is nan
02:02:57:WU03:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
02:03:04:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:16600 run:0 clone:1826 gen:16 core:0x22 unit:0x000000108f59f36f5ec3691023278959
02:03:33:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1154 gen:368 core:0x22 unit:0x000001918f59f36f5ec369111cb9089d
02:22:30:WU01:FS01:0x22:An exception occurred at step 38653: Particle coordinate is nan
02:22:30:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
02:38:02:WU01:FS01:0x22:An exception occurred at step 56725: Particle coordinate is nan
02:38:02:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
03:20:15:WU01:FS01:0x22:An exception occurred at step 139053: Particle coordinate is nan
03:20:15:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
03:24:45:WU01:FS01:0x22:An exception occurred at step 132025: Particle coordinate is nan
03:24:45:WU01:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
03:24:58:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1154 gen:368 core:0x22 unit:0x000001918f59f36f5ec369111cb9089d
03:24:59:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:2959 clone:30 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4f696398bc58
03:25:11:WU02:FS01:0x22:ERROR:NaNs detected in forces. 0 0
03:25:11:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:2959 clone:30 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4f696398bc58
03:25:40:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1724 gen:53 core:0x22 unit:0x000000398f59f36f5ec369105fcac154
04:01:31:WU03:FS01:0x22:An exception occurred at step 73793: Particle coordinate is nan
04:01:31:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:37:15:WU03:FS01:0x22:An exception occurred at step 123742: Particle coordinate is nan
04:37:15:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:49:51:WU03:FS01:0x22:An exception occurred at step 125248: Particle coordinate is nan
04:49:51:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:50:57:WU03:FS01:0x22:An exception occurred at step 125248: Particle coordinate is nan
04:50:57:WU03:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
04:51:05:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:16600 run:0 clone:1724 gen:53 core:0x22 unit:0x000000398f59f36f5ec369105fcac154
04:51:05:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13421 run:4883 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d089067246
04:51:15:WU01:FS01:0x22:ERROR:NaNs detected in forces. 0 0
04:51:16:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:4883 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d089067246
04:51:17:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:4910 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d09c77dd0b
04:51:31:WU02:FS01:0x22:ERROR:NaNs detected in forces. 0 0
04:51:32:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:4910 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d09c77dd0b
******************************* Date: 2020-08-04 *******************************
and so on and so on...
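For anyone who wants to check their own log the same way, here is a rough sketch of how the "Sending unit results" lines could be tallied per project and status. The regex is only based on the line format seen in the excerpt above, and the function name tally_results is made up for illustration; none of this is an official FAH tool.

```python
import re
from collections import Counter

# Matches the "Sending unit results" lines as they appear in the log excerpt above.
# Assumption: other client versions may format these lines differently.
RESULT_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:WU\d+:FS\d+:Sending unit results: "
    r".*?\berror:(?P<status>\w+) project:(?P<project>\d+)"
)


def tally_results(log_lines):
    """Count returned WUs per (project, status) pair."""
    counts = Counter()
    for line in log_lines:
        match = RESULT_RE.match(line.strip())
        if match:
            counts[(match.group("project"), match.group("status"))] += 1
    return counts


# Three lines copied from the log above, used as sample input.
SAMPLE = """\
22:11:20:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:13421 run:8140 clone:12 gen:0 core:0x22 unit:0x0000000012bc7d9a5f284d8be4253e2e
23:07:18:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:7887 clone:21 gen:0 core:0x22 unit:0x0000000112bc7d9a5f26fb4f3a86697f
23:50:42:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:16600 run:0 clone:391 gen:243 core:0x22 unit:0x0000010f8f59f36f5ec36912d651e428
""".splitlines()

if __name__ == "__main__":
    for (project, status), n in sorted(tally_results(SAMPLE).items()):
        print(f"project {project}: {status} x{n}")
```

To run it against a real log, pass the lines of the client's log file instead of SAMPLE. On my rig that would presumably be a log.txt under the CWD shown in the system info (/var/lib/fahclient), but treat that path as an assumption.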
For reference, here is the system info after a restart, which also shows that this is still going on:
Code:
******************************* Date: 2020-08-12 *******************************
16:32:45:Read GPUs.txt
16:32:45:Enabled folding slot 00: READY cpu:6
16:32:45:Enabled folding slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580/590]
16:32:45:****************************** FAHClient ******************************
16:32:45: Version: 7.6.13
16:32:45:******************************* System ********************************
16:32:45: CPU: AMD FX(tm)-8150 Eight-Core Processor
16:32:45: CPU ID: AuthenticAMD Family 21 Model 1 Stepping 2
16:32:45: CPUs: 8
16:32:45: Memory: 11.68GiB
16:32:45: Free Memory: 9.46GiB
16:32:45: Threads: POSIX_THREADS
16:32:45: OS Version: 5.4
16:32:45: Has Battery: false
16:32:45: On Battery: false
16:32:45: UTC Offset: -4
16:32:45: PID: 16712
16:32:45: CWD: /var/lib/fahclient
16:32:45: OS: Linux 5.4.0-42-generic x86_64
16:32:45: OS Arch: AMD64
16:32:45: GPUs: 1
16:32:45: GPU 0: Bus:1 Slot:0 Func:0 AMD:5 Ellesmere XT [Radeon RX
16:32:45: 470/480/570/580/590]
16:32:45: CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
16:32:45: libcuda.so: cannot open shared object file: No such file or
16:32:45: directory
16:32:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3075.10
16:32:49:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13421 run:3286 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd63082338
16:32:57:WU01:FS01:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0
16:32:58:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3286 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd63082338
16:32:59:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:3286 clone:11 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd9e814fe6
16:33:07:WU02:FS01:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0
16:33:07:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3286 clone:11 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd9e814fe6
16:33:33:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
17:00:14:WU01:FS01:0x22:An exception occurred at step 56223: Particle coordinate is nan
17:00:14:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:16:16:WU01:FS01:0x22:An exception occurred at step 81323: Particle coordinate is nan
17:16:16:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:36:23:WU01:FS01:0x22:An exception occurred at step 115710: Particle coordinate is nan
17:36:23:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:39:22:WU01:FS01:0x22:An exception occurred at step 103411: Particle coordinate is nan
17:39:22:WU01:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
17:39:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
17:39:30:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:3240 clone:69 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d52793d8a1
17:39:45:WU02:FS01:0x22:ERROR:NaNs detected in forces. 0 0
17:39:46:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3240 clone:69 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d52793d8a1
17:39:47:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:13421 run:3240 clone:83 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d5e5dd3bdc
17:39:56:WU03:FS01:0x22:ERROR:NaNs detected in forces. 0 0
17:39:56:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:13421 run:3240 clone:83 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d5e5dd3bdc
17:40:25:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:692 gen:280 core:0x22 unit:0x000001448f59f36f5ec36911ee8b859f
17:46:13:WU02:FS01:0x22:An exception occurred at step 10290: Particle coordinate is nan
17:46:13:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
18:09:41:WU02:FS01:0x22:An exception occurred at step 47438: Particle coordinate is nan
18:09:41:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
18:33:12:WU02:FS01:0x22:An exception occurred at step 73291: Particle coordinate is nan
18:33:12:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
18:40:37:WU02:FS01:0x22:An exception occurred at step 62749: Particle coordinate is nan
18:40:37:WU02:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
18:40:44:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:16600 run:0 clone:692 gen:280 core:0x22 unit:0x000001448f59f36f5ec36911ee8b859f
18:40:45:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13421 run:3200 clone:39 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d518f7b783
18:40:56:WU01:FS01:0x22:ERROR:NaNs detected in forces. 0 0
18:40:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3200 clone:39 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d518f7b783
18:41:24:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:112 gen:402 core:0x22 unit:0x000001bb8f59f36f5ec36912518a1dea
19:26:39:WU03:FS01:0x22:An exception occurred at step 93622: Particle coordinate is nan
19:26:39:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
No hardware or software changes; the rig is active 24/7, and a restart didn't change the behavior.
Any advice? Is there anything a folder can do to get this GPU doing something productive, or should I just disable the GPU for a couple of days and see if things get fixed?
(It still burns an extra 150 W, but is that worth it just to hand in a "FAULTY" WU every 30 minutes, when other GPUs, according to muziqaz, have no problem returning these WUs properly?)