16600 consistently crashing on AMD Radeon VII

Moderators: Site Moderators, FAHC Science Team

bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Post by bruce »

ViTe wrote:... and you never see 16600 again
And an unknown list of future projects that might be distributed from that server.

The other thing that's probably worth investigating is alternate drivers. AMD has not been particularly good at providing reliable drivers for many of their past GPUs.
ViTe
Posts: 20
Joined: Tue Feb 14, 2012 2:22 am

Re: 16600 consistently crashing on AMD Radeon VII

Post by ViTe »

bruce wrote:
ViTe wrote:... and you never see 16600 again
And an unknown list of future projects that might be distributed from that server..
C'mon :e) , I hope there are no kids here and everybody understands the prolonged effect of such a ban. Just remove it after a month or so. It's not the first time we've gotten projects with problems. I always do it this way.
bruce wrote:The other thing that's probably worth investigating is alternate drivers. AMD has not been particularly good at providing reliable drivers for many of their past GPUs.
As far as I can see from this discussion, we have already checked a few driver versions and a few generations of cards and got the same result. It makes no sense to waste more time playing with it. Nvidia cards appear to have no issues, so let them do this job while we do something else.
Nuitari
Posts: 78
Joined: Sun Jun 09, 2019 4:03 am
Hardware configuration: 1x Nvidia 1050ti
1x Nvidia 1660Super
1x Nvidia GTX 660
1x Nvidia 1060 3gb
1x AMD rx570
2x AMD rx560
1x AMD Ryzen 7 PRO 1700
1x AMD Ryzen 7 3700X
1x AMD Phenom II
1x AMD A8-9600
1x Intel i5-4590S

Re: 16600 consistently crashing on AMD Radeon VII

Post by Nuitari »

The project owner could easily reconfigure things to block AMD GPUs from it. That still doesn't fix the fact that something is broken in either the core or the project.
muziqaz
Posts: 942
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: 16600 consistently crashing on AMD Radeon VII

Post by muziqaz »

Nah, the project owner needs to reply to my request first, and then we can bring that project back to the testing stage to see what is going on. Unfortunately the owner is missing in action for the moment :)
No one is gonna ban anything without an investigation ;) The project ran fine with older fahcore versions, and since the initial testing we have had 7 new revisions of the fahcore.
We have a lot of things going on at the same time which might improve the constraint system. Might.
FAH Omega tester
ViTe
Posts: 20
Joined: Tue Feb 14, 2012 2:22 am

Re: 16600 consistently crashing on AMD Radeon VII

Post by ViTe »

So the project owner didn't reply to you in one week? Oh, what a rush... ))

Guys, I'd say it's not smart to waste machine time like this. If you see a 70-90% failure rate on AMD and no failures on Nvidia, then it's obvious that this job is for Nvidia cards and AMD cards shouldn't receive it. Testing, investigating and other internal kitchen stuff is for the testers, not for our eyes; it's not our business. We do calculations, and we'd like to get math that we can actually do. We found a problem, we reported it, that's it. That is enough to make a quick decision to shift the calculations to the platform that is able to do them successfully.
muziqaz
Posts: 942
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London
Contact:

Re: 16600 consistently crashing on AMD Radeon VII

Post by muziqaz »

First off, there is no 70-90% failure rate on AMD, period. We have a couple of people reporting this on a couple of models. For certain people it is a high failure rate, but not for all. The OP had issues with 16448, yet we ran this project on 3 different AMD cards (including the one the OP has) and we did not encounter a single failure.
Second of all, when a WU fails, the client sends it back to the collection server with partial credit given. That WU is straight away put back into the queue for another folder. If a WU fails a certain number of times, it will be blacklisted for further inspection.
Yes, it is unfortunate that the project owner is on holiday (you know universities have holidays during the summer) or maybe is tied up with other stuff (some of the scientists are in the season of defending their theses), but it is not the end of the world, so calm your horses, guys; everything will be OK.
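To illustrate the fail/re-queue/blacklist flow just described, here is a minimal Python sketch; the threshold and credit values are hypothetical, not the actual work-server settings:

Code: Select all

# Hypothetical sketch of the behaviour described above: a failed WU earns
# partial credit, goes back into the queue for another folder, and is
# blacklisted for inspection after too many failures. Values are made up.
from collections import defaultdict

MAX_FAILURES = 5         # hypothetical blacklist threshold
PARTIAL_CREDIT = 0.25    # hypothetical fraction of base credit for a FAULTY return

failures = defaultdict(int)   # WU id -> number of FAULTY returns seen
blacklist = set()             # WUs pulled aside for further inspection
queue = []                    # WUs waiting to be reassigned

def handle_return(wu_id, state, base_credit):
    """Process one returned WU: credit the donor, then re-queue or blacklist."""
    if state == "FAULTY":
        failures[wu_id] += 1
        if failures[wu_id] >= MAX_FAILURES:
            blacklist.add(wu_id)      # too many failures: hold for inspection
        else:
            queue.append(wu_id)       # straight back into the queue
        return base_credit * PARTIAL_CREDIT
    return base_credit                # completed normally, full credit

# Example: the same WU failing repeatedly until it is blacklisted.
for _ in range(5):
    handle_return("p16600_r0_c391_g243", "FAULTY", 12664)
print(sorted(blacklist))   # ['p16600_r0_c391_g243']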
FAH Omega tester
UofM.MartinK
Posts: 59
Joined: Tue Apr 07, 2020 8:53 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by UofM.MartinK »

Perhaps another valuable data point:

Since August 3rd, my RX580 didn't manage to return ANY assigned WU successfully.
Before that date, it was already processing project 13421 exclusively for days. Usually successfully, with some odd streaks of faulting several WUs in a row with things like:

'Force RMSE error of 22.695 with threshold of 5'
'Potential energy error of 46.9965, threshold of 10'
'An exception occurred at step 224142: Particle coordinate is nan'
'NaNs detected in forces. 0 0'
'Discrepancy: Forces are blowing up! 0 0'

And on August 3rd, coincidentally(?) when project 16600 joined the mix, not a single RX580 WU completed on that rig anymore: the same mix of error messages as above, but now for every single WU (all of which happen to be from either project 13421 or 16600).

Code: Select all

******************************* Date: 2020-08-03 *******************************
22:09:54:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:7887 clone:21 gen:0 core:0x22 unit:0x0000000112bc7d9a5f26fb4f3a86697f
22:11:20:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:13421 run:8140 clone:12 gen:0 core:0x22 unit:0x0000000012bc7d9a5f284d8be4253e2e
22:11:21:WU00:FS01:Final credit estimate, 12664.00 points
23:07:18:WU02:FS01:0x22:ERROR:Potential energy error of 12.5142, threshold of 10
23:07:18:WU02:FS01:0x22:ERROR:Reference Potential Energy: -56187.4 | Given Potential Energy: -56200
23:07:18:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:7887 clone:21 gen:0 core:0x22 unit:0x0000000112bc7d9a5f26fb4f3a86697f
23:07:19:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13421 run:4773 clone:26 gen:0 core:0x22 unit:0x0000000312bc7d9a5f20bd44a8a60c11
23:07:28:WU00:FS01:0x22:ERROR:Potential energy error of 46.9965, threshold of 10
23:07:28:WU00:FS01:0x22:ERROR:Reference Potential Energy: -57187 | Given Potential Energy: -57234
23:07:28:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:4773 clone:26 gen:0 core:0x22 unit:0x0000000312bc7d9a5f20bd44a8a60c11
23:07:57:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:391 gen:243 core:0x22 unit:0x0000010f8f59f36f5ec36912d651e428
******************************* Date: 2020-08-03 *******************************
23:48:35:WU02:FS01:0x22:An exception occurred at step 84335: Particle coordinate is nan
23:48:35:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
23:50:34:WU02:FS01:0x22:An exception occurred at step 76303: Particle coordinate is nan
23:50:34:WU02:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
23:50:42:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:16600 run:0 clone:391 gen:243 core:0x22 unit:0x0000010f8f59f36f5ec36912d651e428
23:50:42:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13421 run:2387 clone:27 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4e3412434b79
23:50:57:WU00:FS01:0x22:ERROR:NaNs detected in forces. 0 0
23:50:58:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:2387 clone:27 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4e3412434b79
23:51:25:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1826 gen:16 core:0x22 unit:0x000000108f59f36f5ec3691023278959
00:01:39:WU03:FS01:0x22:An exception occurred at step 18573: Particle coordinate is nan
00:01:39:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
01:16:44:WU03:FS01:0x22:An exception occurred at step 159635: Particle coordinate is nan
01:16:44:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
01:52:38:WU03:FS01:0x22:An exception occurred at step 224142: Particle coordinate is nan
01:52:38:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
02:02:57:WU03:FS01:0x22:An exception occurred at step 220126: Particle coordinate is nan
02:02:57:WU03:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
02:03:04:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:16600 run:0 clone:1826 gen:16 core:0x22 unit:0x000000108f59f36f5ec3691023278959
02:03:33:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1154 gen:368 core:0x22 unit:0x000001918f59f36f5ec369111cb9089d
02:22:30:WU01:FS01:0x22:An exception occurred at step 38653: Particle coordinate is nan
02:22:30:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
02:38:02:WU01:FS01:0x22:An exception occurred at step 56725: Particle coordinate is nan
02:38:02:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
03:20:15:WU01:FS01:0x22:An exception occurred at step 139053: Particle coordinate is nan
03:20:15:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
03:24:45:WU01:FS01:0x22:An exception occurred at step 132025: Particle coordinate is nan
03:24:45:WU01:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
03:24:58:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1154 gen:368 core:0x22 unit:0x000001918f59f36f5ec369111cb9089d
03:24:59:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:2959 clone:30 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4f696398bc58
03:25:11:WU02:FS01:0x22:ERROR:NaNs detected in forces. 0 0
03:25:11:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:2959 clone:30 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4f696398bc58
03:25:40:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1724 gen:53 core:0x22 unit:0x000000398f59f36f5ec369105fcac154
04:01:31:WU03:FS01:0x22:An exception occurred at step 73793: Particle coordinate is nan
04:01:31:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:37:15:WU03:FS01:0x22:An exception occurred at step 123742: Particle coordinate is nan
04:37:15:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:49:51:WU03:FS01:0x22:An exception occurred at step 125248: Particle coordinate is nan
04:49:51:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:50:57:WU03:FS01:0x22:An exception occurred at step 125248: Particle coordinate is nan
04:50:57:WU03:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
04:51:05:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:16600 run:0 clone:1724 gen:53 core:0x22 unit:0x000000398f59f36f5ec369105fcac154
04:51:05:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13421 run:4883 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d089067246
04:51:15:WU01:FS01:0x22:ERROR:NaNs detected in forces. 0 0
04:51:16:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:4883 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d089067246
04:51:17:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:4910 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d09c77dd0b
04:51:31:WU02:FS01:0x22:ERROR:NaNs detected in forces. 0 0
04:51:32:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:4910 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d09c77dd0b
******************************* Date: 2020-08-04 *******************************
and so on and so on...
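(As a rough way to quantify this, the short Python sketch below tallies the "Sending unit results" lines in the log by project and error state; the log path is only an example and may differ on your system.)

Code: Select all

# Rough sketch: count completed vs FAULTY returns per project from a
# FAHClient log, based on the line format shown in the excerpt above.
import re
from collections import Counter

LOG_PATH = "/var/lib/fahclient/log.txt"   # example path, adjust as needed

pattern = re.compile(r"Sending unit results:.*error:(\w+) project:(\d+)")
counts = Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            error_state, project = match.groups()
            counts[(project, error_state)] += 1

for (project, error_state), n in sorted(counts.items()):
    print(f"project {project}: {n} returns with state {error_state}")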

For reference, here is the system info after a restart, showing that this is still going on:

Code: Select all

******************************* Date: 2020-08-12 *******************************
16:32:45:Read GPUs.txt
16:32:45:Enabled folding slot 00: READY cpu:6
16:32:45:Enabled folding slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580/590]
16:32:45:****************************** FAHClient ******************************
16:32:45:        Version: 7.6.13
16:32:45:******************************* System ********************************
16:32:45:            CPU: AMD FX(tm)-8150 Eight-Core Processor
16:32:45:         CPU ID: AuthenticAMD Family 21 Model 1 Stepping 2
16:32:45:           CPUs: 8
16:32:45:         Memory: 11.68GiB
16:32:45:    Free Memory: 9.46GiB
16:32:45:        Threads: POSIX_THREADS
16:32:45:     OS Version: 5.4
16:32:45:    Has Battery: false
16:32:45:     On Battery: false
16:32:45:     UTC Offset: -4
16:32:45:            PID: 16712
16:32:45:            CWD: /var/lib/fahclient
16:32:45:             OS: Linux 5.4.0-42-generic x86_64
16:32:45:        OS Arch: AMD64
16:32:45:           GPUs: 1
16:32:45:          GPU 0: Bus:1 Slot:0 Func:0 AMD:5 Ellesmere XT [Radeon RX
16:32:45:                 470/480/570/580/590]
16:32:45:           CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
16:32:45:                 libcuda.so: cannot open shared object file: No such file or
16:32:45:                 directory
16:32:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3075.10
16:32:49:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13421 run:3286 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd63082338
16:32:57:WU01:FS01:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0
16:32:58:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3286 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd63082338
16:32:59:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:3286 clone:11 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd9e814fe6
16:33:07:WU02:FS01:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0
16:33:07:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3286 clone:11 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd9e814fe6
16:33:33:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
17:00:14:WU01:FS01:0x22:An exception occurred at step 56223: Particle coordinate is nan
17:00:14:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:16:16:WU01:FS01:0x22:An exception occurred at step 81323: Particle coordinate is nan
17:16:16:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:36:23:WU01:FS01:0x22:An exception occurred at step 115710: Particle coordinate is nan
17:36:23:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:39:22:WU01:FS01:0x22:An exception occurred at step 103411: Particle coordinate is nan
17:39:22:WU01:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
17:39:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
17:39:30:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:3240 clone:69 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d52793d8a1
17:39:45:WU02:FS01:0x22:ERROR:NaNs detected in forces. 0 0
17:39:46:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3240 clone:69 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d52793d8a1
17:39:47:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:13421 run:3240 clone:83 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d5e5dd3bdc
17:39:56:WU03:FS01:0x22:ERROR:NaNs detected in forces. 0 0
17:39:56:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:13421 run:3240 clone:83 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d5e5dd3bdc
17:40:25:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:692 gen:280 core:0x22 unit:0x000001448f59f36f5ec36911ee8b859f
17:46:13:WU02:FS01:0x22:An exception occurred at step 10290: Particle coordinate is nan
17:46:13:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
18:09:41:WU02:FS01:0x22:An exception occurred at step 47438: Particle coordinate is nan
18:09:41:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
18:33:12:WU02:FS01:0x22:An exception occurred at step 73291: Particle coordinate is nan
18:33:12:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
18:40:37:WU02:FS01:0x22:An exception occurred at step 62749: Particle coordinate is nan
18:40:37:WU02:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
18:40:44:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:16600 run:0 clone:692 gen:280 core:0x22 unit:0x000001448f59f36f5ec36911ee8b859f
18:40:45:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13421 run:3200 clone:39 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d518f7b783
18:40:56:WU01:FS01:0x22:ERROR:NaNs detected in forces. 0 0
18:40:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3200 clone:39 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d518f7b783
18:41:24:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:112 gen:402 core:0x22 unit:0x000001bb8f59f36f5ec36912518a1dea
19:26:39:WU03:FS01:0x22:An exception occurred at step 93622: Particle coordinate is nan
19:26:39:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
No hardware or software changes, rig is active 24/7, restart didn't change behavior.

Advice? Is there anything a folder can do to make this GPU do something productive, or should I just disable the GPU for a couple of days and see if things get fixed?

(It still burns an extra 150 W, but is it worth it just for handing in a "FAULTY" WU every 30 minutes which other GPUs, according to muziqaz, have no problem returning properly?)
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Post by bruce »

Based on https://apps.foldingathome.org/wu#proje ... e=21&gen=0, your GPU had an error and so did somebody else's, so I can't tell whether the WU is really bad or whether there's a problem in both your system and somebody else's.

project:16600 run:0 clone:112 gen:402 is no help, either.

I don't know much about project:16600
NormalDiffusion
Posts: 124
Joined: Sat Apr 18, 2020 1:50 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by NormalDiffusion »

UofM.MartinK wrote: Advice? Is there anything a folder can do to make this GPU do something productive, or should I just disable the GPU for a couple of days and see if things get fixed?

(It still burns an extra 150 W, but is it worth it just for handing in a "FAULTY" WU every 30 minutes which other GPUs, according to muziqaz, have no problem returning properly?)
Keep it running! If everyone starts to blacklist the WS, and thus the WUs, no one will notice the problem. Then there will be no improvement in the future, and it may even get worse.
For AMD owners it's even worse: our failed WUs get diluted, I think, by the mass of Nvidia cards (the OS stats page is giving me an error right now, so I can't give you the actual numbers...).

I know it's annoying to see all these failed WUs, but keep folding!
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Post by bruce »

FAH has a team who are collecting failure reports like that one. They've been fixing the high-probability errors but the list is long and isolating a fix for each one often takes additional testing.

P134yy is a replacement for p134xx but with a shorter list.
ViTe
Posts: 20
Joined: Tue Feb 14, 2012 2:22 am

Re: 16600 consistently crashing on AMD Radeon VII

Post by ViTe »

NormalDiffusion wrote: Keep it running! If everyone starts to blacklist the WS, and thus the WUs, no one will notice the problem. Then there will be no improvement in the future...
We see multiple reports of the problem with good descriptions in this topic, and we have made our reports - that is enough to escalate the investigation and bring the problem to the attention of the scientists, admins and testers. And that's it! All other tests/analytics/improvements are on them.
Thousands of other users who don't monitor their clients and don't read this forum will generate more faulty returns. But for us, this project is done and we shouldn't waste our computing power on it anymore. Hundreds of our faulty returns are not important, because it's all the same. It doesn't matter how many faulty returns we make: 20 or 200. If we are the only people with faults, it means it's just a local problem (software/hardware/combination); it makes little sense to investigate it, and no sense for us to waste our resources when all other projects run with no issues.
It's much more important IF faults are happening on many OTHER machines. That is an indicator that it's not just a local problem. Testers/admins/scientists should investigate it. We have given them the signal, and now it's their turn to check and do the job if necessary.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Post by bruce »

As I said above, the error reports are prioritized by their frequency. "Forces are blowing up" has been reported by you and by others, but those reports represent a small fraction of all errors reported by p1342*. Yes, testers/admins/scientists should investigate it, and they will once they've finished working on the problems that are occurring at a higher rate.

I do not believe it's a local problem, but it might be. Some percentage of WUs crash spontaneously, but they also crash because of bad drivers, unstable overclocking, and other reasons. If there's enough information in your error report(s) to figure out which is applicable, thank you for the report(s) ... except if it's a spontaneous case.
UofM.MartinK
Posts: 59
Joined: Tue Apr 07, 2020 8:53 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by UofM.MartinK »

I wasn't thinking of blacklisting the WS; I am considering pausing the RX580 slot to save the energy.

It has only gotten WUs from project:13421 and project:16600 for 10+ straight days now and has never completed a single one; it always sends them back as "faulty" after working on them for somewhere between 0 seconds and 2 hours.

I wonder if these projects trigger some very specific problems/properties of certain (individual) cards - think race conditions or even "rowhammer" and the like - because when and how each of these WUs fails, at least on my system, seems statistically distributed over time: retries bail out at very different step counts, and it's the norm rather than the exception that the WU reaches the next checkpoint on a retry.

That being said, one WU just made it to 73% after two "restart from last good checkpoint" messages before "ERROR:114: Max number of attempts to resume from last checkpoint reached." and being sent back as Faulty.

I can't find a setting to increase the number of retries; it doesn't seem to be max-unit-errors, which defaults to 5 rather than 2 or 3.

If I could configure 10 retries instead of 2, it might actually complete about half of the assigned WUs.
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: 16600 consistently crashing on AMD Radeon VII

Post by PantherX »

UofM.MartinK wrote:...That being said, one WU just made it to 73% after two "restart from last good checkpoint" messages before "ERROR:114: Max number of attempts to resume from last checkpoint reached." and being sent back as Faulty.

I can't find a setting to increase the number of retries; it doesn't seem to be max-unit-errors, which defaults to 5 rather than 2 or 3...
That setting is set at the project level and is carried in the WU itself. By default the value is 2, but it can be changed by the project owner. From observation, it is a setting that is rarely changed.
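(For contrast, the max-unit-errors option mentioned above is a client-side FAHClient setting. Below is a hedged sketch of what it would look like in config.xml, assuming the usual v7 <option v='value'/> syntax; it governs how many faults the client tolerates before dumping a WU, not the in-WU checkpoint-retry count described here.)

Code: Select all

<config>
  <!-- Illustrative example only: client-side limit on how many errors a
       WU may accumulate before the client dumps it (default 5). This is
       a different knob from the checkpoint-retry count, which is set by
       the project owner and travels inside the WU itself. -->
  <max-unit-errors v='5'/>

  <slot id='1' type='GPU'/>
</config>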
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
gunnarre
Posts: 559
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: 16600 consistently crashing on AMD Radeon VII

Post by gunnarre »

I'm seeing an RX580 on Windows which is returning plenty of valid results on project 13421 without any restarts. Is it possible to get FAHBench to work on a chosen good work unit? (project:13421 run:3765 clone:27 gen:1 works on the RX580 under Windows here.) Or I could test one of the 16600 WUs which is NaN'ing on UofM.MartinK's RX580.
Online: GTX 1660 Super + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 1050 Ti 4G OC, RX580