Re: 16600 consistently crashing on AMD Radeon VII

Posted: Mon Aug 10, 2020 11:12 pm
by bruce
ViTe wrote:... and you never see 16600 again
And an unknown list of future projects that might be distributed from that server.

The other thing that's probably worth investigating is alternate drivers. AMD has not been particularly good at providing reliable drivers for many of their past GPUs.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Tue Aug 11, 2020 12:23 am
by ViTe
bruce wrote:
ViTe wrote:... and you never see 16600 again
And an unknown list of future projects that might be distributed from that server.
C'mon :e), I hope we have no kids here and everybody understands the prolonged effect of this ban. Just remove it after a month or so. It's not the first time we've gotten projects with problems. I always do it this way.
bruce wrote:The other thing that's probably worth investigating is alternate drivers. AMD has not been particularly good at providing reliable drivers for many of their past GPUs.
As far as I can see from this discussion, we have already checked a few driver versions and a few generations of cards and got the same result. It makes no sense to waste time playing with it. It looks like Nvidia cards have no issues, so let them do this job and we'll do something else.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Tue Aug 11, 2020 2:14 pm
by Nuitari
The project owner could easily reconfigure things to block AMD GPUs from it.
That doesn't fix the fact that something is broken in either the core or the project.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Tue Aug 11, 2020 2:16 pm
by muziqaz
Nah, the project owner needs to reply to my request, and then we can bring that project back to the testing stage to see what is going on. Unfortunately, the owner is missing in action at the moment :)
No one is gonna ban anything without an investigation ;) The project ran fine with older fahcore versions, and since initial testing we have had 7 new revisions of the fahcore.
We have a lot of things going on at the same time which might improve the constraint system. Might.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 12, 2020 12:47 am
by ViTe
So the project owner didn't reply to you in one week? Oh, what a rush... ))

Guys, I'd say it's not smart to waste machine time like this. If you see a 70-90% failure rate on AMD and no failures on Nvidia, it's obvious that this job is for Nvidia cards and AMD cards shouldn't receive it. Testing, investigating and other internal kitchen stuff is for the testers, not for our eyes. It's not our business. We do calculations and we'd like to get math that we can do. We found a problem, we reported it, that's it. That's enough to make a quick decision to shift the calculations to a platform that is able to do them successfully.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 12, 2020 6:27 am
by muziqaz
First off, there is no 70-90% failure rate on AMD, period. We have a couple of people reporting this on a couple of models. For certain people it is a high failure rate, not for all. OP had issues with 16448, yet we ran this project on 3 different AMD cards (including the one OP has) and we did not encounter a single failure.
Second, when a WU fails, the client sends it back to the collection server and partial credit is given. That WU is put straight back into the queue for another folder. If a WU fails a certain number of times, it is blacklisted for further inspection.
Yes, it is unfortunate that the project owner is on holiday (you know, unis have holidays during the summer) or maybe is tied up with other stuff (some of the scientists are in the season of defending their theses), but it is not the end of the world, so hold your horses, guys, everything will be OK.
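
A rough sketch of the return-and-requeue flow described above, for illustration only. The threshold of 3 failures, the function name and the data structures are all assumptions; the post only says a WU is blacklisted after failing "a certain number of times", and this is not the actual server code.

```python
from collections import deque

# Hypothetical threshold: the post does not state the real number.
MAX_FAILURES = 3


def handle_return(wu_id, ok, fail_counts, queue, blacklist,
                  max_failures=MAX_FAILURES):
    """Record one returned WU and decide what happens to it next."""
    if ok:
        return "done"
    # Count this failure against the WU itself, not against the folder.
    fail_counts[wu_id] = fail_counts.get(wu_id, 0) + 1
    if fail_counts[wu_id] >= max_failures:
        blacklist.add(wu_id)   # held back for further inspection
        return "blacklisted"
    queue.append(wu_id)        # handed out again to another folder
    return "requeued"


if __name__ == "__main__":
    counts, queue, blacklist = {}, deque(), set()
    for _ in range(3):
        print(handle_return("example-wu", False, counts, queue, blacklist))
```

The point of the sketch is that a failing WU is not lost: it either gets another chance on a different machine or is set aside once it has failed repeatedly.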

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 12, 2020 8:18 pm
by UofM.MartinK
Perhaps another valuable data point:

Since August 3rd, my RX580 hasn't managed to return ANY assigned WU successfully.
Before that date, it had been processing project 13421 exclusively for days, usually successfully, but with occasional streaks of several faulty WUs in a row, with errors like:

'Force RMSE error of 22.695 with threshold of 5'
'Potential energy error of 46.9965, threshold of 10'
'An exception occurred at step 224142: Particle coordinate is nan'
'NaNs detected in forces. 0 0'
'Discrepancy: Forces are blowing up! 0 0'

Since August 3rd, coincidentally(?) when project 16600 joined the mix, not a single WU has completed on the RX580 in that rig. It's the same mix of error messages as above, but now for every single WU (all of which happen to be from either project 13421 or 16600).

******************************* Date: 2020-08-03 *******************************
22:09:54:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:7887 clone:21 gen:0 core:0x22 unit:0x0000000112bc7d9a5f26fb4f3a86697f
22:11:20:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:13421 run:8140 clone:12 gen:0 core:0x22 unit:0x0000000012bc7d9a5f284d8be4253e2e
22:11:21:WU00:FS01:Final credit estimate, 12664.00 points
23:07:18:WU02:FS01:0x22:ERROR:Potential energy error of 12.5142, threshold of 10
23:07:18:WU02:FS01:0x22:ERROR:Reference Potential Energy: -56187.4 | Given Potential Energy: -56200
23:07:18:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:7887 clone:21 gen:0 core:0x22 unit:0x0000000112bc7d9a5f26fb4f3a86697f
23:07:19:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13421 run:4773 clone:26 gen:0 core:0x22 unit:0x0000000312bc7d9a5f20bd44a8a60c11
23:07:28:WU00:FS01:0x22:ERROR:Potential energy error of 46.9965, threshold of 10
23:07:28:WU00:FS01:0x22:ERROR:Reference Potential Energy: -57187 | Given Potential Energy: -57234
23:07:28:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:4773 clone:26 gen:0 core:0x22 unit:0x0000000312bc7d9a5f20bd44a8a60c11
23:07:57:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:391 gen:243 core:0x22 unit:0x0000010f8f59f36f5ec36912d651e428
******************************* Date: 2020-08-03 *******************************
23:48:35:WU02:FS01:0x22:An exception occurred at step 84335: Particle coordinate is nan
23:48:35:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
23:50:34:WU02:FS01:0x22:An exception occurred at step 76303: Particle coordinate is nan
23:50:34:WU02:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
23:50:42:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:16600 run:0 clone:391 gen:243 core:0x22 unit:0x0000010f8f59f36f5ec36912d651e428
23:50:42:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13421 run:2387 clone:27 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4e3412434b79
23:50:57:WU00:FS01:0x22:ERROR:NaNs detected in forces. 0 0
23:50:58:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:2387 clone:27 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4e3412434b79
23:51:25:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1826 gen:16 core:0x22 unit:0x000000108f59f36f5ec3691023278959
00:01:39:WU03:FS01:0x22:An exception occurred at step 18573: Particle coordinate is nan
00:01:39:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
01:16:44:WU03:FS01:0x22:An exception occurred at step 159635: Particle coordinate is nan
01:16:44:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
01:52:38:WU03:FS01:0x22:An exception occurred at step 224142: Particle coordinate is nan
01:52:38:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
02:02:57:WU03:FS01:0x22:An exception occurred at step 220126: Particle coordinate is nan
02:02:57:WU03:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
02:03:04:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:16600 run:0 clone:1826 gen:16 core:0x22 unit:0x000000108f59f36f5ec3691023278959
02:03:33:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1154 gen:368 core:0x22 unit:0x000001918f59f36f5ec369111cb9089d
02:22:30:WU01:FS01:0x22:An exception occurred at step 38653: Particle coordinate is nan
02:22:30:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
02:38:02:WU01:FS01:0x22:An exception occurred at step 56725: Particle coordinate is nan
02:38:02:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
03:20:15:WU01:FS01:0x22:An exception occurred at step 139053: Particle coordinate is nan
03:20:15:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
03:24:45:WU01:FS01:0x22:An exception occurred at step 132025: Particle coordinate is nan
03:24:45:WU01:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
03:24:58:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1154 gen:368 core:0x22 unit:0x000001918f59f36f5ec369111cb9089d
03:24:59:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:2959 clone:30 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4f696398bc58
03:25:11:WU02:FS01:0x22:ERROR:NaNs detected in forces. 0 0
03:25:11:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:2959 clone:30 gen:0 core:0x22 unit:0x0000000212bc7d9a5f1f4f696398bc58
03:25:40:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1724 gen:53 core:0x22 unit:0x000000398f59f36f5ec369105fcac154
04:01:31:WU03:FS01:0x22:An exception occurred at step 73793: Particle coordinate is nan
04:01:31:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:37:15:WU03:FS01:0x22:An exception occurred at step 123742: Particle coordinate is nan
04:37:15:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:49:51:WU03:FS01:0x22:An exception occurred at step 125248: Particle coordinate is nan
04:49:51:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:50:57:WU03:FS01:0x22:An exception occurred at step 125248: Particle coordinate is nan
04:50:57:WU03:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
04:51:05:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:16600 run:0 clone:1724 gen:53 core:0x22 unit:0x000000398f59f36f5ec369105fcac154
04:51:05:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13421 run:4883 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d089067246
04:51:15:WU01:FS01:0x22:ERROR:NaNs detected in forces. 0 0
04:51:16:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:4883 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d089067246
04:51:17:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:4910 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d09c77dd0b
04:51:31:WU02:FS01:0x22:ERROR:NaNs detected in forces. 0 0
04:51:32:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:4910 clone:31 gen:0 core:0x22 unit:0x0000000112bc7d9a5f2249d09c77dd0b
******************************* Date: 2020-08-04 *******************************
and so on and so on...
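
For anyone who wants to summarize a log like the one above instead of reading it line by line, here is a minimal sketch that counts returned WUs per project and result state. It relies only on the "Sending unit results" line format visible in this log; the function name `tally_results` and the `log.txt` path are made up for the example.

```python
import re
import sys
from collections import Counter, defaultdict

# Matches e.g. "...Sending unit results: id:02 state:SEND error:FAULTY project:13421 ..."
SEND_RE = re.compile(r"Sending unit results: .*?\berror:(\w+) project:(\d+)")


def tally_results(lines):
    """Return {project: Counter({error_state: count})} for all returned WUs."""
    tally = defaultdict(Counter)
    for line in lines:
        match = SEND_RE.search(line)
        if match:
            error, project = match.groups()
            tally[project][error] += 1
    return tally


if __name__ == "__main__":
    # Usage: python tally.py log.txt   (the file name is only an example)
    path = sys.argv[1] if len(sys.argv) > 1 else "log.txt"
    with open(path, encoding="utf-8", errors="replace") as handle:
        for project, counts in sorted(tally_results(handle).items()):
            print(project, dict(counts))
```

A per-project breakdown like this makes it easier to say whether one project or every project is failing on a given card.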

For reference, here is the system info after a restart; the problem is still going on:

******************************* Date: 2020-08-12 *******************************
16:32:45:Read GPUs.txt
16:32:45:Enabled folding slot 00: READY cpu:6
16:32:45:Enabled folding slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580/590]
16:32:45:****************************** FAHClient ******************************
16:32:45:        Version: 7.6.13
16:32:45:******************************* System ********************************
16:32:45:            CPU: AMD FX(tm)-8150 Eight-Core Processor
16:32:45:         CPU ID: AuthenticAMD Family 21 Model 1 Stepping 2
16:32:45:           CPUs: 8
16:32:45:         Memory: 11.68GiB
16:32:45:    Free Memory: 9.46GiB
16:32:45:        Threads: POSIX_THREADS
16:32:45:     OS Version: 5.4
16:32:45:    Has Battery: false
16:32:45:     On Battery: false
16:32:45:     UTC Offset: -4
16:32:45:            PID: 16712
16:32:45:            CWD: /var/lib/fahclient
16:32:45:             OS: Linux 5.4.0-42-generic x86_64
16:32:45:        OS Arch: AMD64
16:32:45:           GPUs: 1
16:32:45:          GPU 0: Bus:1 Slot:0 Func:0 AMD:5 Ellesmere XT [Radeon RX
16:32:45:                 470/480/570/580/590]
16:32:45:           CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
16:32:45:                 libcuda.so: cannot open shared object file: No such file or
16:32:45:                 directory
16:32:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3075.10
16:32:49:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13421 run:3286 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd63082338
16:32:57:WU01:FS01:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0
16:32:58:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3286 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd63082338
16:32:59:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:3286 clone:11 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd9e814fe6
16:33:07:WU02:FS01:0x22:ERROR:Discrepancy: Forces are blowing up! 0 0
16:33:07:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3286 clone:11 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd9e814fe6
16:33:33:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
17:00:14:WU01:FS01:0x22:An exception occurred at step 56223: Particle coordinate is nan
17:00:14:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:16:16:WU01:FS01:0x22:An exception occurred at step 81323: Particle coordinate is nan
17:16:16:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:36:23:WU01:FS01:0x22:An exception occurred at step 115710: Particle coordinate is nan
17:36:23:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:39:22:WU01:FS01:0x22:An exception occurred at step 103411: Particle coordinate is nan
17:39:22:WU01:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
17:39:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
17:39:30:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13421 run:3240 clone:69 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d52793d8a1
17:39:45:WU02:FS01:0x22:ERROR:NaNs detected in forces. 0 0
17:39:46:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3240 clone:69 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d52793d8a1
17:39:47:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:13421 run:3240 clone:83 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d5e5dd3bdc
17:39:56:WU03:FS01:0x22:ERROR:NaNs detected in forces. 0 0
17:39:56:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:13421 run:3240 clone:83 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d5e5dd3bdc
17:40:25:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:692 gen:280 core:0x22 unit:0x000001448f59f36f5ec36911ee8b859f
17:46:13:WU02:FS01:0x22:An exception occurred at step 10290: Particle coordinate is nan
17:46:13:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
18:09:41:WU02:FS01:0x22:An exception occurred at step 47438: Particle coordinate is nan
18:09:41:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
18:33:12:WU02:FS01:0x22:An exception occurred at step 73291: Particle coordinate is nan
18:33:12:WU02:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
18:40:37:WU02:FS01:0x22:An exception occurred at step 62749: Particle coordinate is nan
18:40:37:WU02:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
18:40:44:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:16600 run:0 clone:692 gen:280 core:0x22 unit:0x000001448f59f36f5ec36911ee8b859f
18:40:45:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13421 run:3200 clone:39 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d518f7b783
18:40:56:WU01:FS01:0x22:ERROR:NaNs detected in forces. 0 0
18:40:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3200 clone:39 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d518f7b783
18:41:24:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:16600 run:0 clone:112 gen:402 core:0x22 unit:0x000001bb8f59f36f5ec36912518a1dea
19:26:39:WU03:FS01:0x22:An exception occurred at step 93622: Particle coordinate is nan
19:26:39:WU03:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
No hardware or software changes, rig is active 24/7, restart didn't change behavior.

Advice? Is there anything a folder can do to make this GPU do something productive, or should I just disable the GPU for a couple of days and see if things get fixed?

(It still burns an extra 150 W, but is it worth it just for handing in a "FAULTY" WU every 30 minutes which other GPUs, according to muziqaz, have no problem returning properly?)

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 12, 2020 9:01 pm
by bruce
Based on https://apps.foldingathome.org/wu#proje ... e=21&gen=0, your GPU had an error and so did somebody else's, so I can't tell whether the WU is really bad or there's a problem in both your system and somebody else's.

project:16600 run:0 clone:112 gen:402 is no help, either.

I don't know much about project:16600

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 12, 2020 9:11 pm
by NormalDiffusion
UofM.MartinK wrote: Advice? Is there anything a folder can do to make this GPU do something productive, or should I just disable the GPU for a couple of days and see if things get fixed?

(It still burns an extra 150 W, but is it worth it just for handing in a "FAULTY" WU every 30 minutes which other GPUs, according to muziqaz, have no problem returning properly?)
Keep it running! If everyone starts to blacklist the WS and thus the WUs, no one will notice the problem. So there will be no improvement in the future and it may even get worse.
For AMD owners it's even worse: our failed WUs get diluted, I think, in the mass of Nvidia cards (the OS stats page is giving me an error right now, so I can't give you the actual numbers...).

I know it's annoying to see all these failed WUs, but keep folding!

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 12, 2020 9:33 pm
by bruce
FAH has a team who are collecting failure reports like that one. They've been fixing the high-probability errors but the list is long and isolating a fix for each one often takes additional testing.

P134yy is a replacement for p134xx but with a shorter list.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 12:54 am
by ViTe
NormalDiffusion wrote: Keep it running! If everyone starts to blacklist WS and thus WUs, no one will notice the problem. So there will be no improvement in the future...
We see multiple reports of the problem with good descriptions in this topic, and we made our reports. That's enough to escalate the investigation and bring the problem to the attention of scientists, admins and testers. And that's it! All other tests/analytics/improvements are on them.
Thousands of other users who don't monitor their clients and don't read this forum will generate more faulty returns. But for us, this project is done and we shouldn't waste our computing power on it anymore. Hundreds of our faulty returns are not important, because they're all the same. It doesn't matter how many faulty returns we make: 20 or 200. If we are the only people with faults, it means it's just a local problem (software/hardware/combo), so it doesn't make much sense to investigate it, and it makes no sense for us to waste our resources if all other projects are running with no issues.
It's much more important IF faults are happening on many OTHER machines. That's an indicator that it's not just a local problem, and testers/admins/scientists should investigate it. We gave them the signal, and now it's their turn to check and do the job if necessary.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 4:03 am
by bruce
As I said above, the error reports are prioritized by their frequency. "Forces are blowing up" has been reported by you and by others, and those reports represent a small fraction of all errors reported by p1342*. Yes, testers/admins/scientists should investigate it, and they will when they've finished working on the problems that are occurring at a higher rate.

I do not believe it's a local problem, but it might be. Some percentage of WUs crash spontaneously but they also crash because of bad drivers or because of unstable overclocking and for other reasons. If there's enough information in your error report(s) to figure out which is applicable, thank you for the report(s). ... except if it's a spontaneous case.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 4:19 am
by UofM.MartinK
I wasn't thinking of blacklisting the WS; I am considering pausing the RX580 slot to save energy.

It has only received WUs from project:13421 and project:16600 for 10+ straight days now and has never completed a single one; it always sends them back as "faulty" after working on them for somewhere between 0 seconds and 2 hours.

I wonder if these projects trigger some very specific problems/properties of certain (individual) cards - think race condition or even "rowhammer" and the like - because when and how each of these WUs fails, at least on my system, seems randomly distributed over time: retries bail out at very different step counts, and it's the norm rather than the exception that a WU reaches the next checkpoint on retry.

That being said, one WU just made it to 73% after two "restart from last good checkpoint" messages before "ERROR:114: Max number of attempts to resume from last checkpoint reached." and being sent back as Faulty.

I can't find a setting to increase the number of retries; it doesn't seem to be max-unit-errors, which defaults to 5 rather than 2 or 3.

If I could configure 10 retries instead of 2, it might actually complete about half of the assigned WUs.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 5:31 am
by PantherX
UofM.MartinK wrote:...That being said, one WU just made it to 73% after two "restart from last good checkpoint" messages before "ERROR:114: Max number of attempts to resume from last checkpoint reached." and being sent back as Faulty.

I can't find a setting to increase the numbers of retries, it doesn't seem to be max-unit-errors which defaults to 5 instead of 2 or 3...
That setting is configured at the project level and is carried in the WU itself. By default, the value is 2, but the project owner can change it. From observation, it is rarely changed.
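
To illustrate why a larger retry limit could change the outcome, here is a toy model of a run that rolls back to its last checkpoint after each crash. It is only a sketch: the function name, the checkpoint interval and the list of crash steps are invented, and how the real core counts attempts (for example, whether the counter resets once a new checkpoint is written) is not stated in this thread, so the counting rule below is an assumption.

```python
def run_work_unit(total_steps, checkpoint_every, failure_steps, max_retries=2):
    """Simulate one WU; return "NO_ERROR" or "FAULTY".

    failure_steps lists the steps at which successive crashes happen
    (hypothetical input, consumed in order).
    """
    failures = list(failure_steps)
    last_checkpoint = 0
    retries_used = 0
    step = 0
    while step < total_steps:
        step += 1
        if failures and step == failures[0]:
            failures.pop(0)
            if retries_used >= max_retries:
                return "FAULTY"        # retry budget exhausted
            retries_used += 1
            step = last_checkpoint     # resume from the last good checkpoint
            continue
        if step % checkpoint_every == 0:
            last_checkpoint = step     # progress is saved here
    return "NO_ERROR"


if __name__ == "__main__":
    crashes = [25, 27, 55]  # made-up crash steps
    print(run_work_unit(100, 10, crashes, max_retries=2))
    print(run_work_unit(100, 10, crashes, max_retries=10))
```

With the same three crashes, the run is abandoned under a limit of 2 but finishes under a limit of 10, which is the effect asked about in the quoted post. Whether real WUs would behave that way depends on why they crash.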

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 13, 2020 5:43 am
by gunnarre
I'm seeing an RX580 on Windows which is returning plenty of valid results on project 13421 without any restarts. Is it possible to get FAHBench to work on a chosen good work unit? (project:13421 run:3765 clone:27 gen:1 works on the RX580 under Windows here.) Or I could test one of the 16600 WUs which is NaN'ing on UofM.MartinK's RX580.