16600 consistently crashing on AMD Radeon VII

Moderators: Site Moderators, FAHC Science Team

bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Post by bruce »

Now that time has passed since the last post by UofM.MartinK, we can look again at what happened to his 16600 WU. It has been assigned to another machine and completed successfully.

https://apps.foldingathome.org/wu#proje ... 12&gen=402

Since another machine has completed it successfully, it becomes more likely that this is a local problem or a driver problem. We can hope that comparing your error report with the successful completion by someone else will show what differences caused the crash. Any suggestions?
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: 16600 consistently crashing on AMD Radeon VII

Post by PantherX »

gunnarre wrote:...Is it possible to get FAHBench to work on a chosen good work unit? (project:13421 run:3765 clone:27 gen:1 works on the RX580 under Windows here...
Once FAHBench has been updated to support FahCore_22 then yes, you can run individual WUs.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: 16600 consistently crashing on AMD Radeon VII

Post by foldy »

I can provide a prebuilt Windows FAHBench with FahCore_22 if anyone is interested. Or you can build OpenMM and FAHBench from source yourself.
viewtopic.php?f=38&t=24225&p=327396&hilit=fahbench#p327396
UofM.MartinK
Posts: 59
Joined: Tue Apr 07, 2020 8:53 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by UofM.MartinK »

Out of the blue, two units succeeded (one 16600, one 13421), now a lot of fails again:

Code:

******************************* Date: 2020-08-12 *******************************
16:32:58:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3286 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd63082338
16:33:07:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3286 clone:11 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd9e814fe6
17:39:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
17:39:46:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3240 clone:69 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d52793d8a1
17:39:56:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:13421 run:3240 clone:83 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d5e5dd3bdc
18:40:44:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:16600 run:0 clone:692 gen:280 core:0x22 unit:0x000001448f59f36f5ec36911ee8b859f
18:40:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3200 clone:39 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d518f7b783
20:16:42:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:16600 run:0 clone:112 gen:402 core:0x22 unit:0x000001bb8f59f36f5ec36912518a1dea
23:20:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1450 gen:151 core:0x22 unit:0x000000aa8f59f36f5ec369114358cbbf
23:36:43:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:16600 run:0 clone:1745 gen:109 core:0x22 unit:0x000000798f59f36f5ec3691054460df8
08:43:33:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:16600 run:0 clone:1189 gen:420 core:0x22 unit:0x000001d48f59f36f5ec369117e373557
******************************* Date: 2020-08-13 *******************************
11:01:26:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:13421 run:7982 clone:63 gen:2 core:0x22 unit:0x0000000212bc7d9a5f26fb5a703f266e
12:59:37:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:7828 clone:80 gen:2 core:0x22 unit:0x0000000212bc7d9a5f26fb55a92eb41f
12:59:52:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:7695 clone:22 gen:2 core:0x22 unit:0x0000000312bc7d9a5f224a430ccd3540
13:00:03:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:7695 clone:46 gen:2 core:0x22 unit:0x0000000312bc7d9a5f224a41fbe4d9f2
All of the failed 16600 WUs were completed by others in the meantime:
https://apps.foldingathome.org/wu#proje ... 09&gen=126 (completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 92&gen=280 (failed by another AMD, completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 12&gen=402 (completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 50&gen=151 (completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 45&gen=109 (completed by NVIDIA)

By contrast, none of the 13421 WUs show any results yet other than my fails.

I have yet to find a WU that my RX580 failed and another AMD GPU completed, but I am sure they are out there, since this all seems to be a statistics game.
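
To make the "statistics game" a bit more concrete, here is a minimal sketch that tallies the "Sending unit results" lines from a client log by project and outcome. It assumes the line format shown in the excerpt above; the function names are made up for illustration and are not part of FAHClient.

```python
import re
from collections import Counter

# Matches lines like:
#   16:32:58:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 ...
# The pattern is an assumption based on the log excerpt in this post.
RESULT_RE = re.compile(
    r"Sending unit results:.*?\berror:(?P<error>\w+)\s+project:(?P<project>\d+)"
)


def tally_results(lines):
    """Count (project, error) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RESULT_RE.search(line)
        if match:
            counts[(int(match.group("project")), match.group("error"))] += 1
    return counts


def summarize(counts):
    """Turn the tally into {project: (ok, total)}, where ok counts NO_ERROR results."""
    summary = {}
    for (project, error), n in counts.items():
        ok, total = summary.get(project, (0, 0))
        summary[project] = (ok + (n if error == "NO_ERROR" else 0), total + n)
    return summary


def report(lines):
    """Print one line per project with the share of results returned as NO_ERROR."""
    for project, (ok, total) in sorted(summarize(tally_results(lines)).items()):
        print(f"project {project}: {ok}/{total} returned NO_ERROR")
```

Feeding it the lines of a log file, e.g. report(open(path)) with path pointing at the client log, gives a per-project success count that can be compared before and after a driver or clock change.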

In what way would FAHBench help in this case? This rig runs Ubuntu 20.04. If running FAHBench makes sense, is there anything I can help with, such as providing some WU work directory snapshots?

I will try underclocking the GPU next if I have the time.
muziqaz
Posts: 942
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Post by muziqaz »

I believe we might make a decision to ban AMD folding on Linux :D fahbenching will not help anything on Linux. It is clear as day that AMD on Linux is like winning a lottery. Does more harm than good. :(
FAH Omega tester
gunnarre
Posts: 559
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: 16600 consistently crashing on AMD Radeon VII

Post by gunnarre »

I've seen some reports of ROCm working when the regular AMD Pro drivers don't work. Could that be something to try?
Online: GTX 1660 Super + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 1050 Ti 4G OC, RX580
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Post by bruce »

muziqaz wrote:I believe we might make a decision to ban AMD folding on Linux :D fahbenching will not help anything on Linux. It is clear as day that AMD on Linux is like winning a lottery. Does more harm than good. :(
That would be a real shame but it's better than continuing to do more harm than good. I wonder if it's possible to (A) find dependable drivers and (B) (somehow) ban the "lottery" drivers. :idea:

It might be possible that all the WUs that fail with these AMD drivers are later completed with nV drivers on nV hardware but from the limited information I can gather, I have no way to explore the validity of such a guess.
muziqaz
Posts: 942
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Post by muziqaz »

The problem with Linux is that it has 5 billion different flavours, with half a billion different drivers and another quarter of a billion OpenCL packages. Everything is nearly handpicked and DIY-created. On Windows you get a single official driver package, and if that doesn't work, you go throw a stone at MS's window and hope for the best. With Linux there are so many variables, it's insane. Combine that with catastrophically hit-or-miss stability from AMD, and we have a recipe for disaster.
FAH Omega tester
UofM.MartinK
Posts: 59
Joined: Tue Apr 07, 2020 8:53 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by UofM.MartinK »

I don't know if it's the moon phase or something, but my rig returned another 16600 WU successfully, and is going strong on the next one.

Keep in mind, nothing changed on the rig - even temperature is pretty constant.
bruce wrote:It might be possible that all the WUs that fail with these AMD drivers are later completed with nV drivers on nV hardware but from the limited information I can gather, I have no way to explore the validity of such a guess.
I checked some more of the 16600 WUs that failed on my RX580 on August 4th, and it is the same picture:

project:16600 run:0 clone:391 gen:243 (completed by NVIDIA)
project:16600 run:0 clone:1826 gen:16 (completed by NVIDIA)
project:16600 run:0 clone:1154 gen:368 (failed by another AMD, completed by NVIDIA)
project:16600 run:0 clone:1724 gen:53 (completed by NVIDIA)

But keep in mind, my RX580 also completed some other 16600 WUs in the meantime, although some with "restarting".

It is pretty clear that this is an issue with a very specific combination of GPU (architecture) and WU, perhaps facilitated by the driver. Beyond that it is a lot of dice rolling: once I get a lot of fails again, I will update the driver, but even if it then works for two days in a row without a fail, will that tell us much?
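
As a rough way to think about that question, the sketch below asks how long a run of clean WUs would have to be before luck alone becomes an unlikely explanation. It assumes each WU succeeds independently with a fixed probability, which is probably too simple here since the fails seem to come in streaks; the function names and the 5% threshold are illustrative choices only.

```python
def chance_of_streak(p_success, n):
    """Probability of n successes in a row if every WU independently succeeds with p_success."""
    return p_success ** n


def streak_needed(p_success, alpha=0.05):
    """Smallest run length whose probability under p_success drops below alpha."""
    if not 0.0 < p_success < 1.0:
        raise ValueError("p_success must be strictly between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    n, prob = 0, 1.0
    while prob >= alpha:
        n += 1
        prob *= p_success
    return n
```

For example, if half of the WUs succeeded before a change, streak_needed(0.5) returns 5, so five clean results in a row would already be fairly unlikely by chance under this simple model. If failures cluster in time, a longer observation window is needed.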

I really wonder how we could get enough statistics around that... the client doesn't track driver version, I assume?

For reference, the driver since this rig was set up in April is amdgpu-pro 20.10-1048554 on Ubuntu 20.04.

Also, it might be that individual cards with the same chip, or from different brands (mine is a Sapphire Nitro+) behave differently.
NormalDiffusion
Posts: 124
Joined: Sat Apr 18, 2020 1:50 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by NormalDiffusion »

UofM.MartinK wrote: Also, it might be that individual cards with the same chip, or from different brands (mine is a Sapphire Nitro+) behave differently.
The Sapphire nitro+ is factory overclocked. Could you try to lower it to the default rx580 clock?
Had to do this on my Sapphire 290x...
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Post by bruce »

I really wonder how we could get enough statistics around that... the client doesn't track driver version, I assume?
The client knows the driver version but I don't believe that the FAHCore places that information in the error report. That would potentially be an enhancement for the FAHCore.

The fact that your GPU is (factory) overclocked is potentially another important piece of information, so we have to ask you to make that change as NormalDiffusion has suggested and report back to us. :!:
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by _r2w_ben »

bruce wrote:
I really wonder how we could get enough statistics around that... the client doesn't track driver version, I assume?
The client knows the driver version but I don't believe that the FAHCore places that information in the error report. That would potentially be an enhancement for the FAHCore.
On Windows, science.log contains a version reported by the OpenCL implementation and is sent to the server when a work unit is returned. (2348.4) corresponds to the Product version attribute of amdocl64.dll.

Code:

  PROFILE = FULL_PROFILE
  VERSION = OpenCL 2.0 AMD-APP (2348.4)
  NAME = AMD Accelerated Parallel Processing
  VENDOR = Advanced Micro Devices, Inc.
It's the same version number reported when FAHClient starts up.

Code:

19:09:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:2348.4
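If someone wants to collect these version numbers for statistics, a minimal sketch for pulling them out of the two kinds of lines quoted above could look like this. The regular expressions are assumptions based only on these examples, and the function names are made up for illustration.

```python
import re

# FAHClient startup line, e.g.
#   19:09:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:2348.4
DEVICE_RE = re.compile(r"OpenCL Device (\d+):.*?\bDriver:(\S+)")

# science.log platform line, e.g.
#   VERSION = OpenCL 2.0 AMD-APP (2348.4)
VERSION_RE = re.compile(r"VERSION\s*=\s*(.+?)\s*\((\S+?)\)\s*$")


def driver_from_client_log(line):
    """Return (device_index, driver_version) from a client log line, or None if it does not match."""
    match = DEVICE_RE.search(line)
    return (int(match.group(1)), match.group(2)) if match else None


def driver_from_science_log(line):
    """Return the version in parentheses from a science.log VERSION line, or None."""
    match = VERSION_RE.search(line)
    return match.group(2) if match else None
```

Both functions return plain strings for the version, so 2348.4 and 3075.10 are compared as text rather than as floating-point numbers.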
ViTe
Posts: 20
Joined: Tue Feb 14, 2012 2:22 am

Re: 16600 consistently crashing on AMD Radeon VII

Post by ViTe »

NormalDiffusion wrote:
UofM.MartinK wrote: Also, it might be that individual cards with the same chip, or from different brands (mine is a Sapphire Nitro+) behave differently.
The Sapphire nitro+ is factory overclocked. Could you try to lower it to the default rx580 clock?
Had to do this on my Sapphire 290x...
All RX580 Nitro+ cards have an unchanged base core clock (1257 MHz). Only the boost clock is a bit higher.
UofM.MartinK
Posts: 59
Joined: Tue Apr 07, 2020 8:53 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by UofM.MartinK »

Okay, found the driver in the log file, just have to figure out what the number refers to, thanks, _r2w_ben ! :)

Code:

04:34:20:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3075.10
The point about the boost clock is interesting. I did some more data mining and got munin to display all the temperatures and fan speeds that the "sensors" package found for that rig:

[Image: munin graph of the temperatures and fan speeds reported by "sensors" for this rig]

Update 8/15: I originally wrote that "Temp1" rather than "Edge" was the GPU temperature. It was late in the day, sorry for the confusion.
"Edge" seems to be the GPU temperature, and whenever it is "high", most WUs fail, and when it is "medium", most WUs complete, and "low" might be no GPU WU active.

So something puts the card into one state or the other (with high temperature perhaps being the boost clock state?), and the driver is a top candidate.

I will update the driver, and keep reporting. Perhaps also playing with downclocking later, but I will prioritize understanding the causes (since other AMD cards seem affected as well) and focusing on finding the "simplest" fix. An option to tell the driver to disable "boosting", for example.

Update: For now, instead of updating the driver, I just enabled the "POWER SAVING" profile instead of the default "3D_FULL_SCREEN" profile in pp_power_profile_mode. I still see SCLK boost to 1411 MHz occasionally (the card seems to support the following frequencies: 300, 600, 900, 1145, 1215, 1257, 1300, and 1411 MHz). I will disallow the 1411 MHz state next if WUs are still failing in "POWER SAVING" mode.
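
Disallowing the 1411 MHz state could look roughly like the sketch below. It assumes the amdgpu sysfs interface, where pp_dpm_sclk lists one level per line (for example "7: 1411Mhz *", with * marking the active level) and where, once power_dpm_force_performance_level is set to manual, writing a space-separated list of level indices to pp_dpm_sclk restricts the card to those levels. The card0 path and the helper names are assumptions for illustration. The sketch only reads and computes the mask, it does not write anything.

```python
import re
from pathlib import Path

# Assumption: the RX580 is card0; adjust for other systems.
SYSFS_DIR = Path("/sys/class/drm/card0/device")

# One level per line, e.g. "7: 1411Mhz *" (the asterisk marks the active level).
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*MHz\s*(\*)?", re.IGNORECASE)


def parse_sclk_levels(text):
    """Parse pp_dpm_sclk contents into a list of (index, mhz, is_current) tuples."""
    levels = []
    for line in text.splitlines():
        match = LEVEL_RE.match(line)
        if match:
            levels.append(
                (int(match.group(1)), int(match.group(2)), match.group(3) is not None)
            )
    return levels


def allowed_mask(levels, max_mhz):
    """Space-separated indices of all levels whose clock is at most max_mhz."""
    return " ".join(str(index) for index, mhz, _ in levels if mhz <= max_mhz)


def read_levels(sysfs_dir=SYSFS_DIR):
    """Read and parse the live pp_dpm_sclk file (only works on an amdgpu system)."""
    return parse_sclk_levels((sysfs_dir / "pp_dpm_sclk").read_text())
```

With the frequencies listed above, allowed_mask(levels, 1300) yields "0 1 2 3 4 5 6". Writing that string to pp_dpm_sclk as root, after switching the performance level to manual, should keep the card out of the 1411 MHz state, but the setting does not survive a reboot.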
Last edited by UofM.MartinK on Sat Aug 15, 2020 9:23 pm, edited 1 time in total.
muziqaz
Posts: 942
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Post by muziqaz »

OpenCL v3075 is quite recent. So OpenCL is up to date and shouldn't be a factor.
FAH Omega tester