16600 consistently crashing on AMD Radeon VII
Moderators: Site Moderators, FAHC Science Team
Re: 16600 consistently crashing on AMD Radeon VII
Now that time has passed since the last post by UofM.MartinK, we can look again at what happened to his 16600 WU. It has been assigned to another machine and completed successfully.
https://apps.foldingathome.org/wu#proje ... 12&gen=402
Now that we know another machine has completed it successfully, it is more likely that this is a local problem or a driver problem. Hopefully, comparing your error report with the successful completion by someone else will show what differences caused the crash. Any suggestions?
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Site Moderator
- Posts: 6986
- Joined: Wed Dec 23, 2009 9:33 am
- Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB
Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
- Location: Land Of The Long White Cloud
- Contact:
Re: 16600 consistently crashing on AMD Radeon VII
gunnarre wrote: ...Is it possible to get FAHBench to work on a chosen good work unit? (project:13421 run:3765 clone:27 gen:1 works on the RX580 under Windows here...
Once FAHBench has been updated to support FahCore_22 then yes, you can run individual WUs.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
-
- Posts: 2040
- Joined: Sat Dec 01, 2012 3:43 pm
- Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441
Re: 16600 consistently crashing on AMD Radeon VII
I can provide a prebuilt Windows FAHBench with FahCore_22 if anyone is interested. Or you can build OpenMM and FAHBench from source yourself.
viewtopic.php?f=38&t=24225&p=327396&hilit=fahbench#p327396
-
- Posts: 59
- Joined: Tue Apr 07, 2020 8:53 pm
Re: 16600 consistently crashing on AMD Radeon VII
Out of the blue, two units succeeded (one 16600, one 13421), now a lot of fails again:
All of the failed 16600 WUs have since been completed by others:
https://apps.foldingathome.org/wu#proje ... 09&gen=126 (completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 92&gen=280 (failed by another AMD, completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 12&gen=402 (completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 50&gen=151 (completed by NVIDIA)
https://apps.foldingathome.org/wu#proje ... 45&gen=109 (completed by NVIDIA)
Whereas none of the 13421s show any results yet other than my failures.
I have yet to find a WU that my RX580 failed and another AMD GPU completed, but I am sure they are out there, since this all seems to be a statistics game.
In what regard would running FAHBench help in this case? This rig runs Ubuntu 20.04. If running FAHBench makes sense, is there anything I can help with, such as providing some WU work directory snapshots?
I will try underclocking the GPU next if I have the time.
Code: Select all
******************************* Date: 2020-08-12 *******************************
16:32:58:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3286 clone:5 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd63082338
16:33:07:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3286 clone:11 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0dd9e814fe6
17:39:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
17:39:46:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3240 clone:69 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d52793d8a1
17:39:56:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:13421 run:3240 clone:83 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d5e5dd3bdc
18:40:44:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:16600 run:0 clone:692 gen:280 core:0x22 unit:0x000001448f59f36f5ec36911ee8b859f
18:40:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13421 run:3200 clone:39 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d518f7b783
20:16:42:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:16600 run:0 clone:112 gen:402 core:0x22 unit:0x000001bb8f59f36f5ec36912518a1dea
23:20:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1450 gen:151 core:0x22 unit:0x000000aa8f59f36f5ec369114358cbbf
23:36:43:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:16600 run:0 clone:1745 gen:109 core:0x22 unit:0x000000798f59f36f5ec3691054460df8
08:43:33:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:16600 run:0 clone:1189 gen:420 core:0x22 unit:0x000001d48f59f36f5ec369117e373557
******************************* Date: 2020-08-13 *******************************
11:01:26:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:13421 run:7982 clone:63 gen:2 core:0x22 unit:0x0000000212bc7d9a5f26fb5a703f266e
12:59:37:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:7828 clone:80 gen:2 core:0x22 unit:0x0000000212bc7d9a5f26fb55a92eb41f
12:59:52:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:7695 clone:22 gen:2 core:0x22 unit:0x0000000312bc7d9a5f224a430ccd3540
13:00:03:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13421 run:7695 clone:46 gen:2 core:0x22 unit:0x0000000312bc7d9a5f224a41fbe4d9f2
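Since this looks like a statistics game, a small script can tally outcomes per project from "Sending unit results" lines like the ones above. This is only a rough sketch: the regular expression is written against the line format shown in this excerpt, and the function name tally_results is made up for illustration, not part of any FAH tool.

```python
import re
from collections import Counter

# Matches the "Sending unit results" lines as they appear in the excerpt above.
# Assumption: other client versions may format these lines differently.
LINE_RE = re.compile(
    r"Sending unit results: id:\d+ state:SEND "
    r"error:(?P<error>\w+) project:(?P<project>\d+)"
)


def tally_results(log_text):
    """Count (project, error) pairs in a FAH client log excerpt."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            key = (int(match.group("project")), match.group("error"))
            counts[key] += 1
    return counts


# Three lines copied from the log above, used as a small sample.
SAMPLE = """\
17:39:29:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:1509 gen:126 core:0x22 unit:0x000000988f59f36f5ec36911abc746db
17:39:46:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13421 run:3240 clone:69 gen:1 core:0x22 unit:0x0000000112bc7d9a5f1fc0d52793d8a1
08:43:33:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:16600 run:0 clone:1189 gen:420 core:0x22 unit:0x000001d48f59f36f5ec369117e373557
"""

if __name__ == "__main__":
    for (project, error), n in sorted(tally_results(SAMPLE).items()):
        print(f"project {project}: {error} x{n}")
```

Fed the full excerpt above instead of the three-line sample, the tally should come out as 5 FAULTY and 1 NO_ERROR for 16600, and 8 FAULTY and 1 NO_ERROR for 13421.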
-
- Posts: 946
- Joined: Sun Dec 16, 2007 6:22 pm
- Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
- Location: London
- Contact:
Re: 16600 consistently crashing on AMD Radeon VII
I believe we might make a decision to ban AMD folding on Linux. Fahbenching will not help with anything on Linux. It is clear as day that AMD on Linux is like winning a lottery. Does more harm than good.
FAH Omega tester
Re: 16600 consistently crashing on AMD Radeon VII
I've seen some reports of ROCm working when the regular AMD Pro drivers don't work. Could that be something to try?
Online: GTX 1660 Super + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 1050 Ti 4G OC, RX580
Re: 16600 consistently crashing on AMD Radeon VII
muziqaz wrote: I believe we might make a decision to ban AMD folding on Linux. Fahbenching will not help with anything on Linux. It is clear as day that AMD on Linux is like winning a lottery. Does more harm than good.
That would be a real shame but it's better than continuing to do more harm than good. I wonder if it's possible to (A) find dependable drivers and (B) (somehow) ban the "lottery" drivers.
It might be possible that all the WUs that fail with these AMD drivers are later completed with nV drivers on nV hardware but from the limited information I can gather, I have no way to explore the validity of such a guess.
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 946
- Joined: Sun Dec 16, 2007 6:22 pm
- Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
- Location: London
- Contact:
Re: 16600 consistently crashing on AMD Radeon VII
The problem with Linux is that it has 5 billion different flavours, with half a billion different drivers and another quarter of a billion OpenCL packages. Nearly everything is handpicked and DIY-created. On Windows you get a single official driver package, and if that doesn't work, you go throw a stone at MS's window and hope for the best. With Linux there are so many variables, it's insane. Combine that with the catastrophic hit-or-miss stability from AMD, and we have a recipe for disaster.
FAH Omega tester
-
- Posts: 59
- Joined: Tue Apr 07, 2020 8:53 pm
Re: 16600 consistently crashing on AMD Radeon VII
I don't know if it's the moon phase or something, but my rig returned another 16600 WU successfully, and is going strong on the next one.
Keep in mind, nothing changed on the rig - even temperature is pretty constant.
bruce wrote: It might be possible that all the WUs that fail with these AMD drivers are later completed with nV drivers on nV hardware but from the limited information I can gather, I have no way to explore the validity of such a guess.
I checked into some more of the 16600 WUs which failed on my RX580 on August 4th, and it's the same picture:
project:16600 run:0 clone:391 gen:243 (completed by NVIDIA)
project:16600 run:0 clone:1826 gen:16 (completed by NVIDIA)
project:16600 run:0 clone:1154 gen:368 (failed by another AMD, completed by NVIDIA)
project:16600 run:0 clone:1724 gen:53 (completed by NVIDIA)
But keep in mind, my RX580 also completed some other 16600 WUs in the meantime, although some with "restarting".
It is pretty clear that this is a very specific GPU (architecture) <> WU combination issue, perhaps facilitated by the driver. Beyond that, it is a lot of rolling dice. Once I get a lot of fails again, I will update the driver, but even if it then works for two days in a row without a fail, that won't tell us much.
I really wonder how we could get enough statistics around that... the client doesn't track driver version, I assume?
For reference, the driver since this rig was set up in April is amdgpu-pro 20.10-1048554 on Ubuntu 20.04.
Also, it might be that individual cards with the same chip, or from different brands (mine is a Sapphire Nitro+) behave differently.
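To put a rough number on why a couple of good days says little: if each WU failed independently with probability p, a run of n successes would occur with probability (1 - p)^n. The sketch below works that out. The independence assumption is a simplification, the failure rates used are illustrative values rather than measurements from this rig, and the function names are made up for this example.

```python
import math


def streak_probability(p_fail, n):
    """Chance of n successes in a row if each WU fails independently with p_fail."""
    return (1.0 - p_fail) ** n


def runs_needed(p_fail, alpha=0.05):
    """Smallest n for which an all-success streak has probability <= alpha.

    In other words: how many clean WUs in a row it takes before "nothing
    changed, we just got lucky" becomes an unlikely explanation.
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Hypothetical failure rates, chosen only to illustrate the scaling.
    for p in (0.5, 0.1):
        print(f"p_fail={p}: need {runs_needed(p)} clean WUs in a row")
```

With a high underlying failure rate, a handful of clean WUs is already informative; with a low one, it takes dozens. If failures come in clusters, as they seem to here, the independence assumption breaks down and even more data would be needed.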
-
- Posts: 124
- Joined: Sat Apr 18, 2020 1:50 pm
Re: 16600 consistently crashing on AMD Radeon VII
UofM.MartinK wrote: Also, it might be that individual cards with the same chip, or from different brands (mine is a Sapphire Nitro+) behave differently.
The Sapphire Nitro+ is factory overclocked. Could you try to lower it to the default RX580 clock?
Had to do this on my Sapphire 290x...
Re: 16600 consistently crashing on AMD Radeon VII
Quote: I really wonder how we could get enough statistics around that... the client doesn't track driver version, I assume?
The client knows the driver version but I don't believe that the FAHCore places that information in the error report. That would potentially be an enhancement for the FAHCore.
The fact that your GPU is (factory) overclocked is potentially another important piece of information, so we have to ask you to make that change as NormalDiffusion has suggested and report back to us.
Posting FAH's log:
How to provide enough info to get helpful support.
Re: 16600 consistently crashing on AMD Radeon VII
bruce wrote:
Quote: I really wonder how we could get enough statistics around that... the client doesn't track driver version, I assume?
The client knows the driver version but I don't believe that the FAHCore places that information in the error report. That would potentially be an enhancement for the FAHCore.
On Windows, science.log contains a version reported by the OpenCL implementation, and it is sent to the server when a work unit is returned. The value in parentheses (2348.4) corresponds to the Product version attribute of amdocl64.dll.
Code: Select all
PROFILE = FULL_PROFILE
VERSION = OpenCL 2.0 AMD-APP (2348.4)
NAME = AMD Accelerated Parallel Processing
VENDOR = Advanced Micro Devices, Inc.
Code: Select all
19:09:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:2348.4
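For anyone who wants to collect driver versions across several log files, the device line above can be picked apart with a regular expression. This is a minimal sketch that assumes the line always has the layout shown here; the helper name opencl_driver_versions is invented for this example.

```python
import re

# Pattern written against the single "OpenCL Device" line shown above.
# Assumption: the field order and names are the same in other logs.
DEVICE_RE = re.compile(
    r"OpenCL Device (?P<index>\d+): Platform:(?P<platform>\d+) "
    r"Device:(?P<device>\d+) Bus:(?P<bus>\d+) Slot:(?P<slot>\d+) "
    r"Compute:(?P<compute>[\d.]+) Driver:(?P<driver>[\d.]+)"
)


def opencl_driver_versions(log_text):
    """Return {device index: driver string} for each OpenCL device line found."""
    versions = {}
    for line in log_text.splitlines():
        match = DEVICE_RE.search(line)
        if match:
            versions[int(match.group("index"))] = match.group("driver")
    return versions


if __name__ == "__main__":
    line = ("19:09:45:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 "
            "Compute:1.2 Driver:2348.4")
    print(opencl_driver_versions(line))
```

The driver field is kept as a string on purpose, since values like 2348.4 and 3075.10 are version labels rather than decimal numbers.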
Re: 16600 consistently crashing on AMD Radeon VII
NormalDiffusion wrote:
UofM.MartinK wrote: Also, it might be that individual cards with the same chip, or from different brands (mine is a Sapphire Nitro+) behave differently.
The Sapphire nitro+ is factory overclocked. Could you try to lower it to the default rx580 clock?
Had to do this on my Sapphire 290x...
All RX580 Nitro+ cards have an unchanged base core clock (1257 MHz). Only the boost clock is a bit higher.
-
- Posts: 59
- Joined: Tue Apr 07, 2020 8:53 pm
Re: 16600 consistently crashing on AMD Radeon VII
Okay, found the driver in the log file, just have to figure out what the number refers to, thanks, _r2w_ben !
The thing with the boost clock is interesting. I did some more data mining and got munin to display all the temperatures and fan speeds the "sensors" package found for that rig:
Update 8/15: I originally wrote that "Temp1", rather than "Edge", was the GPU temperature. It was late in the day, sorry for the confusion.
"Edge" seems to be the GPU temperature, and whenever it is "high", most WUs fail, and when it is "medium", most WUs complete, and "low" might be no GPU WU active.
So something puts the card into either state (high temp perhaps boost clock state?) - the driver being a top candidate.
I will update the driver, and keep reporting. Perhaps also playing with downclocking later, but I will prioritize understanding the causes (since other AMD cards seem affected as well) and focusing on finding the "simplest" fix. An option to tell the driver to disable "boosting", for example.
Update: For now, instead of updating the driver, I just enabled the "POWER SAVING" profile instead of the default "3D_FULL_SCREEN" profile in pp_power_profile_mode. I still see the SCLK boost to 1411 MHz occasionally (the card seems to support the following frequencies: 300, 600, 900, 1145, 1215, 1257, 1300, and 1411 MHz). I will disallow the 1411 MHz state next if WUs are still failing in "POWER SAVING" mode.
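As a sketch of what disallowing the 1411 MHz state could look like, the snippet below parses the list of SCLK levels that amdgpu exposes in the pp_dpm_sclk sysfs file and builds the list of level indices at or below a chosen frequency. The sample text is constructed from the frequencies listed above, not captured from the rig, and the path (card0) and helper names are assumptions for illustration.

```python
import re

# Assumed sysfs location; the card index (card0) differs between systems.
SCLK_PATH = "/sys/class/drm/card0/device/pp_dpm_sclk"

# One level per line, e.g. "7: 1411Mhz *", where "*" marks the active level.
STATE_RE = re.compile(
    r"^(?P<level>\d+):\s*(?P<mhz>\d+)\s*MHz(?P<active>\s*\*)?\s*$",
    re.IGNORECASE,
)


def parse_sclk_states(text):
    """Return a list of (level, mhz, is_active) tuples from pp_dpm_sclk text."""
    states = []
    for line in text.splitlines():
        match = STATE_RE.match(line.strip())
        if match:
            states.append((
                int(match.group("level")),
                int(match.group("mhz")),
                match.group("active") is not None,
            ))
    return states


def levels_at_or_below(states, max_mhz):
    """Space-separated level indices whose clock does not exceed max_mhz."""
    return " ".join(str(level) for level, mhz, _ in states if mhz <= max_mhz)


# Illustrative sample built from the frequencies listed in the post.
SAMPLE = """\
0: 300Mhz
1: 600Mhz
2: 900Mhz
3: 1145Mhz
4: 1215Mhz
5: 1257Mhz
6: 1300Mhz
7: 1411Mhz *
"""

if __name__ == "__main__":
    states = parse_sclk_states(SAMPLE)
    print(levels_at_or_below(states, 1300))
```

On a real system the text would be read from SCLK_PATH instead of SAMPLE. As far as I know, the resulting string is what gets written back to pp_dpm_sclk (as root) after power_dpm_force_performance_level has been set to manual, but please check that against the amdgpu documentation for your kernel before relying on it.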
Code: Select all
04:34:20:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:3075.10
Last edited by UofM.MartinK on Sat Aug 15, 2020 9:23 pm, edited 1 time in total.
-
- Posts: 946
- Joined: Sun Dec 16, 2007 6:22 pm
- Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
- Location: London
- Contact:
Re: 16600 consistently crashing on AMD Radeon VII
OpenCL v3075 is quite recent. So OpenCL is up to date and shouldn't be a factor.
FAH Omega tester