16600 consistently crashing on AMD Radeon VII

Moderators: Site Moderators, FAHC Science Team

NormalDiffusion
Posts: 124
Joined: Sat Apr 18, 2020 1:50 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by NormalDiffusion »

muziqaz wrote:I have a Vega 64 Liquid Edition. I replaced the stock fan with two Noctua fans in push-pull, and it is running out of its mind. Temps are super low, and clocks are insane :D
That's my plan! But not at current prices... I could get another Radeon VII for the same money...
ViTe
Posts: 20
Joined: Tue Feb 14, 2012 2:22 am

Re: 16600 consistently crashing on AMD Radeon VII

Post by ViTe »

I have the same issue. Dedicated machine: Win7 64-bit, AMD RX570 4GB (not overclocked), Adrenalin 20.2.2, client version 7.6.9, OpenCL 2.0 AMD-APP driver 3004.8.
16600 keeps crashing with the same error message, "Particle coordinate is nan". Other projects that use Core 22 are fine.
12:50:36:WU00:FS01:0x22:Project: 16600 (Run 0, Clone 796, Gen 387)
12:50:36:WU00:FS01:0x22:Unit: 0x000001aa8f59f36f5ec369110b1585af
12:50:36:WU00:FS01:0x22:Digital signatures verified
12:50:36:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
12:50:36:WU00:FS01:0x22:Version 0.0.11
12:50:36:WU00:FS01:0x22: Checkpoint write interval: 25000 steps (5%) [20 total]
12:50:36:WU00:FS01:0x22: JSON viewer frame write interval: 5000 steps (1%) [100 total]
12:50:36:WU00:FS01:0x22: XTC frame write interval: 20000 steps (4%) [25 total]
12:50:36:WU00:FS01:0x22: Global context and integrator variables write interval: disabled
12:51:07:WU00:FS01:0x22:Completed 0 out of 500000 steps (0%)
12:52:56:WU00:FS01:0x22:Completed 5000 out of 500000 steps (1%)
12:54:44:WU00:FS01:0x22:Completed 10000 out of 500000 steps (2%)
12:56:32:WU00:FS01:0x22:Completed 15000 out of 500000 steps (3%)
12:58:21:WU00:FS01:0x22:An exception occurred at step 18071: Particle coordinate is nan
12:58:21:WU00:FS01:0x22:Max number of attempts to resume from last checkpoint (2) reached. Aborting.
12:58:21:WU00:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
12:58:21:WU00:FS01:0x22:Saving result file ..\logfile_01.txt
12:58:21:WU00:FS01:0x22:Saving result file science.log
12:58:21:WU00:FS01:0x22:Saving result file state.xml
12:58:25:WU00:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
12:58:26:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
Last edited by ViTe on Sat Aug 08, 2020 1:17 pm, edited 1 time in total.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 16600 consistently crashing on AMD Radeon VII

Post by Neil-B »

Just once or every time you get a 16600?
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

ViTe
Posts: 20
Joined: Tue Feb 14, 2012 2:22 am

Re: 16600 consistently crashing on AMD Radeon VII

Post by ViTe »

Definitely more than once:
12:42:39:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 1483, Gen 38)
12:43:10:WU01:FS01:0x22:Completed 275000 out of 500000 steps (55%)
12:44:39:WU01:FS01:0x22:An exception occurred at step 278860: Particle coordinate is nan
.........
**********************************
10:57:09:WU00:FS01:0x22:Project: 16600 (Run 0, Clone 1579, Gen 29)
10:57:40:WU00:FS01:0x22:Completed 25000 out of 500000 steps (5%)
10:59:21:WU00:FS01:0x22:An exception occurred at step 28362: Particle coordinate is nan

**********************************
02:05:44:WU00:FS01:0x22:Project: 16600 (Run 0, Clone 1096, Gen 319)
02:19:01:WU00:FS01:0x22:Completed 35000 out of 500000 steps (7%)
02:19:58:WU00:FS01:0x22:An exception occurred at step 35641: Particle coordinate is nan
**********************************
01:53:02:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 1983, Gen 19)
02:00:42:WU01:FS01:0x22:Completed 320000 out of 500000 steps (64%)
02:01:41:WU01:FS01:0x22:An exception occurred at step 322534: Particle coordinate is nan
Only a few runs reached 100% successfully. Most 16600 WUs fail.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Post by bruce »

@muziqaz/neil-b/etc.
Maybe we should check with the owner of p16600 and see if they can establish a pattern. I'm guessing that there are a number of unidentified "bad WUs" floating around. I'm also guessing they don't work with specific drivers -- or maybe they don't work in Linux (presumably also a driver issue).

Many of these AMD devices have been poorly serviced by GPUs.txt if they're GPUs and many have been turned off out of frustration if they're CPUs. The total number of individual devices probably represents a pretty wide variety of binary codes and the population of individual members is probably low. How do we find a representative spectrum of useful devices and identify their collective problems?

Many people, including myself, have migrated to nV devices but I have several AMD GPU cards sitting on my workbench which just need (A) a M/B and (B) the time and energy to configure a kit that can run them.

How should we attack this problem? I think we need to systematically gather more data, but that's not the only thing that needs to be done. How many are off-line because of the 192.0.2.1 blacklisting process? Representative samples DO need to get data into project 17100.

See also "Folding is not fun right now - lots of trouble, no result" and others.
muziqaz
Posts: 946
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Post by muziqaz »

@Bruce, I did contact the owner of the project; they have yet to respond :)
I received an answer from Sukrit regarding 16448. He says it has a very high failure rate, though overnight testing on 3 different AMD GPUs did not produce any errors at all.
FAH Omega tester
gunnarre
Posts: 559
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: 16600 consistently crashing on AMD Radeon VII

Post by gunnarre »

The work units have a core log file (logfile_01.txt) in them. Does this file get uploaded to the WS/CS, even if the WU gets dumped? If so, this file could be parsed server-side for the errors in question. But it doesn't look like this log file has the required information about which GPU, OpenCL and CUDA drivers are installed - only the CPU and OS version.
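A rough sketch of what such parsing could look like, assuming the uploaded file uses the same line format as the client log excerpts posted earlier in this thread (the core's own logfile_01.txt may differ). The function name `find_nan_failures` and the returned layout are made up for illustration; this is not something the work servers are known to run.

```python
import re

# Patterns modeled on the log excerpts posted in this thread. Assumption:
# the file being parsed carries the same "Project:" and exception lines.
PROJECT_RE = re.compile(
    r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)"
)
NAN_RE = re.compile(
    r"An exception occurred at step (\d+): Particle coordinate is nan"
)


def find_nan_failures(log_text):
    """Return one dict per NaN exception, tagged with the last PRCG seen.

    Limitation: a log that interleaves several slots would need the
    WUxx/FSxx prefix tracked as well; this sketch ignores it.
    """
    failures = []
    current = None  # (project, run, clone, gen) of the WU being logged
    for line in log_text.splitlines():
        m = PROJECT_RE.search(line)
        if m:
            current = tuple(int(g) for g in m.groups())
            continue
        m = NAN_RE.search(line)
        if m and current is not None:
            failures.append({"prcg": current, "step": int(m.group(1))})
    return failures


sample = """\
12:42:39:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 1483, Gen 38)
12:43:10:WU01:FS01:0x22:Completed 275000 out of 500000 steps (55%)
12:44:39:WU01:FS01:0x22:An exception occurred at step 278860: Particle coordinate is nan
"""

for failure in find_nan_failures(sample):
    print(failure)
```

As noted, this only recovers the project/run/clone/gen and the failing step; the GPU and driver details would still have to come from somewhere else.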
Online: GTX 1660 Super + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 1050 Ti 4G OC, RX580
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Post by bruce »

For support, we ask you to post the first ~100 lines of FAH's primary log file. It shows the installation configuration of the GPU(s). Mine looks like this. (Yours will be different, of course).

18:19:51: OS Arch: AMD64
18:19:51: GPUs: 1
18:19:51: GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:7 GP107 [GeForce GTX 1050 Ti] 2138
18:19:51: CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:6.1 Driver:11.0
18:19:51:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:446.14

My GTX 1050 Ti gpu is running with Driver:446.14 and with CUDA 11.0
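For anyone collecting this information from many logs, here is a minimal sketch that pulls the GPU and driver details out of a header like the one above. The regular expressions are inferred only from the lines quoted in this thread, and `summarize_gpus` is a hypothetical helper name, so other client versions may need adjustments.

```python
import re

# Matches e.g. "GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:7 GP107 [GeForce GTX 1050 Ti] 2138"
GPU_RE = re.compile(
    r"GPU (\d+): Bus:(\d+) Slot:(\d+) Func:(\d+) (\w+):(\d+) (.+)$"
)
# Matches the "CUDA Device 0: ..." and "OpenCL Device 0: ..." lines
DRIVER_RE = re.compile(
    r"(CUDA|OpenCL) Device (\d+): Platform:(\d+) Device:(\d+) "
    r"Bus:(\d+) Slot:(\d+) Compute:([\d.]+) Driver:([\d.]+)"
)


def summarize_gpus(log_text):
    """Join driver lines to GPU lines by (bus, slot).

    Assumes GPU lines appear before the device lines, as in the sample.
    """
    gpus = {}  # (bus, slot) -> (vendor, description)
    rows = []
    for line in log_text.splitlines():
        line = line.rstrip()
        m = GPU_RE.search(line)
        if m:
            gpus[(int(m.group(2)), int(m.group(3)))] = (m.group(5), m.group(7))
            continue
        m = DRIVER_RE.search(line)
        if m:
            key = (int(m.group(5)), int(m.group(6)))
            vendor, description = gpus.get(key, ("unknown", "unknown"))
            rows.append({
                "api": m.group(1),
                "vendor": vendor,
                "description": description,
                "driver": m.group(8),
            })
    return rows


sample = """\
18:19:51:        GPUs: 1
18:19:51:       GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:7 GP107 [GeForce GTX 1050 Ti] 2138
18:19:51: CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:6.1 Driver:11.0
18:19:51:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:446.14
"""

for row in summarize_gpus(sample):
    print(row)
```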
Nuitari
Posts: 78
Joined: Sun Jun 09, 2019 4:03 am
Hardware configuration: 1x Nvidia 1050ti
1x Nvidia 1660Super
1x Nvidia GTX 660
1x Nvidia 1060 3gb
1x AMD rx570
2x AMD rx560
1x AMD Ryzen 7 PRO 1700
1x AMD Ryzen 7 3700X
1x AMD Phenom II
1x AMD A8-9600
1x Intel i5-4590S

Re: 16600 consistently crashing on AMD Radeon VII

Post by Nuitari »

06:28:12: GPU 0: Bus:0 Slot:2 Func:0 INTEL:1 Gen9p5/GT2 [UHD Graphics 630]
06:28:12: GPU 1: Bus:1 Slot:0 Func:0 AMD:5 Baffin XT [Radeon RX 460]

The card is an RX560.
Lots of failures for both p16600 and p13421.
Out of 39 WUs for 16600, only 6 succeeded...

Of particular note, this one (project:16600 run:0 clone:1314 gen:197) has a driver error displayed in dmesg:

Code:

[Sat Aug  8 04:55:18 2020] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d780402 for process FahCore_22 pid 1343 thread FahCore_22 pid 1343
[Sat Aug  8 04:55:18 2020] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0090AFAF
[Sat Aug  8 04:55:18 2020] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E004002
[Sat Aug  8 04:55:18 2020] amdgpu 0000:01:00.0: VM fault (0x02, vmid 7, pasid 32769) at page 9482159, read from 'TC3' (0x54433300) (4)
It's the only one with that error message that I could match up with a log entry.
There are a few more very similar errors on one of my RX570-based rigs:

Code:

[Thu Aug  6 23:25:24 2020] amdgpu 0000:07:00.0: GPU fault detected: 147 0x0cf80402 for process FahCore_22 pid 3728 thread FahCore_22 pid 3728
[Thu Aug  6 23:25:24 2020] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00C0AD9F
[Thu Aug  6 23:25:24 2020] amdgpu 0000:07:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A004002
[Thu Aug  6 23:25:24 2020] amdgpu 0000:07:00.0: VM fault (0x02, vmid 5, pasid 32800) at page 12627359, read from 'TC3' (0x54433300) (4)
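To compare faults like these across machines, the dmesg lines can be reduced to a few fields. The sketch below is based only on the two excerpts above; `parse_amdgpu_faults` and `decode_tag` are hypothetical helper names, and other kernel versions may print these messages differently. Two things it makes visible: the "page" value is the VM_CONTEXT1_PROTECTION_FAULT_ADDR register written in decimal, and the hex number after 'TC3' is simply the ASCII encoding of that tag.

```python
import re

FAULT_RE = re.compile(
    r"amdgpu (\S+): GPU fault detected: \d+ 0x[0-9a-fA-F]+ "
    r"for process (\S+) pid (\d+)"
)
ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9a-fA-F]+)")
VMFAULT_RE = re.compile(
    r"VM fault \(0x[0-9a-fA-F]+, vmid \d+, pasid \d+\) at page (\d+), "
    r"(\w+) from '([^']*)' \(0x([0-9a-fA-F]+)\)"
)


def decode_tag(hex_digits):
    """Turn e.g. '54433300' into 'TC3' (ASCII bytes, NUL padding dropped)."""
    raw = int(hex_digits, 16).to_bytes(4, "big")
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")


def parse_amdgpu_faults(dmesg_text):
    """Group the lines that follow each 'GPU fault detected' into one dict."""
    faults = []
    for line in dmesg_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            faults.append({"pci": m.group(1), "process": m.group(2),
                           "pid": int(m.group(3))})
            continue
        if not faults:
            continue  # detail line without a preceding fault header
        m = ADDR_RE.search(line)
        if m:
            faults[-1]["fault_addr"] = int(m.group(1), 16)
            continue
        m = VMFAULT_RE.search(line)
        if m:
            faults[-1]["page"] = int(m.group(1))
            faults[-1]["access"] = m.group(2)
            faults[-1]["client"] = m.group(3)
            faults[-1]["client_from_hex"] = decode_tag(m.group(4))
    return faults


sample = """\
[Sat Aug  8 04:55:18 2020] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d780402 for process FahCore_22 pid 1343 thread FahCore_22 pid 1343
[Sat Aug  8 04:55:18 2020] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0090AFAF
[Sat Aug  8 04:55:18 2020] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E004002
[Sat Aug  8 04:55:18 2020] amdgpu 0000:01:00.0: VM fault (0x02, vmid 7, pasid 32769) at page 9482159, read from 'TC3' (0x54433300) (4)
"""

for fault in parse_amdgpu_faults(sample):
    print(fault)
```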
muziqaz
Posts: 946
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Post by muziqaz »

Nuitari wrote:06:28:12: GPU 0: Bus:0 Slot:2 Func:0 INTEL:1 Gen9p5/GT2 [UHD Graphics 630]
06:28:12: GPU 1: Bus:1 Slot:0 Func:0 AMD:5 Baffin XT [Radeon RX 460]

Card is an RX560
In FAHControl, you have to delete the Intel GPU slot. It would be best to manually remove both GPU slots and then add a new slot for just the Intel iGPU. Also, your RX560 is weirdly recognised as an RX460. Is the OS in a VM?
FAH Omega tester
Joe_H
Site Admin
Posts: 7937
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 16600 consistently crashing on AMD Radeon VII

Post by Joe_H »

muziqaz wrote:In FAHControl, you have to delete the Intel GPU slot. It would be best to manually remove both GPU slots and then add a new slot for just the Intel iGPU. Also, your RX560 is weirdly recognised as an RX460. Is the OS in a VM?
Do NOT delete the slot for the Intel GPU if it is detected, especially if running v7.6.13. There is a bug in the F@h client: it will just recreate the slot as long as that Intel GPU is enabled as a test platform for the internal testers. I don't know whether the bug also causes the same problem on versions older than 7.6.13.

Deleting the drivers and OpenCL support for the Intel iGPU may keep it from being detected by the client.

The RX 460 and RX 560 are the same device; AMD just reused the GPU chip in a slightly different configuration but with the same Device ID.

To keep the Intel GPU from requesting work since it will not get any, first pause the slot by right-clicking on it in FAHControl. Then in Configure select the Slots tab and click on the GPU slot for the Intel GPU and then Edit. Add the Extra Slot Option 'pause-on-start' and set its value to 'true'. OK the changes and Save.

Afterwards, if you pause folding, restart the slots by right-clicking on each one except the slot for the Intel GPU.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Nuitari
Posts: 78
Joined: Sun Jun 09, 2019 4:03 am
Hardware configuration: 1x Nvidia 1050ti
1x Nvidia 1660Super
1x Nvidia GTX 660
1x Nvidia 1060 3gb
1x AMD rx570
2x AMD rx560
1x AMD Ryzen 7 PRO 1700
1x AMD Ryzen 7 3700X
1x AMD Phenom II
1x AMD A8-9600
1x Intel i5-4590S

Re: 16600 consistently crashing on AMD Radeon VII

Post by Nuitari »

Nowhere was it ever mentioned that the Intel GPU slot was causing any issue. It's part of what the client sees; however, the client cannot use it, as it's not included in the OpenCL devices.
NormalDiffusion
Posts: 124
Joined: Sat Apr 18, 2020 1:50 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by NormalDiffusion »

Some more info about 16600 WUs over the weekend.
It seems to be an AMD-only problem:

Code:

- Titan Xp on i9-7900x: 24 from 24 completed  -- 0% failure rate 
- RTX 2070 Super on dual Xeon E5-2690: 13 from 13 completed  -- 0% failure rate 
- Radeon VII on Xeon E5-1650v4: 4 from 19 completed  -- 79% failure rate 
- Radeon VII on Xeon E5-1620v2: 2 from 26 completed  -- 92% failure rate 
- Radeon 290x on Xeon E5-1620v2: 2 from 20 completed  -- 90% failure rate
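The percentages in the list above can be reproduced from the completed/total counts, and the same counts can be pooled per vendor. This is only a small sketch; the vendor labels and the `failure_rate` helper are added here for illustration and are not part of the original tally.

```python
# (label, vendor, completed, total) taken from the counts listed above
results = [
    ("Titan Xp on i9-7900x", "NVIDIA", 24, 24),
    ("RTX 2070 Super on dual Xeon E5-2690", "NVIDIA", 13, 13),
    ("Radeon VII on Xeon E5-1650v4", "AMD", 4, 19),
    ("Radeon VII on Xeon E5-1620v2", "AMD", 2, 26),
    ("Radeon 290x on Xeon E5-1620v2", "AMD", 2, 20),
]


def failure_rate(completed, total):
    """Percentage of assigned WUs that did not complete."""
    return 100.0 * (total - completed) / total


for label, _vendor, completed, total in results:
    print(f"{label}: {failure_rate(completed, total):.0f}% failure rate")

# Pool the counts per vendor to compare AMD and NVIDIA as groups.
pooled = {}
for _label, vendor, completed, total in results:
    done, assigned = pooled.get(vendor, (0, 0))
    pooled[vendor] = (done + completed, assigned + total)

for vendor, (done, assigned) in pooled.items():
    print(f"{vendor}: {done} of {assigned} completed, "
          f"{failure_rate(done, assigned):.0f}% failure rate")
```

Pooled this way, the NVIDIA cards completed 37 of 37 WUs and the AMD cards 8 of 65, which is roughly an 88% failure rate on AMD.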
muziqaz
Posts: 946
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Post by muziqaz »

The project owners haven't replied yet. Until they do, there is nothing we can do, I'm afraid.
FAH Omega tester
ViTe
Posts: 20
Joined: Tue Feb 14, 2012 2:22 am

Re: 16600 consistently crashing on AMD Radeon VII

Post by ViTe »

muziqaz wrote:The project owners haven't replied yet. Until they do, there is nothing we can do, I'm afraid.
We can. Ban the WS on your machine and you'll never see 16600 again :roll: