16600 consistently crashing on AMD Radeon VII

Moderators: Site Moderators, FAHC Science Team

ThWuensche
Posts: 79
Joined: Fri May 29, 2020 4:10 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by ThWuensche »

muziqaz wrote: nVidia is doing that.
An FAH dev creates a fahcore > an nVidia rep takes that core and runs it through their hardware in their lab with all their driver profilers and tools > the driver team either optimises the drivers for the fahcore, or gives suggestions/submits code patches to the FAH devs to improve the fahcore.
Hardware vendor does not need to have source code in order to optimise for the code.
I know how much nVidia is involved, and I just don't see the same involvement from AMD, not even close, which is a shame, as their hardware was always very strong in pure compute tasks.
Also, FAH devs mainly have nVidia hardware, as far as I know. I do not believe there are any AMD GPUs in their possession. At least we can be content that AMD CPUs punched through the Intel wall when it comes to FAH.
As far as nVidia support goes, that's good. AMD is also listed as a contributor on the FAH website; maybe they could be motivated to provide such support, at least for the ROCm stack, the project they advertise as an open and scalable HPC solution. It can't be in their interest to have nVidia recognized as running without problems while AMD is consistently troublesome.

If FAH developers develop and test mostly on nVidia hardware, but do not have AMD GPUs, then it's no wonder that FAH has more trouble on AMD hardware in the wild. But in that case AMD should be asked for what they have their logo as supporter on the website.

As for the need for source code, there may be a difference between optimizing and debugging. For optimizing, it may be enough to profile which kernels are run and how often, without needing to understand the calculation flow. For debugging, however, it is very helpful to understand what is going on, what should be going on, and at what point and under which preconditions a failure occurs. All the more so because some of the failures seem to happen early in a run, which might be caused by missing or invalid initialization. That understanding is best achieved by following the code through the logic of the source and observing the effects (tracking variable values ...). If FAH developers can't follow that flow for lack of appropriate (AMD) systems and want to rely on feedback from a series of published core versions, that looks to me like a rather long turnaround time for debugging. To be effective, such turnaround times should be measured in minutes, not in the weeks between released core versions. Just my thoughts; I may well be missing something important.
n_w95482
Posts: 66
Joined: Tue May 01, 2012 12:46 am
Hardware configuration: CPU: Ryzen 7 5800X3D

GPU: Radeon RX 6700 XT, Radeon RX 6900 XT
Location: California

Re: 16600 consistently crashing on AMD Radeon VII

Post by n_w95482 »

muziqaz wrote:Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU :)
The change seems to have worked on my RX 580. It finished the 16600 WU it was working on when I posted earlier, has since completed 25 consecutive 13423s, and is currently working on a 13422, all with no issues. Thank you!
Folding since December 2003. In memory of my mother, who lost her battle with cancer.

muziqaz
Posts: 942
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Post by muziqaz »

n_w95482 wrote:
muziqaz wrote:Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU :)
The change seems to have worked on my RX 580. It finished the 16600 WU it was working on when I posted earlier, has since completed 25 consecutive 13423s, and is currently working on a 13422, all with no issues. Thank you!
Good to know, thanks :)
FAH Omega tester
UofM.MartinK
Posts: 59
Joined: Tue Apr 07, 2020 8:53 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by UofM.MartinK »

Same here, no more 16600 for my RX580 since the last one, received 2020-08-19 02:00:51 GMT. And as if to celebrate that, that last WU, project:16600 run:0 clone:1393 gen:201, completed after four "Particle coordinate is nan" checkpoint resumes :)
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: 16600 consistently crashing on AMD Radeon VII

Post by PantherX »

ThWuensche wrote:...But in that case AMD should be asked for what they have their logo as supporter on the website...
The F@H system has almost 20 years under its belt. AMD (at that time, ATI) was the first GPU vendor to support folding. They did play a decent role in developing FahCore and optimizing their GPUs/drivers for folding. However, things have changed over the years, and currently it seems that Nvidia has sufficient presence to provide the right guidance/testing/debugging to ensure that FahCore can fully utilize their GPUs. I do hope that someone from AMD will be able to provide a similar level of support, but time will tell :)
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
ThWuensche
Posts: 79
Joined: Fri May 29, 2020 4:10 pm

Re: 16600 consistently crashing on AMD Radeon VII

Post by ThWuensche »

PantherX wrote:
ThWuensche wrote:...But in that case AMD should be asked for what they have their logo as supporter on the website...
The F@H system has almost 20 years under its belt. AMD (at that time, ATI) was the first GPU vendor to support folding. They did play a decent role in developing FahCore and optimizing their GPUs/drivers for folding. However, things have changed over the years, and currently it seems that Nvidia has sufficient presence to provide the right guidance/testing/debugging to ensure that FahCore can fully utilize their GPUs. I do hope that someone from AMD will be able to provide a similar level of support, but time will tell :)
So let's hope somebody from the AMD ROCm team is following this thread ;)
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 16600 consistently crashing on AMD Radeon VII

Post by bruce »

There are a number of GPUs that you might be talking about. Please post the PCI codes associated with the one you're using (from lspci or GPU-Z).

I understand P16600 is not being assigned to AMD Species 5, which is a broad group of AMD devices.
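
For anyone on Linux who wants to pull those PCI codes out of `lspci -nn` output rather than reading them by eye, here is a minimal Python sketch. It is only an illustration: the function name `gpu_pci_ids`, the list of device-class strings, and the sample lines are assumptions made for this example, not anything taken from FAH or its GPU list, and the sample output is illustrative rather than copied from a real machine.

```python
import re

# `lspci -nn` appends a "[vendor:device]" pair of 4-digit hex IDs to each line,
# e.g. "[1002:66af]". The class code such as "[0300]" has no colon, so it is
# not matched by this pattern.
PCI_ID = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]", re.IGNORECASE)

# Device-class names that usually denote a GPU (an assumption for this sketch).
GPU_CLASSES = ("VGA compatible controller", "3D controller", "Display controller")

# Illustrative sample of `lspci -nn` output (hypothetical system).
SAMPLE = """\
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Root Complex [1022:1480]
0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon VII] [1002:66af] (rev c1)
"""


def gpu_pci_ids(lspci_output):
    """Return a list of (vendor_id, device_id) tuples for GPU-like devices."""
    ids = []
    for line in lspci_output.splitlines():
        # Skip anything that is not a display-type device.
        if not any(cls in line for cls in GPU_CLASSES):
            continue
        matches = PCI_ID.findall(line)
        if matches:
            # Take the last pair on the line, which is the device's own ID.
            vendor, device = matches[-1]
            ids.append((vendor.lower(), device.lower()))
    return ids


print(gpu_pci_ids(SAMPLE))
```

On a real system the input could come from `subprocess.run(["lspci", "-nn"], capture_output=True, text=True).stdout` instead of `SAMPLE`. On Windows, GPU-Z shows the same vendor and device IDs in its "Device ID" field.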
muziqaz
Posts: 942
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: 16600 consistently crashing on AMD Radeon VII

Post by muziqaz »

bruce wrote: There are a number of GPUs that you might be talking about. Please post the PCI codes associated with the one you're using (from lspci or GPU-Z).

I understand P16600 is not being assigned to AMD Species 5, which is a broad group of AMD devices.
Bruce, they all run AMD, and they are not supposed to get any 16600 :) They are just reporting that they are no longer receiving them on AMD species 5. All is good.
FAH Omega tester