Re: 16600 consistently crashing on AMD Radeon VII
Posted: Thu Aug 20, 2020 7:02 pm
by ThWuensche
muziqaz wrote:
nVidia is doing that.
FAH dev creates fahcore>nVidia rep takes that core and runs it through their hardware in their lab with all their driver profilers and tools>driver team either optimises the drivers for the fahcore, or they give suggestions/submit patches of code to fah devs to improve fahcore.
Hardware vendor does not need to have source code in order to optimise for the code.
I know how much nVidia is involved, and I just don't see the same involvement from AMD, not even close, which is a shame, as their hardware was always very strong in pure compute tasks.
Also, fah devs mainly have nVidia hardware, as far as I know. I do not believe there are any AMD GPUs in their possession. At least we can be content that AMD CPUs punched through Intel wall when it comes to fah
As far as nVidia support goes, that's good. AMD is also listed as a contributor on the FAH website; maybe they could be motivated to provide such support at least for the ROCm stack, the project they advertise as an open and scalable HPC solution. It can't be in their interest to have nVidia recognized as running without problems and AMD as consistently troublesome.
If FAH developers develop and test mostly on nVidia hardware and do not have AMD GPUs, then it's no wonder that FAH has more trouble on AMD hardware in the wild. But in that case AMD should be asked why their logo is listed as a supporter on the website.
As for the need for source code, there may be a difference between optimizing and debugging. For optimizing it may be enough to profile which kernels run and how often, without needing to understand the calculation flow. For debugging, however, it is very helpful to understand what is going on, what should be going on, and at what point and under which preconditions a failure occurs. That is even more true because some of the failures appear to happen early, which might be caused by missing or invalid initialization. Understanding is best achieved by following the logic of the source code and observing the effects (following variable values ...). If FAH developers can't follow that flow for lack of appropriate (AMD) systems and instead rely on feedback from successive published core versions, that looks to me like a rather long turnaround time for debugging. To be effective, such turnaround times should be measured in minutes, not in the weeks between released core versions. Just my thoughts; I may well be missing something important.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Thu Aug 20, 2020 8:32 pm
by n_w95482
muziqaz wrote:Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU
The change seems to have worked on my RX 580. It finished the 16600 WU it was working on when I posted earlier, has since completed 25 consecutive 13423s, and is currently working on a 13422, all with no issues. Thank you!
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Thu Aug 20, 2020 8:39 pm
by muziqaz
n_w95482 wrote:muziqaz wrote:Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU
The change seems to have worked on my RX 580. It finished the 16600 WU it was working on when I posted earlier, has since completed 25 consecutive 13423s, and is currently working on a 13422, all with no issues. Thank you!
Good to know, thanks
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 21, 2020 1:12 am
by UofM.MartinK
Same here: no more 16600 for my RX580 since the last one, received 2020-08-19 02:00:51 GMT. And as if to celebrate, that last WU (project:16600 run:0 clone:1393 gen:201) completed after four "Particle coordinate is nan" checkpoint resumes.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 21, 2020 11:50 am
by PantherX
ThWuensche wrote:...But in that case AMD should be asked why their logo is listed as a supporter on the website...
The F@H system has almost 20 years under its belt. AMD (at that time, ATI) was the first GPU vendor to support folding. They did play a decent role in developing FahCore to optimize their GPUs/drivers for folding. However, things have changed over the years, and currently it seems that Nvidia has sufficient presence to provide the right guidance/testing/debugging to ensure that FahCore can fully utilize their GPUs. I do hope that someone from AMD will be able to provide a similar level of support, but time will tell.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 21, 2020 2:46 pm
by ThWuensche
PantherX wrote:ThWuensche wrote:...But in that case AMD should be asked why their logo is listed as a supporter on the website...
The F@H system has almost 20 years under its belt. AMD (at that time, ATI) was the first GPU vendor to support folding. They did play a decent role in developing FahCore to optimize their GPUs/drivers for folding. However, things have changed over the years, and currently it seems that Nvidia has sufficient presence to provide the right guidance/testing/debugging to ensure that FahCore can fully utilize their GPUs. I do hope that someone from AMD will be able to provide a similar level of support, but time will tell.
So let's hope somebody from the AMD ROCm team is following this thread.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 21, 2020 4:45 pm
by bruce
There are a number of GPUs that you might be talking about. Please post the PCI codes associated with the one you're using (from lspci or GPU-Z).
I understand P16600 is not being assigned to AMD Species 5, which is a broad group of AMD devices.
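For anyone unsure how to pull those PCI codes out on Linux, here is a minimal Python sketch. It assumes the usual `lspci -nn` line layout, where each device line ends with a bracketed `[vendor:device]` hex pair. The function names `read_lspci` and `parse_pci_ids` and the `SAMPLE` text are made up for illustration; the sample is not output from anyone's machine in this thread.

```python
import re
import subprocess

# Matches the "[vendor:device]" hex pair that `lspci -nn` appends, e.g. "[1002:66af]".
# The class code such as "[0300]" has no colon, so it is not matched.
PCI_ID = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")

# Device class names that lspci uses for GPUs (assumed list, may not be exhaustive).
GPU_CLASSES = ("VGA compatible controller", "Display controller", "3D controller")


def read_lspci():
    """Run `lspci -nn` and return its text output (Linux with pciutils only)."""
    return subprocess.run(
        ["lspci", "-nn"], capture_output=True, text=True, check=True
    ).stdout


def parse_pci_ids(lspci_output):
    """Return a list of (vendor_id, device_id) pairs for GPU-class lines."""
    found = []
    for line in lspci_output.splitlines():
        if not any(cls in line for cls in GPU_CLASSES):
            continue  # skip bridges, NICs, audio devices, ...
        matches = PCI_ID.findall(line)
        if matches:
            vendor, device = matches[-1]  # the ID pair comes last on the line
            found.append((vendor.lower(), device.lower()))
    return found


# Illustrative sample only, written by hand in the `lspci -nn` style.
SAMPLE = """\
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex [1022:1480]
0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon VII] [1002:66af] (rev c1)
"""

if __name__ == "__main__":
    for vendor, device in parse_pci_ids(SAMPLE):
        print(f"vendor={vendor} device={device}")
```

On a real system, pass `read_lspci()` instead of `SAMPLE`. On Windows, GPU-Z shows the same vendor and device IDs in its Device ID field.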
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 21, 2020 5:56 pm
by muziqaz
bruce wrote:There are a number of GPUs that you might be talking about. Please post the PCI codes associated with the one you're using (from lspci or GPU-Z).
I understand P16600 is not being assigned to AMD Species 5, which is a broad group of AMD devices.
Bruce, they all run AMD, and they are not supposed to get any 16600.
They are just reporting that they are no longer receiving them on AMD Species 5. All is good.