Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 19, 2020 10:00 am
by Neil-B
@muziqaz ... so you are confirming that specific (guessing AMD) cards are running failure rates in the 80-90% range across a range of Projects? ... If so, can we change the thread topic to something more relevant than the current one, which implies the discussion/issue is about a single project? ... The two recent sets of failure rates posted in this thread have very different failure rate profiles across projects: one is specific to this project for the most part, the other appears to be failing on potentially all projects (albeit there may be some projects it is actually working on that have not been posted). This might indicate two different types of scenario?

Maybe even move the thread out of the "Issues with a specific WU" forum if it is a much wider issue, as from what you are confirming it appears not to be specific to one WU.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 19, 2020 1:12 pm
by UofM.MartinK
At least for me, this thread's focus and main finding, as in the title, is p16600 - which now has been disabled for everything AMD before Navi (5000 series), yeah! Thanks, muziqaz!

The "waters" are muddied if you just look at the plain failure rates per project reported in this thread, because at least the p134XX series is also showing up as failures (I guess on many cards, but also on the pre-Navi AMD cards which are affected by the totally unrelated p16600). But those p134XX failures seem to be far less critical, because they almost always happen in the first 9-17 seconds. And if they don't fail right at the beginning, those projects usually seem to complete without any further hiccup.

This thread is a very good example of how several effects can overlap and make the data hard to interpret, and of how that made the "usually correct" explanation of overclocked hardware etc. appear very convincing.

Even I was convinced my card had a hardware, clock or driver issue, and spent almost a week fiddling with drivers & underclocking, and now even more posters come forward reporting going through the same motions... almost like a #p16600metoo :)

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 19, 2020 1:21 pm
by muziqaz
The goal of this thread was finally achieved by bringing this issue to the project owner's attention. The project is now being checked in more detail and has been excluded for problematic hardware (for now).

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 19, 2020 5:46 pm
by Neil-B
Please could you get the Moonshot 134xx series Project owners to assess the impact of rapid failures from these types of cards ... Having the vast majority fail within the first 20 seconds may matter little from the folder's perspective, but I really can't see that it makes sense overall ... If a number of these cards are failing WUs at speed, then WUs that could be valid and fold without issue on, say, nvidia cards may get 5 failures from these cards and be labelled as bad when they aren't necessarily so, which makes little sense to me.

The speed at which these cards fail potentially OK WUs means that it would not take very many of them to raise the statistical chance of a WU being hit by 5 of these card-related failures to non-negligible levels ... Moonshot WUs are quick results, but potentially throwing away valid results due to a group of "rapid fail doesn't matter because it doesn't impact our usefulness" GPUs seems madness to me :(

Perhaps someone could check all WUs that have had five failed returns and check that they are not all quick failures from these types of gpu?
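
A rough sketch of the arithmetic behind this concern, not anything taken from the FAH assignment servers. It assumes that each assignment of a WU fails independently, that a WU is given up on after 5 failed returns (as described above), and that cards which fail fast come back for new work sooner and so take a larger share of assignments than their head count suggests. The card counts, failure rates and the 20-minute full run time are all hypothetical placeholders.

```python
# Sketch only: every number below is an illustrative assumption, not FAH data.

def mean_seconds(fail_rate: float, fail_seconds: float, full_seconds: float) -> float:
    """Average time a card holds one WU: it either fails fast or runs to completion."""
    return fail_rate * fail_seconds + (1.0 - fail_rate) * full_seconds


def assignment_share(bad_cards: int, good_cards: int,
                     bad_mean_s: float, good_mean_s: float) -> float:
    """Fraction of all WU assignments that land on the fast-failing cards."""
    bad_throughput = bad_cards / bad_mean_s      # assignments per second
    good_throughput = good_cards / good_mean_s
    return bad_throughput / (bad_throughput + good_throughput)


def p_all_attempts_fail(share: float, bad_fail: float, good_fail: float,
                        attempts: int = 5) -> float:
    """Chance that `attempts` independent assignments of one healthy WU all fail."""
    per_attempt = share * bad_fail + (1.0 - share) * good_fail
    return per_attempt ** attempts


# Hypothetical fleet: 50 fast-failing cards among 10000 healthy ones.
BAD_FAIL, GOOD_FAIL = 0.85, 0.03   # assumed per-assignment failure rates
FAIL_S, FULL_S = 20.0, 1200.0      # ~20 s fast failure; assumed 20 min full run

bad_mean = mean_seconds(BAD_FAIL, FAIL_S, FULL_S)
good_mean = mean_seconds(GOOD_FAIL, FAIL_S, FULL_S)
share = assignment_share(50, 10000, bad_mean, good_mean)

print(f"share of assignments on fast-failing cards: {share:.2%}")
print(f"chance a healthy WU collects 5 failures: "
      f"{p_all_attempts_fail(share, BAD_FAIL, GOOD_FAIL):.2e}")
```

Changing the card counts or the assumed run time shows how sensitive the result is to how many such cards there really are, which is exactly the number only the project owners can check.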

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 19, 2020 6:04 pm
by muziqaz
The Moonshot project owner knows about the failed WUs, and has accepted it as is. They are not failing as much as 16600 or 16448.
Moonshot has a similar failure rate on all different GPUs, not just AMD. This is due to the nature of the simulation being done. The owner of the Moonshot project is also one of the lead devs for fahcore_22, so while other project owners are researchers using fahcore_22, the Moonshot owner actually developed and updated fahcore_22 to be able to do Moonshot-type simulations :) In that update process a lot of other bugs and issues have been fixed :)

P.S. Out of 77 13422s my 3 different GPUs received, only 2 of them failed. One on Navi, one on either Radeon VII or Vega64

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 19, 2020 6:29 pm
by Neil-B
... but on another machine we have been told of failure rates of 30 out of 37 for p13421 and 8 out of 9 WUs for p13423: 9 WUs !!!! ... and those errors are being assigned to the Project WUs, and that level of errors is seemingly being accepted as normal, with everyone happy for the machine to just keep fast-erroring WUs en masse.

Heck, if everyone is fine with machines failing at this rate, as declared earlier in this thread, and continuing to do so when others such as yours or mine have minimal failure rates, then fine. I've tried to make the point about potentially overlooking perfectly good WUs due to this issue ... I guess if the Project Owner is actually happy to have this level of failures from a single machine (at least 80%), then far be it from me to argue.
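
For reference, the percentages behind the counts quoted in this exchange can be worked out directly. A minimal sketch using only the counts posted above (2 of 77 for 13422 across three GPUs; 30 of 37 for p13421 and 8 of 9 for p13423 on the other machine); the dictionary labels are just descriptive names chosen here.

```python
# Failed / total WU counts as posted in this thread.
reports = {
    "p13422, three GPUs": (2, 77),
    "p13421, other machine": (30, 37),
    "p13423, other machine": (8, 9),
}


def failure_rate(failed: int, total: int) -> float:
    """Fraction of returned WUs that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


rates = {label: failure_rate(*counts) for label, counts in reports.items()}

for label, rate in rates.items():
    print(f"{label}: {rate:.1%}")
```

That works out to roughly 2.6%, 81% and 89%, consistent with the figure of at least 80% for that machine.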

I'll wind my neck in and simply ignore the absurdity of this scenario.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 19, 2020 7:57 pm
by muziqaz
A single AMD GPU is 0.00001% of the horsepower in the sea of nVidia GPUs ;) If you have one machine which fails constantly but 10000 other machines which fold the same project stably, would you go out of your way to halt the project (which needs to be finished as soon as possible)? Now if we had AMD actively involved in developing fahcore_22 and debugging what is most likely their own OpenCL mess, that would be insanely helpful and less time consuming. Fah devs are an extremely limited resource with their own priorities ;)

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 19, 2020 8:47 pm
by ThWuensche
muziqaz wrote: Now if we had AMD actively involved in developing fahcore_22 and debugging what is most likely their own OpenCL mess, that would be insanely helpful and less time consuming. Fah devs are an extremely limited resource with their own priorities ;)
Of course I'm repeating myself, but as FAH devs are a very valuable and limited resource, they should mostly stick to the core development (in the sense of science) and leave at least part of the debugging to others, which implies providing the source. Could AMD even be actively involved in debugging? Would they get access to the source to find out what triggers failures, even if those are caused by weaknesses in their OpenCL stack? Who else could get the source and help in debugging? I'm aware that JohnChodera is really listening and active in getting things solved, but third-party help would probably speed things up. In a closed company project you can say "We limit ourselves to this and that hardware to reduce compatibility issues", but in a project relying on the contributions of volunteers spread around the world, the problems of contributors need to be taken seriously. If you start to say "We don't care, it's only a small number of contributors, so not worthwhile to deal with", it will hurt the project as a whole.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Wed Aug 19, 2020 9:24 pm
by ViTe
Neil-B wrote:... and high rates of failure on 13421 (30 of 37 failed) and 13423 (7 of 8 failed) on the same rig ... that doesn't just feel like an issue with the 16600 project as far as that rig is concerned ... yes the 34 of 38 failures on 16600 may be down to an issue with the project but with the wider failures it feels like a rig issue or possibly an incompatible core to rig issue
That might be interesting. 13423 WUs have never failed on my machine. For the last 2-3 days it was the only project I got, and 100% of them completed successfully.
Dedicated machine, Win7 64bit, AMD RX570 4gb (not overclocked), Adrenalin 20.2.2, Client ver. 7.6.9, OpenCL 2.0 AMD-APP Driver 3004.8

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 20, 2020 7:07 am
by gunnarre
ThWuensche wrote:Who else could get the source and help in debugging?
The folding cores (the part that runs on the GPU) are already based on open source projects, and if I understand correctly they're planning to make the whole client open source. But if AMD or Apple were interested in helping out by improving their drivers or even making a folding core for Metal, then closed source isn't a hindrance to that.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 20, 2020 8:45 am
by muziqaz
ThWuensche wrote:
muziqaz wrote: Now if we had AMD actively involved in developing fahcore_22 and debugging what is most likely their own OpenCL mess, that would be insanely helpful and less time consuming. Fah devs are an extremely limited resource with their own priorities ;)
Of course I'm repeating myself, but as FAH devs are a very valuable and limited resource, they should mostly stick to the core development (in the sense of science) and leave at least part of the debugging to others, which implies providing the source. Could AMD even be actively involved in debugging? Would they get access to the source to find out what triggers failures, even if those are caused by weaknesses in their OpenCL stack? Who else could get the source and help in debugging? I'm aware that JohnChodera is really listening and active in getting things solved, but third-party help would probably speed things up. In a closed company project you can say "We limit ourselves to this and that hardware to reduce compatibility issues", but in a project relying on the contributions of volunteers spread around the world, the problems of contributors need to be taken seriously. If you start to say "We don't care, it's only a small number of contributors, so not worthwhile to deal with", it will hurt the project as a whole.
nVidia is doing that.
FAH dev creates fahcore > nVidia rep takes that core and runs it through their hardware in their lab with all their driver profilers and tools > driver team either optimises the drivers for the fahcore, or gives suggestions/submits patches of code to fah devs to improve fahcore.
A hardware vendor does not need to have the source code in order to optimise for it.
I know how much nVidia is involved, and I just don't see the same involvement from AMD, not even close, which is a shame, as their hardware was always very strong in pure compute tasks.
Also, fah devs mainly have nVidia hardware, as far as I know. I do not believe there are any AMD GPUs in their possession. At least we can be content that AMD CPUs punched through the Intel wall when it comes to fah.

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 20, 2020 10:32 am
by NormalDiffusion
muziqaz wrote:Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU :)
Yep, still getting 16600 on Radeon VII (2nd Gen Vega) as of today (all times UTC):
Machine 1:
- 19.08.2020 - 20:55
- 19.08.2020 - 22:43
- 20.08.2020 - 10:2x

Machine 2:
- 19.08.2020 - 13:06
- 19.08.2020 - 13:11
- 19.08.2020 - 13:27

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 20, 2020 10:39 am
by muziqaz
NormalDiffusion wrote:
muziqaz wrote:Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU :)
Yep, still getting 16600 on Radeon VII (2nd Gen Vega) as of today (all times UTC):
Machine 1:
- 19.08.2020 - 20:55
- 19.08.2020 - 22:43
- 20.08.2020 - 10:2x

Machine 2:
- 19.08.2020 - 13:06
- 19.08.2020 - 13:11
- 19.08.2020 - 13:27
Thanks, we'll try other means

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 20, 2020 10:42 am
by NormalDiffusion
muziqaz wrote:
Thanks, we'll try other means
But it's a lot less than before! :D

Re: 16600 consistently crashing on AMD Radeon VII

Posted: Thu Aug 20, 2020 10:45 am
by muziqaz
NormalDiffusion wrote:
muziqaz wrote:
Thanks, we'll try other means
But it's a lot less than before! :D
That's not good enough. It was set to exclude everything but Navi. Apparently the setting failed :D