Re: AMD GPU Error sortShortList on some projects
Posted: Sat Apr 18, 2020 4:25 pm
by Jan
muziqaz wrote:If one person reports a faulty WU, we question that person's hardware; if two or more return the same WU as faulty, we start questioning the WU
Maybe a bit off topic - but this WU is already in generation 50. How can it be faulty? Is my understanding of the whole process still lacking?
Re: AMD GPU Error sortShortList on some projects
Posted: Sat Apr 18, 2020 4:55 pm
by Simplex0
Jan wrote:Client-type advanced will not fix any errors. It will simply make your client look for advanced WUs (which are WUs that just made it out of beta testing) in addition to "normal" WUs, and that's it. AFAIK.
muziqaz might have a point, as this WU has been returned 2 or 3 times as faulty. Have you had other WUs on your GPUs so far/since then?
Well, my hope was that using the 'advanced' setting would result in fewer errors, because it would download more of the newer and better-coded applications, as indicated by Neil-B earlier in this thread, but in my case it did not help.
I checked the logs today from my computers running the GTX 1070 and RTX 2080 and found no errors on work units running on those cards.
I will not run any more folding on my Radeon cards for a while. I will check back on the forum in a few weeks to see if the problem is still there.
Re: AMD GPU Error sortShortList on some projects
Posted: Sat Apr 18, 2020 5:13 pm
by Neil-B
Simplex0 wrote:… as indicated by Neil-B earlier in this thread but in my case it did not help.
.. tbh, I wasn't recommending that as a solution, simply trying to explain to a previous poster one reason why they might be seeing their issue when folding FAH but not when folding ADV … Apologies if it came across as though I was promoting this as a solution.
Re: AMD GPU Error sortShortList on some projects
Posted: Sat Apr 18, 2020 5:24 pm
by muziqaz
Jan wrote:muziqaz wrote:If one person reports a faulty WU, we question that person's hardware; if two or more return the same WU as faulty, we start questioning the WU
Maybe a bit off topic - but this WU is already in generation 50. How can it be faulty? Is my understanding of the whole process still lacking?
The project is in gen 50; a WU is a single Work Unit you download to process.
That particular WU is most likely faulty, but that does not mean the whole project, or even the generation, is faulty as well.
Simplex0 wrote:
I will not run any more folding on my Radeon cards for a while. I will check back on the forum in a few weeks to see if the problem is still there.
It is your choice of course, but in my opinion this solution is too drastic. The type of error you are encountering is not very frequent, and is independent of the type of GPU the folder is running. Currently these random failed WUs are known and acceptable. Devs are working on better handling of these errors, though.
Re: AMD GPU Error sortShortList on some projects
Posted: Sat Apr 18, 2020 5:31 pm
by Jan
muziqaz wrote:The project is in gen 50; a WU is a single Work Unit you download to process.
That particular WU is most likely faulty, but that does not mean the whole project, or even the generation, is faulty as well.
Sure. I just didn't think the new WUs after so many generations could still (or rather: newly) be faulty. I probably don't understand the generating process of these WUs well enough. And now I'm done derailing this thread.
Re: AMD GPU Error sortShortList on some projects
Posted: Sat Apr 18, 2020 5:33 pm
by Simplex0
Neil-B wrote:Simplex0 wrote:… as indicated by Neil-B earlier in this thread but in my case it did not help.
.. tbh, I wasn't recommending that as a solution, simply trying to explain to a previous poster one reason why they might be seeing their issue when folding FAH but not when folding ADV … Apologies if it came across as though I was promoting this as a solution.
No problem. The fact is that sam6861 observed a reduction in errors after using this setting, so it was worth trying.
It is usually like that: you observe a change and come up with an assumption about WHY it happened.
That can eventually turn out to be wrong, but it was still a plausible explanation at the time.
Re: AMD GPU Error sortShortList on some projects
Posted: Sun Apr 19, 2020 9:28 am
by Simplex0
muziqaz wrote:Jan wrote:muziqaz wrote:If one person reports a faulty WU, we question that person's hardware; if two or more return the same WU as faulty, we start questioning the WU
It is your choice of course, but in my opinion this solution is too drastic. The type of error you are encountering is not very frequent, and is independent of the type of GPU the folder is running. Currently these random failed WUs are known and acceptable. Devs are working on better handling of these errors, though.
The fact is that this type of error was very frequent on my R9 290 cards and close to nonexistent on my Nvidia cards. I have observed a lot of these errors on my AMD cards lately and none on my Nvidia cards.
I am wondering whether this type of work unit is sent more frequently to AMD cards specifically. I will try to dig a little deeper next week.
For now I can say that in the log files covering 15 days on my computer running Nvidia cards I have 0 cases of Bad state detected, BAD WORK UNIT(114=0x72).
On the computer with AMD R9 290 cards, the log files covering 9 days show 8 work units that resulted in Bad state detected, BAD WORK UNIT(114=0x72).
Thank you all for your support
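For anyone who wants to repeat this kind of comparison, here is a minimal Python sketch of the tally, not a definitive tool. It assumes only that each failure shows up as a log line containing the text "BAD WORK UNIT", as quoted in this post; the function names are hypothetical, and the 8-in-9-days and 0-in-15-days figures are the ones reported in this post.

```python
def count_bad_work_units(log_text: str, marker: str = "BAD WORK UNIT") -> int:
    """Count log lines that contain the marker string."""
    return sum(1 for line in log_text.splitlines() if marker in line)


def failures_per_day(failures: int, days: float) -> float:
    """Average number of failed work units per day of logging."""
    if days <= 0:
        raise ValueError("days must be positive")
    return failures / days


# Figures quoted in the post above (not measured by this script).
amd_rate = failures_per_day(8, 9)      # machine with AMD R9 290 cards
nvidia_rate = failures_per_day(0, 15)  # machine with Nvidia cards
print(f"AMD: {amd_rate:.2f}/day, Nvidia: {nvidia_rate:.2f}/day")
```

Normalising by the number of days logged matters here because the two machines were observed for different periods (9 days versus 15 days).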
Re: AMD GPU Error sortShortList on some projects
Posted: Sun Apr 19, 2020 9:52 am
by muziqaz
Simplex0 wrote:
The fact is that this type of error was very frequent on my R9 290 cards and close to nonexistent on my Nvidia cards. I have observed a lot of these errors on my AMD cards lately and none on my Nvidia cards.
I am wondering whether this type of work unit is sent more frequently to AMD cards specifically. I will try to dig a little deeper next week.
For now I can say that in the log files covering 15 days on my computer running Nvidia cards I have 0 cases of Bad state detected, BAD WORK UNIT(114=0x72).
On the computer with AMD R9 290 cards, the log files covering 9 days show 8 work units that resulted in Bad state detected, BAD WORK UNIT(114=0x72).
Thank you all for your support
So maybe it is time to clean the fans of the card, and maybe reduce the clocks.
The 290 is a VERY old card; it's possible that the VRMs are on their last legs.
Re: AMD GPU Error sortShortList on some projects
Posted: Thu Apr 23, 2020 10:27 am
by Simplex0
muziqaz wrote:Simplex0 wrote:
The fact is that this type of error was very frequent on my R9 290 cards and close to nonexistent on my Nvidia cards. I have observed a lot of these errors on my AMD cards lately and none on my Nvidia cards.
I am wondering whether this type of work unit is sent more frequently to AMD cards specifically. I will try to dig a little deeper next week.
For now I can say that in the log files covering 15 days on my computer running Nvidia cards I have 0 cases of Bad state detected, BAD WORK UNIT(114=0x72).
On the computer with AMD R9 290 cards, the log files covering 9 days show 8 work units that resulted in Bad state detected, BAD WORK UNIT(114=0x72).
Thank you all for your support
So maybe it is time to clean the fans of the card, and maybe reduce the clocks.
The 290 is a VERY old card; it's possible that the VRMs are on their last legs.
The computer is fully water cooled with a custom loop, and the GPU and VRM temperatures on my graphics cards stay under 65 °C at all times.
You are right that these are indeed very old graphics cards, and that seems to be the problem: after reducing the GPU clock to 80% on all cards, everything works just fine now.
Thank you for your support muziqaz.
Re: AMD GPU Error sortShortList on some projects
Posted: Tue May 05, 2020 12:18 am
by bruce
The errors with the keyword "sortShortList" are unique to AMD GPUs and simply do not occur on nV hardware.
"Bad state detected, BAD WORK UNIT(114=0x72)" covers that case as well as several other possibilities across both brands of GPUs. If you eliminate the sortShortList errors, are the Bad State errors about the same?
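One rough way to answer that question from a log is sketched below in Python. It is only an illustration under loud assumptions: that a "sortShortList" message, when present, is logged before the "BAD WORK UNIT" line for the same work unit, and that the log covers a single GPU slot so entries are not interleaved. The function name and the sample log lines are hypothetical, not actual client output.

```python
def split_bad_states(log_text: str) -> dict:
    """Split BAD WORK UNIT events by whether 'sortShortList' was seen
    since the previous BAD WORK UNIT event (assumed log ordering)."""
    counts = {"with_sortShortList": 0, "other": 0}
    seen_sort_error = False
    for line in log_text.splitlines():
        if "sortShortList" in line:
            seen_sort_error = True
        if "BAD WORK UNIT" in line:
            key = "with_sortShortList" if seen_sort_error else "other"
            counts[key] += 1
            seen_sort_error = False  # reset for the next work unit
    return counts


# Hypothetical sample log, for illustration only.
sample = "\n".join([
    "error mentioning sortShortList",
    "Bad state detected, BAD WORK UNIT(114=0x72)",
    "some unrelated line",
    "Bad state detected, BAD WORK UNIT(114=0x72)",
])
print(split_bad_states(sample))
```

Comparing the "other" count between the AMD and Nvidia machines would then show whether the Bad State errors are about the same once the sortShortList cases are set aside.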
Re: AMD GPU Error sortShortList on some projects
Posted: Tue May 05, 2020 2:29 am
by JohnChodera
Just a quick note here: We've fixed this issue in OpenMM:
https://github.com/openmm/openmm/pull/2631
We're just working on backporting the fix into core22.
Thanks for your patience!
~ John Chodera // MSKCC