A trio of bad WU (11713 & 11432)

Posted: Sun Jan 14, 2018 7:57 pm
by ikek
Hello,

I have been folding for a couple of weeks and everything has been running more or less smoothly. Once yesterday and twice today I had a WU go bad, and I am wondering if there is any explanation for this peculiar behaviour. All the bad WUs occurred during normal (non-GPU-intensive) computer use, like writing this post. It is somewhat annoying, as they failed at 43, 66 and 17 percent respectively. This hurts PPD.

Below are excerpts from the log. Added is date and project number. I will link full log when I am permitted by the forum to do so.

13/01/18
19:39:43:WU01:FS01:0x21:Project: 11713 (Run 16, Clone 69, Gen 0)

20:49:53:WU01:FS01:0x21:Completed 3225000 out of 7500000 steps (43%)
20:50:22:WU01:FS01:0x21:ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-5)
20:50:22:WU01:FS01:0x21:Saving result file logfile_01.txt
20:50:22:WU01:FS01:0x21:Saving result file log.txt
20:50:22:WU01:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
20:50:24:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
20:50:24:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11713 run:16 clone:69 gen:0 core:0x21 unit:0x000000008ca304e75a5a5225898be098

14/01/18
17:06:43:WU01:FS01:0x21:Project: 11432 (Run 0, Clone 907, Gen 3)

19:10:21:WU01:FS01:0x21:Completed 3300000 out of 5000000 steps (66%)
19:10:38:WU01:FS01:0x21:ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-5)
19:10:38:WU01:FS01:0x21:Saving result file logfile_01.txt
19:10:38:WU01:FS01:0x21:Saving result file log.txt
19:10:38:WU01:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
19:10:40:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
19:10:40:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:11432 run:0 clone:907 gen:3 core:0x21 unit:0x000000038ca304e85a5a6c408d14cabf

19:10:52:WU00:FS01:0x21:Project: 11432 (Run 1, Clone 628, Gen 1)
19:10:52:WU00:FS01:0x21:Unit: 0x000000028ca304e85a5a6c657bf2cf7c

19:42:55:WU00:FS01:0x21:Completed 850000 out of 5000000 steps (17%)
19:44:12:WU00:FS01:0x21:ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-5)
19:44:13:WU00:FS01:0x21:Saving result file logfile_01.txt
19:44:13:WU00:FS01:0x21:Saving result file log.txt
19:44:13:WU00:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
19:44:14:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
19:44:14:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11432 run:1 clone:628 gen:1 core:0x21 unit:0x000000028ca304e85a5a6c657bf2cf7c

Re: A trio of bad WU (11713 & 11432)

Posted: Mon Jan 15, 2018 1:43 am
by bruce
The message clEnqueueReadBuffer indicates that your GPU reported an error. The code -5 means CL_OUT_OF_RESOURCES. In practice, that code frequently means that the GPU was reset by the driver. (You'll probably find a report of the driver reset in the event log.)

Continuing, the driver reset generally means that the GPU is overheating or is unstable due to overclocking. Reducing the overclock settings generally cures either one of them, but adding a case fan or increasing the fan speed might also take care of it.
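
For anyone who wants to decode such a line without digging through the manual, here is a minimal Python sketch. It assumes the log line ends with the OpenCL call name followed by the code in parentheses, as in the excerpts above. The table is only a small subset of the codes defined in the Khronos cl.h header, and the function name explain_opencl_error is made up for this example.

```python
import re

# Subset of the OpenCL runtime error codes defined in the Khronos CL/cl.h header.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Matches e.g. "clEnqueueReadBuffer (-5)" inside a FAH log line.
CL_CALL = re.compile(r"(cl[A-Za-z]+) \((-?\d+)\)")


def explain_opencl_error(log_line):
    """Return (opencl_call, code, name) for a log line, or None if nothing matches."""
    match = CL_CALL.search(log_line)
    if match is None:
        return None
    code = int(match.group(2))
    name = OPENCL_ERRORS.get(code, "unknown (not in this subset)")
    return match.group(1), code, name


line = ("20:50:22:WU01:FS01:0x21:ERROR:exception: Error downloading array "
        "energyBuffer: clEnqueueReadBuffer (-5)")
print(explain_opencl_error(line))
```

The name only tells you what OpenCL reported. Whether the underlying cause is a driver reset, heat or an unstable overclock still has to be checked on the machine itself.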

Re: A trio of bad WU (11713 & 11432)

Posted: Mon Jan 15, 2018 5:09 am
by Kuno
Bruce really knows the error codes well. I wish there was a wiki somewhere that we could go to, to find out what these errors mean.

Re: A trio of bad WU (11713 & 11432)

Posted: Mon Jan 15, 2018 10:49 am
by ikek
Kuno wrote:Bruce really knows the error codes well. I wish there was a wiki somewhere that we could go to, to find out what these errors mean.
I second that: there ought to be a readily available overview of typical errors and what commonly causes them. I googled but could not find any reliable information, hence this thread.

Temps are not an issue; they are 65-70C (GPU). It is more likely the overclock. Then again, it has been running at 2037 (core clock) and 2050 for a couple of weeks with no issues. This includes both the machine running unattended and the folding client running at full load while the computer is used for light tasks.

I have downclocked the GPU and it seems FAH-stable. Time will tell if it needs adjustment.

Re: A trio of bad WU (11713 & 11432)

Posted: Mon Jan 15, 2018 7:33 pm
by kiore
Some work units, in my experience, are less tolerant of overclocking, so even when a setting has been stable for months a new work unit can fail on it.

Re: A trio of bad WU (11713 & 11432)

Posted: Mon Jan 15, 2018 7:37 pm
by bruce
OpenCL error codes can be found in several places including the official OpenCL site. https://www.khronos.org/registry/OpenCL ... rrors.html You have to dig a little to find what you're looking for in their manual.

Actually, this list is pretty useful: https://tersetalk.wordpress.com/2012/04 ... ror-codes/

Each software component used by FAH can issue its own error messages: they can come from FAH itself, from OpenCL, from your OS, etc. I have not found a comprehensive list. (I often have to google the error.)

Re: A trio of bad WU (11713 & 11432)

Posted: Mon Jan 15, 2018 7:43 pm
by bruce
kiore wrote:Some work units, in my experience, are less tolerant of over-clocking, so even when a setting has been stable for months a new work unit can fail on it.
That's always true when you over-clock. First you have to find a group of benchmarks that use a maximum amount of each specific computing resource. Since there are additional variations possible, you have to add a safety margin so that in the worst case scenario stability is maintained.

In your case, your over-clock was stable for some WUs that use less than all of the resources but when you happen to get a more efficient project, it exceeds whatever margin you have allowed.
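
One way to see which projects exceed your margin is to tally the faulty results per project from the client log. The sketch below is only an illustration: it assumes the failed units show up in "Sending unit results" lines with error:FAULTY, as in the excerpts earlier in this thread, and the function name faulty_by_project is hypothetical.

```python
import re
from collections import Counter

# Matches the "Sending unit results" line the client writes for a failed WU.
FAULTY = re.compile(r"error:FAULTY project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)")


def faulty_by_project(log_lines):
    """Count faulty work units per project number."""
    counts = Counter()
    for line in log_lines:
        match = FAULTY.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Lines copied from the log excerpts in the first post.
sample_log = [
    "20:49:53:WU01:FS01:0x21:Completed 3225000 out of 7500000 steps (43%)",
    "20:50:24:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY "
    "project:11713 run:16 clone:69 gen:0 core:0x21 "
    "unit:0x000000008ca304e75a5a5225898be098",
    "19:10:40:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY "
    "project:11432 run:0 clone:907 gen:3 core:0x21 "
    "unit:0x000000038ca304e85a5a6c408d14cabf",
    "19:44:14:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY "
    "project:11432 run:1 clone:628 gen:1 core:0x21 "
    "unit:0x000000028ca304e85a5a6c657bf2cf7c",
]

for project, failures in faulty_by_project(sample_log).most_common():
    print(project, failures)
```

If one project family accounts for most of the failures while others complete fine at the same clocks, that points at the margin being too thin for that family rather than at a general hardware fault.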

Re: A trio of bad WU (11713 & 11432)

Posted: Tue Jan 16, 2018 1:44 pm
by ikek
I would like to thank everyone for their contributions.

After downclocking the core clock (by 37-50), everything seems stable. The odd thing is that, from memory, the system completed several of the heavy WUs without issue (nothing in the log) when running at 2037 while the computer was used in the same light manner. Had it hit a wall on the first of these, it would have been more apparent. Fortunately it was a user error, which is easily remedied.

I'll be daring enough to suggest that if the contents of bruce's post (two up), or something similar, were put into its own thread and stickied, it could prevent a thread or two like mine. It never even crossed my mind to look into OpenCL error codes, and I think many people will come here to look for information pertaining to error reporting in the FAHControl log.

with regards

Re: A trio of bad WU (11713 & 11432)

Posted: Wed Jan 17, 2018 12:52 am
by bruce
It's evident that some people can complete these projects and are pleased with the results.
Subject: Why are projects 9415 and 9414 such low PPD?
Luscious wrote:Back with an update here and it's evident projects 11432 and 11713 are putting my cards back at their previous performance level.

No system changes made whatsoever.
I guess 1171x are just a bit too efficient for your overclock and 941x are just enough LESS efficient (with correspondingly lower PPD) to run on everybody's machine.

Re: A trio of bad WU (11713 & 11432)

Posted: Wed Jan 17, 2018 11:20 pm
by sticks435
I'm having the same issue with 11432 and 11431. 9941x and 1171x will fold just fine on my standard folding overclock, but as soon as I hit a 1143x unit, it will fold somewhere between 0 and 10% and then fail. I removed all manual overclock on my 980Ti Hybrid and am using the Nvidia stock settings, and I'm pretty sure it still failed (I'm at work at the moment so can't remember/verify). The out-of-box EVGA boost clock is 1228 and mine runs at 1341 with default settings. I will have to inspect my logs when I get home and see what they say.

EDIT: Checked my logs, I have the exact same error as OP. Doesn't look like I have tried to fold one of these units since reverting to default settings.