
Re: Failing units, low ppd, and returned units.

Posted: Sun Nov 22, 2015 3:42 am
by Scarlet-Tech
Just got home, I am going to go through the log files, once I find them, and I will upload the errors or any other issues I may have had while I was gone.

Re: Failing units, low ppd, and returned units.

Posted: Mon Nov 23, 2015 2:16 am
by Scarlet-Tech
"Vijay Pande
We’ve discovered that the new FahCore_21 is producing more errors than we consider acceptable for some clients. The error rate seems to depend on several factors, but most noticeable is that it doesn’t work well with second-generation Maxwell GPUs. A few projects that have made their way through Advanced testing have been distributed to everyone under the default “FAH” client-type setting. To allow donors to limit this exposure, those projects have been reclassified as “Advanced”, which is appropriate for a FahCore that is still under development.
As has always been the case, the “Advanced” setting will give you access to newer projects which may have a higher error rate. It is our intention to provide only the safest assignments with the default setting or you can choose to configure your system to run these advanced projects depending on how frequently you encounter these errors.
These conditions are expected to improve as new projects, new versions of that FahCore, or new versions of the drivers incorporate whatever fixes are required. In the meantime, Work Units which are completed successfully allow scientific research to progress toward even more challenging projects than we’ve done so far."


https://folding.stanford.edu/home/issue ... fahcore21/


Everyone should thank the Pande Group for looking into this. This thread may not have helped much, but Stanford obviously sees the same problem. I guess it can't just be chalked up to faulty hardware or a bad overclock.

Re: Failing units, low ppd, and returned units.

Posted: Mon Nov 23, 2015 4:01 am
by 7im
I applaud them as well. But what it chalks up to remains to be seen. This would not be the first time that the Pande Group had to program around a hardware problem.

Re: Failing units, low ppd, and returned units.

Posted: Mon Nov 23, 2015 4:39 am
by Grandpa_01
It is a hardware problem with the Nvidia 9xx cards, and I am pretty sure Nvidia is well aware of it. They work very well for what they were made for, but not so well for direct compute software. Why do you think Nvidia forces the 9xx Maxwell GPUs to run in the P2 state, which lowers the memory speed to 6008 MHz, while all other generations of Nvidia GPUs run at their default P0 settings?

They appear to be doing something about it with the 9xx Ti cards: on those they only force a P2 memory speed of 6608 MHz, and I have not had any problems with my 980 Ti Classifieds. I cannot say that about my 980 Classifieds or my 970. I do have two 980 Classifieds on which I changed the BIOS and lowered the P2 memory speed, and they now run flawlessly on Core 21 WUs. My guess is that it is a memory controller issue, although it could be a memory issue. Now, if you can explain why the things I have done make the cards run flawlessly, and why Nvidia has done the things it has done, then perhaps you can explain why you think it is something Stanford has done.
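
(For anyone who wants to check this on their own card: nvidia-smi can report the current performance state and memory clock while folding. Below is a minimal sketch in Python that simply polls it; the query fields are standard nvidia-smi ones, the 30-second interval is an arbitrary choice, and it assumes nvidia-smi is on your PATH.)

Code: Select all

import subprocess
import time

# Poll nvidia-smi for the performance state and current memory clock of
# each GPU, so you can see whether a card is held in P2 while folding.
# Assumes nvidia-smi is on the PATH (it ships with the NVIDIA driver).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,pstate,clocks.mem",
    "--format=csv,noheader",
]

def snapshot():
    """Return one CSV line per GPU: index, name, P-state, memory clock."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return result.stdout.strip().splitlines()

if __name__ == "__main__":
    # Print a reading every 30 seconds (an arbitrary interval) while folding.
    while True:
        for line in snapshot():
            print(time.strftime("%H:%M:%S"), line)
        time.sleep(30)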

And after all that I have a few more things I would like you to answer. But I would recommend you do your homework before the next round of questions.

Re: Failing units, low ppd, and returned units.

Posted: Mon Nov 23, 2015 6:30 am
by Scarlet-Tech
I don't know exactly what is going on. I don't create work units, as you can probably tell. As you can probably also tell from the other threads you have been assisting in, quite a few other people are still experiencing issues as well, even after following your instructions. You may have found a lucky sweet spot.

It is probably Nvidia making graphics cards do graphics things. But Stanford sees the issue, so answer whatever you feel like. These are graphics cards, not scientific compute cards. They are meant for video games, and we are repurposing them for a much better cause. If you have to get bent out of shape because we want to figure out what is causing the issue, feel free. I can tell you I am not an engineer and don't work at Nvidia, and I am also not a scientist working at Stanford, but I am paying out of my butt to make sure that my Nvidia hardware is supporting Stanford, and I just want it to function well.

As you also may have noticed, I just got home and I am trying to get through all of the issues my system was experiencing while I was 3k miles away.

I had always heard that people on this forum were toxic. I would rather see people be helpful, and the Pande Group has identified that there is an excess of failed units with the Maxwell architecture. That is progress toward getting to the bottom of it. If it is a hardware issue and they can identify something specific, then maybe they can look at assigning work based on specific hardware. That is also progress.

All we want is success, not anger or attempts to belittle people as they try to get to the bottom of a large number of failed work units and lost results.

Feel free to answer anything you want and can. Quite frankly, I want successful work units, and I want the work units completed so that maybe Stanford can find some good results. Failed work units, no matter who is at fault, help NO ONE! Success is a positive thing.

Re: Failing units, low ppd, and returned units.

Posted: Mon Nov 23, 2015 7:07 am
by rwh202
Grandpa_01 wrote:It is a hardware problem with the Nvidia 9xx cards, and I am pretty sure Nvidia is well aware of it. They work very well for what they were made for, but not so well for direct compute software. Why do you think Nvidia forces the 9xx Maxwell GPUs to run in the P2 state, which lowers the memory speed to 6008 MHz, while all other generations of Nvidia GPUs run at their default P0 settings?

I'm also now erring towards this being a hardware issue. I have six 970 / 980 cards, and five of them produce Bad State (BS) errors at different rates regardless of driver version. I do, however, have one 980 that has yet to produce a BS on hundreds of Core 21 WUs, again with a variety of driver versions. It is an early-production reference 980 with reference clocks, PCB, 'Titan' cooler, and backplate. I'm guessing this is the version nVidia thoroughly qualified, and it does seem to work. Does anyone else fold a purely reference card with success?

Re: Failing units, low ppd, and returned units.

Posted: Mon Nov 23, 2015 7:31 am
by bruce
If you have NOT been experiencing a heightened error rate, then keep doing whatever you've been doing.

Inasmuch as the Pande Group has decided on a temporary plan to ameliorate the error rate for those of you who HAVE been experiencing problems, let's not waste everybody's time arguing about what it might or might not be. Suffice it to say they're going to come up with something more permanent, and until then, what does it matter what we want to call it?

Re: Failing units, low ppd, and returned units.

Posted: Tue Nov 24, 2015 12:57 am
by _r2w_ben
bruce wrote:Inasmuch as the Pande Group has decided on a temporary plan to ameliorate the error rate for those of you who HAVE been experiencing problems, let's not waste everybody's time arguing about what it might or might not be. Suffice it to say they're going to come up with something more permanent, and until then, what does it matter what we want to call it?
I agree that arguing for the sake of placing blame is pointless. Continuing to discuss the issue may lead to new ideas and help to pinpoint the cause. Up to this point, the Pande Group has not announced that they have isolated the cause and are working on a definitive solution.

It's important to note that the root cause of the issue on affected hardware may influence the solution.
  1. Purely software that happens to manifest itself more frequently on Nvidia 9xx -> Pande Group/SimTK OpenMM fixes the bug in code
  2. Execution of OpenCL on Nvidia 9xx -> Nvidia driver/hardware fix or Pande Group/SimTK OpenMM workaround. If Intel can make mistakes in silicon, Nvidia could as well.
  3. Sporadic Nvidia 9xx memory controller issue -> If the rate of memory requests in the core was reduced across the board, performance would be left on the table. Is there a way to detect whether a card is affected and only detune when necessary?
  4. GPU/graphics card specific -> What does it mean when a work unit fails on one reference-clocked 970 but succeeds on another reference-clocked 970? Could the issue be external to the card, e.g. PSU quality, PCI-E timing, etc.?
I can identify with these 9xx owners' desire to find a solution. When Core 21 launched, my Radeon 6850 failed every WU it received!

Re: Failing units, low ppd, and returned units.

Posted: Tue Nov 24, 2015 2:46 am
by bruce
_r2w_ben wrote:Continuing to discuss the issue may lead to new ideas and help to pinpoint the cause. Up to this point, the Pande Group has not announced that they have isolated the cause and are working on a definitive solution.
Hmmm. I don't remember them EVER making such an announcement.

I do know they're working on something based on a lot of failures they've collected. Does that mean they have the fix(es) in hand? Not necessarily, especially since it might be more than one thing.

Re: Failing units, low ppd, and returned units.

Posted: Wed Dec 09, 2015 8:57 pm
by Scarlet-Tech
This is good news: upon investigation, there was an error in the code that was returning false errors, which shows the issue may not have been Nvidia hardware. I guess it pays to investigate both sides rather than assuming Nvidia is immediately at fault, and Stanford did just that and found the error that was giving those false positives.

Yes, there may be a slowdown, but completed work units are completed work units, no matter the ppd.

Since these were pulled for investigation, my production went back up to normal amounts, so this is overall good news for everyone.

Below is the quote explaining, for anyone who was curious, that the error wasn't hardware but coding.


by JohnChodera » Mon Dec 07, 2015.

We rolled out Core21 v0.0.14 last week.

Main features are:
* Bad State errors fixed: Improvements that drastically reduce the high rate of Bad State errors we were seeing with earlier versions of the core, especially with NVIDIA cards.
These Bad State errors were ironically caused by a couple of bugs in the code that checks the simulation for integrity every time a checkpoint file is written. The simulation integrity was solid, but there was a high false positive rate for errors. This should be greatly reduced with 0.0.14.
For NVIDIA cards, this unfortunately comes with a 10% performance regression when PME is in use, but we are compensating by rebenchmarking affected core 21 projects. We hope to release an update that will undo this performance regression soon.

* More debugging info: When Bad State errors *do* occur, we now bring back more information to help us diagnose these issues.

* Early NaN detection eliminates slowdowns: We previously had some reports of WU slowdowns that were traced to NaNs appearing in the simulation and slowing down the integration loop drastically. Previously, these were not detected until a checkpoint was written. Now, checks are performed much more frequently, hopefully eliminating these slowdowns.

* Minor improvements: We have also made a number of other minor improvements that give us a bit more flexibility and control over what simulations can be run.
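
(To illustrate the "Early NaN detection" item above: rather than only validating the state when a checkpoint is written, the coordinates can be checked for NaNs every few hundred steps so a diverged simulation is caught quickly. The sketch below is not the FahCore code, just a minimal Python/NumPy illustration of the pattern; the step counts and check interval are made up.)

Code: Select all

import numpy as np

def integrate(positions, step_fn, total_steps,
              checkpoint_every=25000, nan_check_every=500):
    """Toy integration loop showing the 'early NaN detection' idea.

    step_fn advances the positions by one step. Checkpoints are written
    rarely, but the cheap NaN/Inf check runs every few hundred steps so a
    diverged simulation is caught long before the next checkpoint.
    """
    last_good = positions.copy()
    for step in range(1, total_steps + 1):
        positions = step_fn(positions)

        # Early NaN detection: cheap, frequent check on the coordinates.
        if step % nan_check_every == 0 and not np.all(np.isfinite(positions)):
            raise RuntimeError(
                f"NaN/Inf detected at step {step}; "
                "resume from the last good checkpoint")

        # Expensive integrity check + checkpoint, done rarely.
        if step % checkpoint_every == 0:
            last_good = positions.copy()

    return positions, last_good

# Example usage with a do-nothing step function, just to show the call shape.
coords, checkpoint = integrate(np.zeros((10, 3)), lambda x: x, total_steps=1000)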

Re: Failing units, low ppd, and returned units.

Posted: Wed Mar 23, 2016 3:10 pm
by Scarlet-Tech
I know it has been 4 months since this thread died off, and that the moderators and naysayers stopped responding, but I wanted to take a moment to thank Stanford once again. I have not had a single failed unit since the last update and code correction that Stanford applied to the work units, so it goes to show that sometimes the end user is correct and the hardware is not to blame.

My work unit production has drastically increased and my PPD has been steady for months now. Thank you Stanford for taking the time to investigate the issue.

Re: Failing units, low ppd, and returned units.

Posted: Tue Apr 26, 2016 5:33 pm
by Duce H_K_
Kebast, the only thing I can advise is what I used myself: FahSpy & FahLog.
Last version 2.0.1, Jan 13 2010 (size 606 KB)
scarlet_tech wrote:15:13:02:WU02:FS01:0x21:Version 0.0.12
15:13:41:WU02:FS01:0x21:ERROR:exception: bad allocation
15:13:41:WU02:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
15:13:42:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9205 run:16 clone:52 gen:7 core:0x21

15:14:39:WU03:FS01:0x21:ERROR:exception: bad allocation
15:14:39:WU03:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
15:14:40:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:9205 run:3 clone:33 gen:9 core:0x21

20:37:20:WU02:FS02:0x21:ERROR:exception: bad allocation
20:37:20:WU02:FS02:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
20:37:20:WU02:FS02:Sending unit results: id:02 state:SEND error:FAULTY project:9206 run:0 clone:1351 gen:11 core:0x21

06:27:10:WU00:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
06:27:10:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
06:27:10:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:9207 run:0 clone:22 gen:32 core:0x21

03:56:00:WU02:FS01:0x21:ERROR:exception: bad allocation
03:56:00:WU02:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
03:56:00:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9209 run:0 clone:50 gen:15 core:0x21
13:52:40:WU02:FS01:Upload complete
13:52:40:WU02:FS01:Server responded WORK_QUIT (404)
13:52:40:WARNING:WU02:FS01:Server did not like results, dumping
Of course somebody noticed it before, but I recently saw how high the MCU (memory controller usage) was on my GTX 970 while running these jobs. Overclocking the P2-state memory from 1502 to 1850 MHz was unacceptable for project 9205.
So I concluded that all p92xx projects (except 9201) are highly sensitive to memory overclocking.

Code: Select all

17:19:05:<config>
17:19:05:  <!-- Folding Core -->
17:19:05:  <checkpoint v='6'/>
17:19:05:  <core-priority v='low'/>
17:19:05:
17:19:05:  <!-- Folding Slot Configuration -->
17:19:05:  <extra-core-args v='-forceasm -twait=80'/>
17:19:05:  <smp v='false'/>
17:19:05:
17:19:05:  <!-- HTTP Server -->
17:19:05:  <allow v='0.0.0.0/0'/>
17:19:05:  <deny v='255.255.255.255/255.255.255.255'/>
17:19:05:
17:19:05:  <!-- Network -->
17:19:05:  <proxy v=':8080'/>
17:19:05:
17:19:05:  <!-- Remote Command Server -->
17:19:05:  <command-deny-no-pass v=''/>
17:19:05:  <command-port v='7936'/>
17:19:05:  <password v='************'/>
17:19:05:
17:19:05:  <!-- Slot Control -->
17:19:05:  <pause-on-battery v='false'/>
17:19:05:  <pause-on-start v='true'/>
17:19:05:  <power v='full'/>
17:19:05:
17:19:05:  <!-- User Information -->
17:19:05:  <passkey v='********************************'/>
17:19:05:  <team v='47191'/>
17:19:05:  <user v='Duce-HK_ALL_1P1LY4fT4YkPDEhNRFFkCQcTxbc2gr1UoB'/>
17:19:05:
17:19:05:  <!-- Work Unit Control -->
17:19:05:  <max-units v='16'/>
17:19:05:  <next-unit-percentage v='100'/>
17:19:05:
17:19:05:  <!-- Folding Slots -->
17:19:05:  <slot id='0' type='GPU'>
17:19:05:    <opencl-index v='1'/>
17:19:05:  </slot>
17:19:05:</config>

20:09:57:WU01:FS00:Requesting new work unit for slot 00: RUNNING gpu:1:GM204 [GeForce GTX 970] from 171.64.65.104
20:09:58:WU01:FS00:Connecting to 171.64.65.104:8080
20:09:59:WU01:FS00:Downloading 80.07MiB
20:10:05:WU01:FS00:Download 2.81%
20:10:10:WU00:FS00:Upload 20.51%
20:10:11:WU01:FS00:Download 6.24%
20:10:17:WU01:FS00:Download 8.35%
20:10:23:WU01:FS00:Download 14.75%
20:10:29:WU01:FS00:Download 21.93%
20:10:35:WU01:FS00:Download 25.29%
20:10:41:WU01:FS00:Download 28.80%
20:10:47:WU01:FS00:Download 32.16%
20:10:53:WU01:FS00:Download 35.52%
20:10:59:WU01:FS00:Download 39.11%
20:11:05:WU01:FS00:Download 42.62%
20:11:11:WU01:FS00:Download 46.68%
20:11:17:WU01:FS00:Download 51.36%
20:11:23:WU01:FS00:Download 54.49%
20:11:29:WU01:FS00:Download 58.08%
20:11:35:WU01:FS00:Download 61.82%
20:11:41:WU01:FS00:Download 65.96%
20:11:47:WU01:FS00:Download 70.49%
20:11:53:WU01:FS00:Download 74.94%
20:11:59:WU01:FS00:Download 79.39%
20:12:05:WU01:FS00:Download 83.91%
20:12:11:WU01:FS00:Download 88.36%
20:12:17:WU01:FS00:Download 95.70%
20:12:19:WU01:FS00:Download complete
20:12:19:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9205 run:30 clone:0 gen:18 core:0x21 unit:0x00000021664f2dd056fb26ad6039adc3
20:12:19:WU01:FS00:Starting
20:12:19:WU01:FS00:Running FahCore: C:\NO_UAC\FAHClient7/FAHCoreWrapper.exe E:/Docbase/FaH-workdir/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 01 -suffix 01 -version 704 -lifeline 4452 -checkpoint 6 -gpu 1 -gpu-vendor nvidia -forceasm -twait=80
20:12:19:WU01:FS00:Started FahCore on PID 2572
20:12:20:WU01:FS00:Core PID:540
20:12:20:WU01:FS00:FahCore 0x21 started
20:12:21:WU01:FS00:0x21:*********************** Log Started 2016-04-23T20:12:21Z ***********************
20:12:21:WU01:FS00:0x21:Project: 9205 (Run 30, Clone 0, Gen 18)
20:12:21:WU01:FS00:0x21:Unit: 0x00000021664f2dd056fb26ad6039adc3
20:12:21:WU01:FS00:0x21:CPU: 0x00000000000000000000000000000000
20:12:21:WU01:FS00:0x21:Machine: 0
20:12:21:WU01:FS00:0x21:Reading tar file core.xml
20:12:21:WU01:FS00:0x21:Reading tar file integrator.xml
20:12:21:WU01:FS00:0x21:Reading tar file state.xml
20:12:21:WU01:FS00:0x21:Reading tar file system.xml
20:12:22:WU01:FS00:0x21:Digital signatures verified
20:12:22:WU01:FS00:0x21:Folding@home GPU Core21 Folding@home Core
20:12:22:WU01:FS00:0x21:Version 0.0.17
20:13:02:WU01:FS00:0x21:Completed 0 out of 2500000 steps (0%)
20:13:02:WU01:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
20:18:27:WU01:FS00:0x21:Completed 25000 out of 2500000 steps (1%)
20:22:42:WU01:FS00:0x21:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
20:33:32:WU01:FS00:0x21:Completed 50000 out of 2500000 steps (2%)
20:38:58:WU01:FS00:0x21:Completed 75000 out of 2500000 steps (3%)
20:44:23:WU01:FS00:0x21:Completed 100000 out of 2500000 steps (4%)
20:49:54:WU01:FS00:0x21:Completed 125000 out of 2500000 steps (5%)
20:55:20:WU01:FS00:0x21:Completed 150000 out of 2500000 steps (6%)
21:00:46:WU01:FS00:0x21:Completed 175000 out of 2500000 steps (7%)
21:06:12:WU01:FS00:0x21:Completed 200000 out of 2500000 steps (8%)
21:11:46:WU01:FS00:0x21:Completed 225000 out of 2500000 steps (9%)
******************************* Date: 2016-04-23 *******************************
21:17:12:WU01:FS00:0x21:Completed 250000 out of 2500000 steps (10%)
21:22:38:WU01:FS00:0x21:Completed 275000 out of 2500000 steps (11%)
21:28:04:WU01:FS00:0x21:Completed 300000 out of 2500000 steps (12%)
21:29:26:WU01:FS00:0x21:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
21:34:54:WU01:FS00:0x21:Completed 325000 out of 2500000 steps (13%)
21:40:21:WU01:FS00:0x21:Completed 350000 out of 2500000 steps (14%)
21:45:50:WU01:FS00:0x21:Completed 375000 out of 2500000 steps (15%)
21:51:19:WU01:FS00:0x21:Completed 400000 out of 2500000 steps (16%)
21:56:56:WU01:FS00:0x21:Completed 425000 out of 2500000 steps (17%)
22:02:27:WU01:FS00:0x21:Completed 450000 out of 2500000 steps (18%)
22:07:56:WU01:FS00:0x21:Completed 475000 out of 2500000 steps (19%)
22:13:25:WU01:FS00:0x21:Completed 500000 out of 2500000 steps (20%)
22:19:06:WU01:FS00:0x21:Completed 525000 out of 2500000 steps (21%)
22:24:35:WU01:FS00:0x21:Completed 550000 out of 2500000 steps (22%)
22:30:05:WU01:FS00:0x21:Completed 575000 out of 2500000 steps (23%)
22:35:35:WU01:FS00:0x21:Completed 600000 out of 2500000 steps (24%)
22:38:33:WU01:FS00:0x21:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
22:38:33:WU01:FS00:0x21:ERROR:Max Retries Reached
22:38:33:WU01:FS00:0x21:Saving result file logfile_01.txt
22:38:33:WU01:FS00:0x21:Saving result file badstate-0.xml
22:38:34:WU01:FS00:0x21:Saving result file badstate-1.xml
22:38:34:WU01:FS00:0x21:Saving result file badstate-2.xml
22:38:34:WU01:FS00:0x21:Saving result file log.txt
22:38:34:WU01:FS00:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
22:38:36:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
22:38:36:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:9205 run:30 clone:0 gen:18 core:0x21 unit:0x00000021664f2dd056fb26ad6039adc3
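
(For anyone who wants to put numbers on this from their own machine: failures like the ones above are easy to tally per project straight from the FAHClient log, by counting "Bad State detected" retries and results sent back as FAULTY. A minimal Python sketch follows; the log path is just an example, so point it at your own log.txt.)

Code: Select all

import re
from collections import Counter

# Tally failures per project from an FAHClient log (format as shown above).
# The path is an example; point it at your own log.txt.
LOG_PATH = "log.txt"

faulty_by_project = Counter()   # WUs returned with error:FAULTY, per project
bad_state_retries = 0           # "Bad State detected" retry attempts

faulty_re = re.compile(r"error:FAULTY\s+project:(\d+)")

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Bad State detected" in line:
            bad_state_retries += 1
        match = faulty_re.search(line)
        if match:
            faulty_by_project[match.group(1)] += 1

print("Bad State retries:", bad_state_retries)
for project, count in faulty_by_project.most_common():
    print(f"project {project}: {count} FAULTY result(s)")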

Re: Failing units, low ppd, and returned units.

Posted: Tue Jul 19, 2016 1:44 am
by Phildesign
Am I missing something? Low PPD again.

Running Folding by itself, and I made sure there is at least one CPU core spare for each GPU; when I add BOINC the figures don't change.

Zotac GeForce GTX 980 Ti
2816 cores
project 10493 estimated PPD 4999

GT 710
192 cores
project 9162 estimated PPD 294244

Thanks

Re: Failing units, low ppd, and returned units.

Posted: Tue Jul 19, 2016 1:59 am
by bruce
Phildesign wrote:Am I missing something? Low PPD again.

Running Folding by itself, and I made sure there is at least one CPU core spare for each GPU; when I add BOINC the figures don't change.
Oh, but they do, but not immediately.

It's hard to tell from what you posted, but the PPD reported by FAHClient is a projection based on recent history. It takes hours to stabilize, especially when you allocate significant resources to running something else.

As a general rule, BOINC and FAH do NOT work well together. If you choose to run both, you'll probably do much better to spend a week or more running FAH followed by a week or more running BOINC rather than trying to run them concurrently.

My experience with BOINC is decidedly out-dated, but at the time I ran it (years ago) they both competed for the same CPU resources. Looking only at CPU projects, temporarily assuming that neither one uses GPUs, note that FAH awards basic points based on the quantity and complexity of WUs completed PLUS bonus points based on speed of completion (time of uploading results minus time of the original download of the WU). Allowing any FAH WU to sit idle is very costly to PPD. Also, bonus points start accumulating after you've successfully used a passkey on 10 or more WUs.

Dedicating all your CPUs to FAH for long enough to complete some WUs will earn a very significant number of bonus points compared to running half of those CPUs for twice as long. This is decidedly different from BOINC.
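
(A rough illustration of why: the bonus is commonly described as scaling with the square root of how quickly a WU is returned relative to its deadline, so finishing twice as fast more than doubles the points per day. The constant k and the 500-point, 6-day WU below are illustrative assumptions, not an official formula quote.)

Code: Select all

import math

def estimated_points(base_points, deadline_days, days_taken, k=0.75):
    """Rough quick-return-bonus estimate (assumed form, per-project constants):
    points = base * max(1, sqrt(k * deadline / time_taken))."""
    return base_points * max(1.0, math.sqrt(k * deadline_days / days_taken))

# Hypothetical 500-point WU with a 6-day deadline:
fast = estimated_points(500, 6, 1)  # all CPUs, done in 1 day  -> ~1061 points
slow = estimated_points(500, 6, 2)  # half the CPUs, 2 days    -> 750 points
print(f"all CPUs, 1 day:   {fast:.0f} points ({fast:.0f} PPD)")
print(f"half CPUs, 2 days: {slow:.0f} points ({slow / 2:.0f} PPD)")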

Looking at FAH's GPU WUs, you do understand that you should leave one free CPU per GPU, but if that CPU gets used by BOINC, it's not free. Set up your system with its normal workload except without FAH folding on GPUs. Is the right number of CPUs idle?

Re: Failing units, low ppd, and returned units.

Posted: Wed Jul 20, 2016 10:05 am
by foldy
@Phildesign: Is it possible you mixed up the values? GTX 980 Ti gets 294244 and GT 710 gets 4999?
Can you post your folding logfile? Is your PC running 24/7?