What does this error mean?

Posted: Sat Dec 29, 2018 3:01 pm
by Prettz
I was scanning the log for errors after noticing some bizarrely low points on certain stat updates, and, although I didn't find anything causing that, I did notice some strange errors from the GPU client that have occurred several times over the past week. The GPU core says it "probably crashed", but there's no failure of the WU, it just picks up from a checkpoint. The latest instance just occurred now while I was looking at the log, so I was able to see what's going on with other stuff just as it happened (nothing).

Code:

14:29:42:WU01:FS01:0x21:Completed 7500000 out of 12500000 steps (60%)
14:31:31:WU02:FS00:0xa7:Completed 255000 out of 500000 steps (51%)
14:32:21:WU01:FS01:0x21:Completed 7625000 out of 12500000 steps (61%)
14:32:24:WARNING:WU01:FS01:FahCore returned an unknown error code which probably indicates that it crashed
14:32:24:WARNING:WU01:FS01:FahCore returned: UNKNOWN_ENUM (127 = 0x7f)
14:32:24:WU01:FS01:Starting
14:32:24:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/User/AppData/Roaming/FAHClient/cores/cores.foldingathome.org/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 01 -suffix 01 -version 704 -lifeline 6208 -checkpoint 10 -gpu 0 -gpu-vendor nvidia
14:32:24:WU01:FS01:Started FahCore on PID 4620
14:32:24:WU01:FS01:Core PID:11624
14:32:24:WU01:FS01:FahCore 0x21 started
14:32:25:WU01:FS01:0x21:*********************** Log Started 2018-12-29T14:32:25Z ***********************
14:32:25:WU01:FS01:0x21:Project: 14147 (Run 2, Clone 19, Gen 89)
14:32:25:WU01:FS01:0x21:Unit: 0x000000610002894c5c0554eb0dbf966a
14:32:25:WU01:FS01:0x21:CPU: 0x00000000000000000000000000000000
14:32:25:WU01:FS01:0x21:Machine: 1
14:32:25:WU01:FS01:0x21:Digital signatures verified
14:32:25:WU01:FS01:0x21:Folding@home GPU Core21 Folding@home Core
14:32:25:WU01:FS01:0x21:Version 0.0.18
14:32:25:WU01:FS01:0x21:  Found a checkpoint file
14:32:27:WU01:FS01:0x21:Completed 7600000 out of 12500000 steps (60%)
14:32:27:WU01:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
14:32:58:WU01:FS01:0x21:Completed 7625000 out of 12500000 steps (61%)
The client GUI doesn't indicate anything happened, it just keeps going as normal. And GPU temperature monitor shows nothing unusual is happening with it. The video drivers didn't crash (and that always causes WU to fail anyway). So what is this "crash" that doesn't cause a failed WU, and do I need to be concerned?

Re: What does this error mean?

Posted: Sat Dec 29, 2018 6:42 pm
by foldy
Stats are sometimes about a day behind, so you first get very low PPD on one day and double PPD on the next day, once the stats server has caught up.

Maybe the FahCore logfile says what is disturbing it, e.g. C:\ProgramData\FAHClient\work\01\01\log.txt
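
For anyone who wants to check how often this happens, a log can be scanned for the "FahCore returned:" warning. This is only a rough sketch: the pattern is modelled on the two warning lines quoted in the first post, the function name `find_core_returns` is made up for illustration, and other client versions may format the line differently.

```python
import re

# Modelled on the warning line quoted in the first post, e.g.
# 14:32:24:WARNING:WU01:FS01:FahCore returned: UNKNOWN_ENUM (127 = 0x7f)
# Assumption: other client versions may format this line differently.
CORE_RETURN = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<name>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)"
)

def find_core_returns(lines):
    """Return (time, wu, slot, name, code) for each 'FahCore returned:' warning."""
    hits = []
    for line in lines:
        m = CORE_RETURN.match(line.strip())
        if m:
            hits.append((m["time"], m["wu"], m["slot"], m["name"], int(m["code"])))
    return hits

# Lines copied from the log excerpt in the first post.
sample = [
    "14:32:21:WU01:FS01:0x21:Completed 7625000 out of 12500000 steps (61%)",
    "14:32:24:WARNING:WU01:FS01:FahCore returned an unknown error code which probably indicates that it crashed",
    "14:32:24:WARNING:WU01:FS01:FahCore returned: UNKNOWN_ENUM (127 = 0x7f)",
    "14:32:24:WU01:FS01:Starting",
]
hits = find_core_returns(sample)
for hit in hits:
    print(hit)
```

To run it against a real log, pass an open file object instead of `sample`; the path to use depends on your installation.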

Re: What does this error mean?

Posted: Sun Dec 30, 2018 12:23 am
by bruce
FAHCore_21 has been known to crash like that. Nobody has determined the reason, though a number of theories have been suggested.

The FAHClient was specifically redesigned to retry from the previous checkpoint and, for the most part, it resumes work and completes the WU. If it fails 3x, though, it will abort the WU to allow for an appropriate error recovery in case you have a corrupt checkpoint or something like that.
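
The retry behaviour described above can be sketched roughly as follows. This is not the FAHClient's actual code: `run_core`, `make_fake_core` and the return values are hypothetical, and counting failures over the whole WU (rather than consecutive ones) is an assumption. Only the "resume from checkpoint, give up after 3 failures" rule comes from this thread.

```python
MAX_FAILURES = 3  # from the thread: the WU is aborted after 3 failures

def run_work_unit(run_core, total_steps, max_failures=MAX_FAILURES):
    """Run a WU, resuming from the last checkpoint after each core crash.

    run_core(start_step) is a hypothetical callable returning
    (last_checkpoint_step, crashed).
    """
    checkpoint = 0
    failures = 0
    while checkpoint < total_steps:
        checkpoint_after, crashed = run_core(checkpoint)
        # Work done since the last checkpoint is lost on a crash.
        checkpoint = max(checkpoint, checkpoint_after)
        if crashed:
            failures += 1
            if failures >= max_failures:
                return "aborted", checkpoint
    return "completed", checkpoint

def make_fake_core(crash_points, total_steps, checkpoint_every=100):
    """Build a fake core that crashes once at each listed step (for illustration)."""
    pending = list(crash_points)

    def run_core(start):
        if pending:
            crash_at = pending.pop(0)
            # Only progress up to the last checkpoint before the crash survives.
            last = max(start, (crash_at // checkpoint_every) * checkpoint_every)
            return last, True
        return total_steps, False

    return run_core

one_crash = run_work_unit(make_fake_core([250], 1000), 1000)
three_crashes = run_work_unit(make_fake_core([250, 260, 270], 1000), 1000)
print(one_crash)
print(three_crashes)
```

With a single crash the fake WU still completes, which matches what the log in the first post shows; three crashes lead to the WU being dumped.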

Re: What does this error mean?

Posted: Sun Dec 30, 2018 4:01 pm
by JimF
Prettz wrote: The GPU core says it "probably crashed", but there's no failure of the WU, it just picks up from a checkpoint. The latest instance just occurred now while I was looking at the log, so I was able to see what's going on with other stuff just as it happened (nothing).
Several years ago, I reported here an unusual crash like that when I was running a VirtualBox project under BOINC. I don't remember which project it was, but when I stopped it the errors went away. They were harmless, except for the loss of a few minutes of folding time, but I was concerned that worse might happen.

Of course, the other possibility is overclocking your card, and factory overclocks are the same as software overclocks insofar as crunching is concerned.

Re: What does this error mean?

Posted: Sun Dec 30, 2018 6:41 pm
by ProDigit
If you're running on a regular desktop PC or notebook, errors can occur for a variety of reasons.
Billions of bytes are being read and written per second, so occasional errors are normal.
Computer hardware can use ECC: Error Correction Codes.
It's a code, much like a hash, that serves to validate whether the data is correct or not.
The ECC doesn't contain the data itself, just check bits computed from it, which verify every so many bytes.
This scheme is not 100% flawless.
According to Google, about 25 to 70 errors can occur every million calculations.
Most of them have minor consequences, as they can be corrected by the ECC algorithm.
Some of them result in system crashes, halts, or freezes, and a very few in hardware failures, but the latter is very uncommon.

If you want something safer than what your PC offers, you'll have to equip it with server RAM, with ECC.
A lot of motherboards don't support ECC memory.
Supposedly its algorithms are so strong that an uncorrected error is not supposed to appear during the lifetime of the PC or RAM (usually 10 years, unless the hardware fails or is damaged).
The modules even have an extra chip for the check bits, so that if an error does occur, the ECC can correct it.
ECC supposedly is also built into the CPU for CPU computation corrections.
ECC memory reduces the chance of errors on the path from CPU to RAM and back to the CPU.
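
To make the idea of check bits concrete, here is a toy sketch using the classic Hamming(7,4) code, which stores 4 data bits with 3 check bits and can correct any single flipped bit. It is only an illustration of the principle: real ECC memory uses wider codes (a 72-bit word carrying 64 data bits is the usual layout), and the function names here are made up.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # the syndrome is the 1-based position of the bad bit
    if pos:
        c[pos - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], pos

data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[4] ^= 1  # simulate a single bit flip at position 5
recovered, where = hamming74_decode(word)
print(recovered, where)
```

A single flipped bit is located and repaired without the data being stored twice. Two flipped bits in the same word would fool this small code, which is why no scheme of this kind is 100% flawless.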

Graphics cards, on the other hand, don't have ECC memory, unless you get one of those high-end $2k+ cards that are more for businesses than for gaming or folding.
So while ECC memory saves you from crashing horribly on the RAM side, it doesn't do anything on the GPU side.

If in the future, NVidia or AMD wants to create mining or folding cards, it would be best if they equipped the cards with ECC memory.
At the cost of a slowdown of a fraction of a percent vs non-ECC memory, it's a valuable trade-off.

In games, errors in the GPU or VRAM result in glitches that usually aren't repeatable.
In folding apps, even one glitch can produce faulty work units, with zero credits given to the faulty WU.

I'm not sure if the entire WU is being discarded, or if only the erroneous part of the WU is discarded.

Since most of the computations in folding and bitcoin mining are done on the GPU, those computations will thus also be more vulnerable to errors.
Thankfully FAH doesn't use a lot of RAM (between 100MB and 800MB of VRAM per card, from what I see).

And, like another user said, overclocking can make the system unstable, and nullify your PPDs on certain WUs.

Re: What does this error mean?

Posted: Sun Dec 30, 2018 6:52 pm
by foldy
I'm not sure if the entire WU is being discarded, or if only the erroneous part of the WU is discarded.
The work unit first retries from the last checkpoint and continues. Only after 3 failures is the work unit discarded and reassigned to another user.

As long as you don't get the error "Bad work unit" then your hardware is stable.

Re: What does this error mean?

Posted: Sun Dec 30, 2018 7:15 pm
by Joe_H
ProDigit wrote: If in the future, NVidia or AMD wants to create mining or folding cards, it would be best if they equipped the cards with ECC memory.
At the cost of a slowdown of a fraction of a percent vs non-ECC memory, it's a valuable trade-off.
They already have GPU cards designed for continuous computing and equipped with ECC VRAM; they are the workstation grade cards you already alluded to. They are more expensive because of the more expensive RAM and also because they use GPU chips that are binned for continuously processing numbers without errors occurring. As part of that they also usually run at a lower clock than the same chips used in consumer grade GPUs.

F@h is designed to run on consumer level hardware and OSes. It can take advantage of higher grade hardware, but has been designed to deal with the occasional errors that can occur. Since folding and mining are a niche market, I do not see any chance that AMD or nVidia would create a third line of cards just for them.

Re: What does this error mean?

Posted: Sun Dec 30, 2018 8:38 pm
by ProDigit
They were initially thinking of doing just that, creating cards specifically targeted to mining.
That was before the 'bitcoin crash' earlier this year, coinciding with the oversupply of bitcoin miners.
The GTX 1060 was initially created for bitcoin mining, and rumors were that Nvidia was going to release a lower RAM, higher CPU, port-less card specifically for mining and folding; in an effort to reduce cost on those cards.
But when the market collapsed, the current oversupply became apparent, and AMD stepped in by switching to 7nm cards (down from 14nm), those ideas were scrapped.

No ECC cards for bitcoin mining, but ideas for cards specifically for mining were coined, especially by Nvidia.

Re: What does this error mean?

Posted: Sun Dec 30, 2018 9:29 pm
by bruce
ECC memory uses 9 bits instead of 8 so it typically costs about 10% more and the RAM uses about 10% more power. Next time you get ready to buy a new GPU, which will you choose? Decisions made by the typical home computer owner tend to be made based on prices.
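
As a quick check of the 9-versus-8 figure, the extra storage can be worked out directly. This sketch assumes the overhead scales with the raw bit count, which ignores pricing and controller costs; the helper name `ecc_overhead` is made up.

```python
def ecc_overhead(data_bits, total_bits):
    """Fraction of extra bits stored on top of the data bits."""
    return (total_bits - data_bits) / data_bits

print(f"{ecc_overhead(8, 9):.1%}")    # 9 bits stored per 8 data bits
print(f"{ecc_overhead(64, 72):.1%}")  # 72-bit word carrying 64 data bits
```

Both cases come out at 12.5% extra bits, which is in line with the rough "about 10%" quoted for cost and power.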

For the typical gamer, a one-bit error that makes a slight color alteration of one pixel on one frame of your action game will almost never be noticed. On the other hand, commercial servers generally process a lot more critical data. Given that the GPU manufacturer designs their product for business servers or for gamers (and won't be adding an intermediate level platform) you have to make the choice yourself.

Next time you consider a GeForce GPU, find a Quadro with similar features and make your own choice (or do the same for ATI). You'll probably find that the "professional" GPU (with ECC) is as much as 4x the price of the "consumer" oriented card (without ECC). Certainly there will be other differences (like Double Precision GFLOPS), but the professional cards sell to a much smaller marketplace, so they sell at a premium price, whereas price competition for consumer cards is much more significant.