Project 5113

Posted: Sun Aug 31, 2008 9:53 am
by ZVdP
The first time I received this project, FahCore_a0.exe crashed twice.
Both crashes happened while another application (a game) was running on the computer. I was not sure whether that was the cause, but now the same thing has happened with a new assignment of this project under the same conditions.
After restarting the client, the project continues to fold normally, and the first one was sent after completion without any further problems.

My CPU is overclocked, but I don't think it is the culprit, since this has happened with this specific project only.

[08:18:32] Folding@Home Gromacs 3.3 Core
[08:18:32] Version 1.93 (July 23, 2008)
[08:18:32] 
[08:18:32] Preparing to commence simulation
[08:18:32] - Looking at optimizations...
[08:18:32] - Files status OK
[08:18:32] - Expanded 1160634 -> 6173133 (decompressed 531.8 percent)
[08:18:33] 
[08:18:33] Project: 5113 (Run 18, Clone 82, Gen 1)
[08:18:33] 
[08:18:34] Assembly optimizations on if available.
[08:18:34] Entering M.D.
[08:18:40] FAH Init
[08:18:40] Checkpoint file: 
[08:18:45] (Starting from checkpoint)
[08:18:45] Read checkpoint
[08:18:45] Protein: Calmodulin in water
[08:18:45] Writing local files
[08:18:46] Completed 165000 out of 500000 steps  (33 percent)
[08:18:46] Extra SSE boost OK.
[08:30:33] Writing local files
[08:30:33] Completed 170000 out of 500000 steps  (34 percent)
[08:42:22] Writing local files
[08:42:22] Completed 175000 out of 500000 steps  (35 percent)
[08:54:15] Writing local files
[08:54:15] Completed 180000 out of 500000 steps  (36 percent)
[08:59:27] Quit 101 - NaN detected: (ener[28])
[08:59:27] 
[08:59:27] Simulation instability has been encountered. The run has entered a
[08:59:27]   state from which no further progress can be made.
[08:59:27] This may be the correct result of the simulation, however if you
[08:59:27]   often see other project units terminating early like this
[08:59:27]   too, you may wish to check the stability of your computer (issues
[08:59:27]   such as high temperature, overclocking, etc.).
[08:59:27] Going to send back what have done.
[09:00:55] CoreStatus = FF (255)
[09:00:55] Client-core communications error: ERROR 0xff
[09:00:55] This is a sign of more serious problems, shutting down.

..........................................................RESTART OF FAH

[09:31:09] Folding@Home Gromacs 3.3 Core
[09:31:09] Version 1.93 (July 23, 2008)
[09:31:09] 
[09:31:09] Preparing to commence simulation
[09:31:09] - Ensuring status. Please wait.
[09:31:26] - Looking at optimizations...
[09:31:26] - Working with standard loops on this execution.
[09:31:26] - Previous termination of core was improper.
[09:31:26] - Files status OK
[09:31:27] - Expanded 1160634 -> 6173133 (decompressed 531.8 percent)
[09:31:27] 
[09:31:27] Project: 5113 (Run 18, Clone 82, Gen 1)
[09:31:27] 
[09:31:27] Entering M.D.
[09:31:33] FAH Init
[09:31:33] Checkpoint file: 
[09:31:38] (Starting from checkpoint)
[09:31:38] Read checkpoint
[09:31:38] Protein: Calmodulin in water
[09:31:38] Writing local files
[09:31:39] Completed 180000 out of 500000 steps  (36 percent)
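For anyone who wants to check their own log for the same pattern, here is a minimal Python sketch that scans lines in the format shown above and pulls out the progress lines and the fatal messages. The function name `scan_log` and the choice of which messages count as fatal are assumptions based only on this excerpt; this is not an official client tool, and other core versions may word things differently.

```python
import re

# Patterns are taken from the log excerpt above; other core versions may differ.
PROGRESS = re.compile(r"^\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of (\d+) steps")
FATAL = re.compile(
    r"^\[(\d\d:\d\d:\d\d)\] "
    r"(Quit \d+ - .*|CoreStatus = .*|Client-core communications error: .*)$"
)


def scan_log(lines):
    """Return (progress, errors) found in an iterable of log lines.

    progress: list of (timestamp, steps_done, steps_total)
    errors:   list of (timestamp, message)
    """
    progress, errors = [], []
    for line in lines:
        line = line.rstrip()
        m = PROGRESS.match(line)
        if m:
            progress.append((m.group(1), int(m.group(2)), int(m.group(3))))
            continue
        m = FATAL.match(line)
        if m:
            errors.append((m.group(1), m.group(2)))
    return progress, errors


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
[08:54:15] Writing local files
[08:54:15] Completed 180000 out of 500000 steps  (36 percent)
[08:59:27] Quit 101 - NaN detected: (ener[28])
[09:00:55] CoreStatus = FF (255)
[09:00:55] Client-core communications error: ERROR 0xff
""".splitlines()

if __name__ == "__main__":
    progress, errors = scan_log(SAMPLE)
    print("last reported step:", progress[-1][1])
    for stamp, message in errors:
        print(stamp, message)
```

To run it against a real log, pass an open file object to `scan_log` instead of `SAMPLE`.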

Re: Project 5113

Posted: Mon Sep 01, 2008 3:46 am
by arfyness
I've had a similar situation several times with this same core, but with Project 781.
In my case, the errors are different: several occurrences of ERROR 0x0 and a couple of others, including "NaN detected".

My post is here

Maybe it's something with these projects, or maybe we've found a bug in the Gromacs 3.3 core.

-- Nate

Re: Project 5113

Posted: Mon Sep 01, 2008 4:51 am
by kasson
The Linux and Windows versions of the A0 core are pretty different, so I think you're looking at different issues. If you're getting a lot of 5113 failures partway through the work unit rather than at 0%, it's pretty hard to say whether the problem is your machine or the project. Lots of donors are completing 5113 work units successfully, so if this happens consistently for you with 5113, my guess would be that the A0 core is more demanding of your CPU (or demanding in different ways) and is not tolerating the overclock as well. But that's just a guess.

Re: Project 5113

Posted: Mon Sep 01, 2008 3:09 pm
by arfyness
In my limited experience, any error caused by overclocking has affected the whole system.
For example, blue screen, kernel panic, system hang, sudden poweroff, etc.

It's my assumption that our difficulties with these projects (in my case 781) are not due to hardware.

It's probably a good thing I can't overclock this rig. ;o)

Re: Project 5113

Posted: Mon Sep 01, 2008 7:49 pm
by Baowoulf
7im and, I think, others have said in the past that F@H stresses the system more than normal use does, so even small instabilities can affect it. Overclocks that don't cause any problems normally might cause problems for F@H. For the topic creator, I'd suggest undoing the overclock on the CPU to see if that helps.

[08:59:27] Simulation instability has been encountered. The run has entered a
[08:59:27]   state from which no further progress can be made.
Especially with this. I've seen this or something similar in other logs where they figured overclocking might be the problem.
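As background on why one bad calculation can end a whole run: the original log quits with "NaN detected: (ener[28])". In IEEE 754 floating point, a NaN propagates through every later arithmetic operation, so a single corrupted value cannot be averaged away. Here is a minimal Python sketch of that behaviour. The `energies` list and the `first_nan_index` helper are hypothetical, and this is not a description of how the Gromacs core performs its check.

```python
import math


def first_nan_index(values):
    """Return the index of the first NaN in values, or None if there is none."""
    for i, v in enumerate(values):
        if math.isnan(v):
            return i
    return None


# Hypothetical energy terms; index 2 stands in for a value corrupted by a bad calculation.
energies = [-1520.5, 310.25, float("nan"), 42.0]

total = sum(energies)             # the NaN propagates through the addition
print(math.isnan(total))          # True
print(total == total)             # False: NaN is not equal to itself
print(first_nan_index(energies))  # 2
```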

Re: Project 5113

Posted: Mon Sep 01, 2008 8:46 pm
by ZVdP
Baowoulf wrote:For the topic creator I'd suggest undoing the overclock on the cpu and see if that helps anything.
The project is completed and sent now, but if I get another one of these in the near future, I'll see if it helps.

But what I'm wondering is: how do different projects stress the CPU differently? One could say that two different projects both use 100% of one core (maybe one needs a bit more memory, but that wouldn't be a problem) and that's it. But I expect there is something more to it :)

Re: Project 5113

Posted: Mon Sep 01, 2008 9:22 pm
by kasson
There are a few ways different projects could stress the CPU differently. Memory is one. Also, the Gromacs cores implement a number of different algorithms; depending on which ones a given project uses, one could envision different parts of the chip being used. Gromacs in particular is very efficient in its use of the SSE registers. The A0 Windows core contains a different Fourier transform library that is also SSE-optimized; it may also utilize the chip differently. I don't have that level of hardware expertise, but someone else on the forum may.

Re: Project 5113

Posted: Tue Sep 02, 2008 12:20 pm
by toTOW
Let me try ;)

A CPU can be divided into different subparts that don't do the same operations. Here are the most common ones in a modern CPU (for example a Core 2, an Athlon X2 or a Phenom X4). Keep in mind that there are usually several units of each type in a single CPU:

- ALU (Arithmetic and Logic Unit): the main job of this unit is to do basic arithmetic and logic operations, like addition, comparison, shifting, boolean operators, ... This is the oldest unit in a CPU and it can only work on integers. It is not used a lot in FAH, but it is used in cryptography and compression algorithms.
- FPU (Floating Point Unit): this unit was first added as a separate coprocessor (the 8087 for the 8086, then the 287 for the 286 and the 387 for the 386). It has been integrated into the CPU since the 486. This unit works with floating point numbers (the usual way of representing non-integer values) and does operations like multiplication, division and powers on them, in single or double precision. It is used in many applications, like games (3D rendering) or multimedia applications. In FAH, this unit was used by the Tinker core, and is now used by the Amber core, or by Gromacs when the message "Working with standard loops on this execution" is printed.

The following units have been added as improvements to the two basic units:
- MMX (MultiMedia eXtensions): this unit was added by Intel to the Pentium core to speed up multimedia applications. It is an extension to the ALU, so it can only work on integers too. This extension is usually useless, since multimedia applications mostly use floating point operations ... :roll: It can speed up file compression or cryptography operations, but it's usually not used by FAH.
- SSE (Streaming SIMD Extensions): this unit was first implemented in the Pentium 3. This is one of the most interesting units in a CPU: it can apply one instruction to several floating point values at the same time (SIMD: Single Instruction, Multiple Data). MMX already did this for integers; SSE brought it to floating point on Intel CPUs. Because it works with floating point numbers (an extension to the FPU), it is used by many applications (multimedia, games, computing, ...). In FAH, this unit is used massively by the Gromacs core and its variants (Gromacs, Gromacs33, GroST, GroSMP), and is signaled with the message "Extra SSE boost OK". The original SSE instructions can only be used in single precision calculations. AMD's equivalent is 3DNow!, which first appeared in the K6-2 and was carried over to the Athlon (all current AMD processors support both SSE and 3DNow!).
- SSE2: these are additional instructions added to the original SSE (in Pentium 4 CPUs) to speed up calculations in double precision. They apply to the same types of jobs (games, multimedia, computing, ...) as SSE. These instructions are used by the Double Gromacs core in FAH and its variants (Dgromacs, Dgromacs B and Dgromacs C).
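To make the single versus double precision point concrete: the SSE registers are 128 bits wide, so one register holds four single-precision floats or, with SSE2, two double-precision floats. The Python sketch below only mimics that layout with the standard `struct` module. `simd_add` is a hypothetical helper that loops over the lanes in software, whereas real SSE hardware performs the additions with a single instruction.

```python
import struct

XMM_BYTES = 16  # an SSE (XMM) register is 128 bits wide

# How many values of each precision fit in one register.
# "<f" and "<d" are the standard-size 4-byte float and 8-byte double.
singles_per_register = XMM_BYTES // struct.calcsize("<f")
doubles_per_register = XMM_BYTES // struct.calcsize("<d")


def simd_add(a, b):
    """Add two equal-length 'registers' lane by lane (software stand-in for one SIMD add)."""
    if len(a) != len(b):
        raise ValueError("registers must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


print(singles_per_register)  # 4
print(doubles_per_register)  # 2
print(simd_add([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))  # [11.0, 22.0, 33.0, 44.0]
```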

In addition to the processing units, you need to understand how a CPU gets its data from memory. There are usually three levels of memory:

- Level 1 cache
- Level 2 cache
- System memory

When the CPU is working on small data that fits in the L1 cache, there are no accesses to the other "memories". As the data size grows, it will start using the L2 cache, and then system memory.
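A rough way to picture this is to compare the size of the data being worked on with the size of each level. In the Python sketch below, the cache sizes are placeholder assumptions (they vary from CPU to CPU) and `serving_level` is a hypothetical helper. Nothing here is measured from FAH, and real caches behave less neatly than a simple size comparison.

```python
# Placeholder sizes in bytes; real values depend on the CPU model.
L1_BYTES = 32 * 1024
L2_BYTES = 2 * 1024 * 1024


def serving_level(working_set_bytes, l1=L1_BYTES, l2=L2_BYTES):
    """Name the smallest memory level that can hold the whole working set."""
    if working_set_bytes <= l1:
        return "Level 1 cache"
    if working_set_bytes <= l2:
        return "Level 2 cache"
    return "System memory"


# Three arbitrary example sizes: 16 KiB, 1 MiB and 8 MiB.
for size in (16 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    print(size, "->", serving_level(size))
```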

Now we can talk about power consumption and stability issues. The worst case is of course when a lot of processing units are used, with a lot of data to move between CPU and memory. Here are some examples, with FAH cores and WUs:

- Amber core: it's the lightest load we can find in FAH, as it only uses the ALU and FPU. These units are usually small, and it won't stress the caches much either.
- Gromacs (Dgromacs) core: it's the hardest thing for the CPU to do. It uses the ALU, FPU and SSE (SSE2). If you've opted for BigWU, it will also stress the caches and memory.
- Gromacs33 is like Gromacs, but with newer code; it tends to be more optimized and more stressful.
- GroSMP is a bit different: it doesn't stress the CPU as hard as regular Gromacs because processing power is limited by data transfers between CPU cores, but as you can guess, it puts a lot of stress on the caches and memory subsystem. The A2 SMP core is progressively changing the rule as it makes better use of the CPU cores ... So we can say this is one of the "worst" cases, using the ALU, FPU, SSE, caches and system memory.

Damn, I wrote a lot :o ... tell me if there is something wrong or something you don't understand. If someone finds this post useful, please say so; I might post it in another forum (general FAH or FAH hardware for instance) to continue the topic ;)

Re: Project 5113

Posted: Tue Sep 02, 2008 12:40 pm
by John Naylor
Just as a general explanation, that's brilliant :shock:

Re: Project 5113

Posted: Tue Sep 02, 2008 6:20 pm
by bruce
arfyness wrote:In my limited experience, any error caused by overclocking has affected the whole system.
For example, blue screen, kernel panic, system hang, sudden poweroff, etc.

It's my assumption that our difficulties with these projects (in my case 781) are not due to hardware.

It's probably a good thing I can't overclock this rig. ;o)
For all of the reasons that toTOW has given, this is a weak assumption. We've documented many cases where an overclocked machine can pass certain benchmarks and fail on others. Clearly if you want to run Windows, prime95 is a pretty good test of whether Windows will crash. On the other hand, if you want to run Gromacs which makes heavy use of SSE (but which is not tested by prime95), we strongly recommend you also run stressCPU2 which does a much better job of predicting FAH failure during standard Gromacs runs. It's not really clear that either one will find obscure memory errors but that's what memtest86 is good for.

I'm not sure if any one of them is predictive of Gromacs33 success/failure but I'd certainly run them all -- or remove the overclock -- before I concluded it was not my hardware.

Nevertheless, it is rather difficult to tell whether it's a hardware issue or a WU issue unless we can document repeated failures of the same WU or repeated failure of the same hardware with WUs that others can complete without error.

In this case, ZVdP (team 48658), did complete the WU for full credit so the error was not repeatable.
Your WU (P5113 R18 C82 G1) was added to the stats database on 2008-09-01 12:40:16 for 749 points of credit.

Re: Project 5113

Posted: Tue Sep 02, 2008 11:33 pm
by toTOW
I've also seen some weird things with overclocked AMD CPUs (an Athlon X2 in my case) and SMP clients, which started to fail where all other tests had passed (StressCPU2, uniprocessor FAH and SMP for months). It looks like the integrated memory controller is to blame, but with no explicit proof I can't be sure ... so I had to go back to stock clocks and it's now running fine :(

Re: Project 5113

Posted: Sat Sep 06, 2008 10:42 pm
by arfyness
bruce wrote:
arfyness wrote:In my limited experience, any error caused by overclocking has affected the whole system.
For example, blue screen, kernel panic, system hang, sudden poweroff, etc.

It's my assumption that our difficulties with these projects (in my case 781) are not due to hardware.
It's probably a good thing I can't overclock this rig. ;o)
For all of the reasons that toTOW has given, this is a weak assumption. We've documented many cases where an overclocked machine can pass certain benchmarks and fail on others.
Fair enough ... I stand corrected. :oops:

Thanks for pointing that out. :o)

Re: Project 5113

Posted: Mon Sep 08, 2008 12:36 pm
by ZVdP
I received another p5113 today and got a new crash of fahcore_a0.exe. But this time I spotted a Windows message in the bottom left corner of the screen.
So I clicked on it for more information,
and it said Windows had closed the fahcore:
(I'm translating from Dutch, so the wording may differ from the English original)
Data Execution Prevention (DEP)
for protection against damage to the computer from viruses and other attacks... and so on

I have now added fahcore_a0.exe to the list of programs excluded from DEP.
Hopefully that will stop the crashes.

Are there other reports about problems with DEP?

Re: Project 5113

Posted: Mon Sep 08, 2008 5:11 pm
by Baowoulf
Have you tried backing off or turning off the overclock of your CPU to see whether P5113 works yet?

Re: Project 5113

Posted: Sat Sep 13, 2008 8:43 am
by ZVdP
Baowoulf wrote:Have you tried backing off or turning off the overclock of your CPU to see whether P5113 works yet?
Indeed, I tried it yesterday, and no crashes anymore.

I have now set the voltages a bit higher; they were at the minimum that would work. Now I'm only waiting for another 5113 to test.