Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 3:59 pm
by DrSpalding
Hi, my -bigadv machine just turned in (or tried to turn in) early results on the above project at 44% complete. The CoreStatus was 0xC0000005 (which is a STATUS_ACCESS_VIOLATION fault), followed by a Client-core communications error. I suspect that the core terminated w/o telling the client what was happening. Is there any info on the stability of this WU, or should I look into the machine for hardware issues?
Here is the relevant log info:
[05:39:55] - Preparing to get new work unit...
[05:39:55] Cleaning up work directory
[05:39:55] + Attempting to get work packet
[05:39:55] Passkey found
[05:39:55] - Connecting to assignment server
[05:39:55] - Successful: assigned to (171.67.108.22).
[05:39:55] + News From Folding@Home: Welcome to Folding@Home
[05:39:56] Loaded queue successfully.
[05:40:54] + Closed connections
[05:40:54]
[05:40:54] + Processing work unit
[05:40:54] Core required: FahCore_a3.exe
[05:40:54] Core found.
[05:40:54] Working on queue slot 06 [July 18 05:40:54 UTC]
[05:40:54] + Working ...
[05:40:54]
[05:40:54] *------------------------------*
[05:40:54] Folding@Home Gromacs SMP Core
[05:40:54] Version 2.22 (Mar 12, 2010)
[05:40:54]
[05:40:54] Preparing to commence simulation
[05:40:54] - Looking at optimizations...
[05:40:54] - Created dyn
[05:40:54] - Files status OK
[05:40:58] - Expanded 24821153 -> 30791309 (decompressed 124.0 percent)
[05:40:58] Called DecompressByteArray: compressed_data_size=24821153 data_size=30791309, decompressed_data_size=30791309 diff=0
[05:40:59] - Digital signature verified
[05:40:59]
[05:40:59] Project: 2684 (Run 2, Clone 5, Gen 10)
[05:40:59]
[05:40:59] Assembly optimizations on if available.
[05:40:59] Entering M.D.
[05:41:09] Completed 0 out of 250000 steps (0%)
[06:30:30] Completed 2500 out of 250000 steps (1%)
[07:15:45] Completed 5000 out of 250000 steps (2%)
...
[14:30:35] Completed 107500 out of 250000 steps (43%)
[15:15:24] Completed 110000 out of 250000 steps (44%)
[15:19:19] Gromacs cannot continue further.
[15:19:19] Going to send back what have done -- stepsTotalG=250000
[15:19:19] Work fraction=-1.#IND steps=250000.
[15:19:49] logfile size=97434 infoLength=97434 edr=0 trr=23
[15:19:49] logfile size: 97434 info=97434 bed=0 hdr=23
[15:19:49] - Writing 97970 bytes of core data to disk...
[15:19:52] CoreStatus = C0000005 (-1073741819)
[15:19:52] Client-core communications error: ERROR 0xc0000005
[15:19:52] Deleting current work unit & continuing...
[15:20:34] - Preparing to get new work unit...
[15:20:34] Cleaning up work directory
[15:20:34] + Attempting to get work packet
[15:20:34] Passkey found
[15:20:34] - Connecting to assignment server
[15:20:34] - Successful: assigned to (171.67.108.22).
[15:20:34] + News From Folding@Home: Welcome to Folding@Home
[15:20:35] Loaded queue successfully.
[15:21:05] + Closed connections
[15:21:10]
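A side note on the CoreStatus line in the log: C0000005 and -1073741819 are the same 32-bit value, printed once as unsigned hex and once as a signed integer. A minimal Python sketch of that conversion (the function name `to_signed32` is made up for illustration and is not part of the client):

```python
def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # If the top bit is set, the signed reading is the value minus 2**32.
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


core_status = int("C0000005", 16)  # hex string as it appears in the log
print(hex(core_status), to_signed32(core_status))
```

Either form can be used when searching for the error; they refer to the same status code.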
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 4:11 pm
by toTOW
No data for this WU in the DB yet ...
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 4:41 pm
by DrSpalding
The assignment server gave the same WU back to the machine, so the results must not have gotten uploaded. We'll see in another 33 hours if it does the same thing again and if so, I'll have to get the machine to move on to another WU manually.
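For reference, the 33-hour figure can be estimated from the log timestamps: consecutive "Completed" lines are about 45 minutes apart per 1%, so reaching 44% again takes roughly 44 x 45 minutes. A rough Python sketch of that arithmetic, assuming the two lines are less than 24 hours apart (the log timestamps carry no date); the helper names are hypothetical:

```python
import re
from datetime import timedelta

# Matches lines such as "[06:30:30] Completed 2500 out of 250000 steps (1%)"
LINE = re.compile(r"\[(\d\d):(\d\d):(\d\d)\] Completed (\d+) out of (\d+) steps")


def parse(line):
    """Return (time of day, steps done, total steps) for a progress line."""
    m = LINE.search(line)
    if m is None:
        raise ValueError(f"not a progress line: {line!r}")
    h, mi, s, done, total = map(int, m.groups())
    return timedelta(hours=h, minutes=mi, seconds=s), done, total


def time_per_percent(first_line, second_line):
    """Average wall-clock time per 1% of the WU between two progress lines."""
    t0, d0, total = parse(first_line)
    t1, d1, _ = parse(second_line)
    elapsed = (t1 - t0) % timedelta(days=1)  # tolerate a wrap past midnight
    percent = 100 * (d1 - d0) / total
    return elapsed / percent


per_percent = time_per_percent(
    "[06:30:30] Completed 2500 out of 250000 steps (1%)",
    "[07:15:45] Completed 5000 out of 250000 steps (2%)",
)
hours_to_44 = (per_percent * 44).total_seconds() / 3600
print(per_percent, f"{hours_to_44:.1f} h to reach 44%")
```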
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 6:54 pm
by bruce
Please do a thorough memory test on your system at the actual temperatures that your system sees when folding. There is no single cause for 0xC0000005 but the most common one is memory errors (including memory timing settings that are just a bit too tight for your memory as well as chipset errors that result in memory errors).
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 7:02 pm
by Grandpa_01
bruce wrote: Please do a thorough memory test on your system at the actual temperatures that your system sees when folding. There is no single cause for 0xC0000005 but the most common one is memory errors (including memory timing settings that are just a bit too tight for your memory as well as chipset errors that result in memory errors).
Good advice, bruce, especially this part:
Please do a thorough memory test on your system at the actual temperatures that your system sees when folding.
which is very hard to do.
Do you know of a memory test that will use 100% of the CPU and create the heat that folding does? Or a memory test that can be run while folding?
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 7:06 pm
by PantherX
Grandpa_01 wrote: ...Do you know of a memory test that will use 100% of the CPU and create the heat that folding does...
I use IntelBurnTest configured for Maximum RAM, and it generates more heat than F@H does. It stresses the RAM and CPU at the same time, so I really like it. After that, I run StressCPU to ensure that the stable system is also scientifically stable, as F@H requires.
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 7:15 pm
by DrSpalding
What do you consider a suitable memory test? Prime95 with a 10GB data set while running the GPU client on the GTX275 as well? The problem with doing a standalone memory test (a la memtest x86 or the Win7 memdiag) is that I can't saturate the machine further with the bandwidth and heat from the GPU as well.
The memory is 7-7-7-24 @ 1333 MHz, but it is running at a slower speed, at least according to the BIOS, at 7-7-7-24 @ ~1146MHz.
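To put those timings in absolute terms, here is a small Python sketch of the usual CAS latency conversion. It assumes the quoted "MHz" figures are DDR data rates (MT/s), so the memory clock is half of that; if the BIOS is reporting the actual clock instead, the numbers would differ. The function name is just for illustration.

```python
def cas_latency_ns(cl: int, data_rate_mts: float) -> float:
    """CAS latency in nanoseconds: CL cycles at a clock of data_rate / 2 MHz."""
    return cl * 2000.0 / data_rate_mts


rated = cas_latency_ns(7, 1333)   # timings at the rated speed
actual = cas_latency_ns(7, 1146)  # same timings at the speed the BIOS reports
print(f"{rated:.1f} ns rated, {actual:.1f} ns as configured")
```

Under that assumption, the same CL7 at the lower speed is a longer (looser) absolute latency than the rated one, so the CAS setting itself should not be tighter than what the modules are specified for.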
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 7:29 pm
by 7im
Prime is lame, it's not up to the task, hence your problem with not being able to saturate the system.
You could also use memtestG80, which is a memory tester for NV cards. Then combine it with whichever of the tools recommended above you like.
IBT and OCCT (mixed test) are about as good as they get for maxing both CPU and Memory. Throw in the memtestG80 and you're all set.
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 7:33 pm
by PantherX
@DrSpalding: (Based on my experience) Prime95 produces less stress than F@H and IntelBurnTest is more stressful than F@H if configured properly. I prefer IntelBurnTest with 10 iterations @ Maximum setting. What I do is first fire off IBT @ Maximum for 10 iterations. If it passes, I set it to 6 threads with 3 GB RAM and then run Furmark @ Maximum settings and run Hyper PI 0.99 Beta @ 32 Million on the last free thread. Thus I have stressed my CPU and GPU. I use HWMonitor and if any temperature passes 90C, I terminate everything, downclock, and repeat until the maximum temperature value is <= 90C.
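The stop rule in that routine can be written down as a tiny Python sketch. The readings are typed in by hand from whatever the monitoring tool shows (no HWMonitor API is assumed), and the sensor names and function name are made up for the example:

```python
TEMP_LIMIT_C = 90  # the limit used in the routine described above


def next_action(peak_temps_c, limit_c=TEMP_LIMIT_C):
    """Decide what to do after a stress run, given each sensor's peak temperature."""
    hottest = max(peak_temps_c.values())
    # Any sensor above the limit means: stop everything, downclock, run again.
    return "downclock and retest" if hottest > limit_c else "keep settings"


# Hypothetical peak readings noted during one run.
print(next_action({"CPU core 0": 88, "CPU core 1": 93, "GPU": 81}))
```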
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 7:43 pm
by DrSpalding
PantherX wrote: @DrSpalding: (Based on my experience) Prime95 produces less stress than F@H and IntelBurnTest is more stressful than F@H if configured properly. I prefer IntelBurnTest with 10 iterations @ Maximum setting. What I do is first fire off IBT @ Maximum for 10 iterations. If it passes, I set it to 6 threads with 3 GB RAM and then run Furmark @ Maximum settings and run Hyper PI 0.99 Beta @ 32 Million on the last free thread. Thus I have stressed my CPU and GPU. I use HWMonitor and if any temperature passes 90C, I terminate everything, downclock, and repeat until the maximum temperature value is <= 90C.
1. Where do you get IBT and/or OCCT? I found downloads for both (IBT v2.3 and OCCT v3) on guru3d.com but don't know which versions are the up-to-date ones.
2. I have noted that Prime95 gets the cpu cores a couple of degrees hotter than F@H seems to. It seems to hold that high temperature more stably than F@H does too, FWIW.
3. Is running a GPU client sufficient to test the GPU + CPU at the same time when running IBT or OCCT?
Thanks,
Dan
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 7:49 pm
by PantherX
1) You can check the tools list; in most cases, it contains links to the latest software (FYI, IBT 2.4 is the latest).
2) YMMV, but on my system Prime95 took longer to reach the temperatures of F@H and never exceeded them. IBT, on the other hand, overshot the F@H temps in <5 minutes.
3) IBT is specific to the CPU only. I have heard that OCCT includes LinX (used in IBT) and also has its own GPU stress software. I haven't used OCCT so can't be specific.
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Mon Jul 19, 2010 8:14 pm
by bruce
DrSpalding wrote: Is running a GPU client sufficient to test the GPU + CPU at the same time when running IBT or OCCT?
Probably not.
First, running a GPU client means different things to the CPU, depending on whether you have ATi, Fermi, or a G80.
Second (as noted earlier), it's almost impossible to test everything simultaneously. For simplicity's sake I'm going to divide a system into RAM, GPU, and CPU and further divide the CPU into ALU and FPU. A single test can maximize the use of any one of them but not all of them simultaneously. Picking a test that deals with each one separately is fairly easy, but finding something that comes close to maximizing all of them simultaneously is next to impossible. FAH will also be limited by the maximum of one of them but will use the others at somewhat less than maximum, so finding something that is close to the way your system runs FAH means you'll probably have to run more than one test. That's one reason why you always have to back down from whatever settings seem to be stable.
Prime probably maximizes the use of the ALU but doesn't maximize the FPU or RAM and certainly not the GPU.
Memtest86 probably maximizes RAM but doesn't use much ALU or FPU or GPU.
The GPU client probably comes close to maximizing the use of the GPU but doesn't saturate the ALU and uses virtually no FPU. (Then, too, the various GPU benchmark tests may maximize different aspects of the GPU, but let's not go into that.)
StressCPU2 probably maximizes the FPU similar to FAH's SMP client but may not catch errors in other components.
Integrated tests do a better job of balancing the use of ALU/FPU/RAM, so adding a GPU client or benchmark helps find heating issues, but we can then debate the relative priority of the two tasks.
No matter what tests you run, you'll probably need more than one and you'll still need to add additional margin.
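The breakdown above can be captured as a simple coverage table. This Python sketch only records which component each test is said to push hardest in this post; the names and the `uncovered` helper are illustrative, and the mapping is a simplification of the "probably maximizes" statements, not a measurement.

```python
COMPONENTS = {"ALU", "FPU", "RAM", "GPU"}

# Which component each test mainly loads, per the description above.
STRESSES = {
    "Prime": {"ALU"},
    "Memtest86": {"RAM"},
    "GPU client": {"GPU"},
    "StressCPU2": {"FPU"},
    "integrated test": {"ALU", "FPU", "RAM"},
}


def uncovered(tests):
    """Return the components that none of the chosen tests mainly loads."""
    covered = set().union(*(STRESSES[t] for t in tests))
    return COMPONENTS - covered


print(sorted(uncovered(["Prime", "Memtest86"])))
print(sorted(uncovered(["integrated test", "GPU client"])))
```

An empty result only means every component is loaded by some test in the set. It says nothing about loading them all at once, which is the harder problem, so the extra margin is still needed.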
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Tue Jul 20, 2010 6:06 pm
by DrSpalding
I ran IBT several times standalone, including a set of 10 at 10+GB of memory allocation, and it worked flawlessly, ending in about 3 hours. However, for the next set I also ran the GPU client with its priority set to normal so that it would be sure to run. Within about 11 minutes, the machine bug-checked on me with the nebulous "MACHINE_CHECK_EXCEPTION" of 9C. That one is a catch-all for various MCA exceptions from the CPU, and w/o a debugger attached to the machine, it is hard to figure anything else out from it. I suspect memory or bus issues since the nVidia GTX275 GPU running an F@H client really only intersects the machine at the bus and memory. If anyone has an idea what I should tweak first (vCore for CPU, memory timings, etc.), please feel free to drop me a message.
For now, I am running w/o the GPU client until I get it sorted out.
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Tue Jul 20, 2010 6:40 pm
by PantherX
@DrSpalding: Have you overclocked the System? If yes, return everything to stock and see if the error arises.
Have you changed any variables in the motherboard? If yes, change everything to stock and then try it again.
Have you tried to change the PCI-E slot of the GPU and repeat the test? If yes, was the error the same one or not?
Can you run MemtestG80 on the GPU without any problems? (more details in my guide; link in sig)
Is your PSU stable enough to provide enough power to both the CPU and GPU when both are at 100% load?
Re: Project: 2684 (Run 2, Clone 5, Gen 10)
Posted: Tue Jul 20, 2010 7:07 pm
by B2K24
I got errors with bigadv when I manually set the timings in the BIOS to 7-7-7-20, as the sticker on my Corsair Dominator C7's reads. When I put all timings on AUTO, the BIOS gives them 9-9-9-24, and I have had no stability issues running with auto timings.