
Project 6014 (R 2, C 145, G 59) and (R1 C78 G202)

Posted: Tue Jun 29, 2010 3:09 pm
by ThunderRd
This WU hardlocked my Q6600 repeatedly at 0%. The client started and simply froze the box, requiring a cold reboot. I could reproduce this both overclocked to 3.5 GHz and at the stock 2.4 GHz.

After several attempts to get it started I removed it and downloaded a 6066 in its place, which ran normally.

I'd be interested if anyone else has had a problem with this specific WU, as this machine has been flawless for a long time.

Code:

--- Opening Log file [June 29 14:48:15 UTC] 


# Windows SMP Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.29

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: D:\Program Files\smp
Executable: D:\Program Files\smp\smp629.exe
Arguments: -deino -smp -advmethods -verbosity 9 

[14:48:15] - Ask before connecting: No
[14:48:15] - User name: ThunderRd (Team 45)
[14:48:15] - User ID: 6347889308561847
[14:48:15] - Machine ID: 1
[14:48:15] 
[14:48:15] Loaded queue successfully.
[14:48:15] 
[14:48:15] - Autosending finished units... [June 29 14:48:15 UTC]
[14:48:15] + Processing work unit
[14:48:15] Trying to send all finished work units
[14:48:15] Core required: FahCore_a3.exe
[14:48:15] + No unsent completed units remaining.
[14:48:15] - Autosend completed
[14:48:15] Core found.
[14:48:15] Working on queue slot 00 [June 29 14:48:15 UTC]
[14:48:15] + Working ...
[14:48:15] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 00 -np 4 -checkpoint 15 -verbose -lifeline 2620 -version 629'

[14:48:15] 
[14:48:15] *------------------------------*
[14:48:15] Folding@Home Gromacs SMP Core
[14:48:15] Version 2.22 (Mar 12, 2010)
[14:48:15] 
[14:48:15] Preparing to commence simulation
[14:48:15] - Ensuring status. Please wait.
[14:48:25] - Looking at optimizations...
[14:48:25] - Working with standard loops on this execution.
[14:48:25] - Previous termination of core was improper.
[14:48:25] - Going to use standard loops.
[14:48:25] - Files status OK
[14:48:25] - Expanded 1798410 -> 2396877 (decompressed 133.2 percent)
[14:48:25] Called DecompressByteArray: compressed_data_size=1798410 data_size=2396877, decompressed_data_size=2396877 diff=0
[14:48:25] - Digital signature verified
[14:48:25] 
[14:48:25] Project: 6014 (Run 2, Clone 145, Gen 59)
[14:48:25] 
[14:48:25] Entering M.D.
[14:48:32] Completed 0 out of 500000 steps  (0%)
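For anyone comparing reports across several WUs, here is a minimal sketch of pulling the Project/Run/Clone/Gen and the last reported step count out of a log like the one above. The function name `summarize_log` and both regular expressions are illustrative assumptions based only on the two line formats visible in this log; they are not part of the client.

```python
import re

# Matches lines like "Project: 6014 (Run 2, Clone 145, Gen 59)".
# Assumption: the format is exactly what the log above shows.
PRCG_RE = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)
# Matches lines like "Completed 0 out of 500000 steps  (0%)".
PROGRESS_RE = re.compile(r"Completed\s+(\d+)\s+out of\s+(\d+)\s+steps")


def summarize_log(text):
    """Return (prcg, progress) for the last work unit seen in the log text.

    prcg is a (project, run, clone, gen) tuple or None.
    progress is a (steps_done, steps_total) tuple or None.
    """
    prcg = None
    progress = None
    for line in text.splitlines():
        match = PRCG_RE.search(line)
        if match:
            prcg = tuple(int(g) for g in match.groups())
            progress = None  # a new WU started, so forget older progress
            continue
        match = PROGRESS_RE.search(line)
        if match:
            progress = (int(match.group(1)), int(match.group(2)))
    return prcg, progress


sample = """\
[14:48:25] Project: 6014 (Run 2, Clone 145, Gen 59)
[14:48:25] 
[14:48:25] Entering M.D.
[14:48:32] Completed 0 out of 500000 steps  (0%)
"""

print(summarize_log(sample))
```

If the last progress entry is still at step 0 and the log simply stops, that matches the hard-lock symptom described here: the core never gets far enough to write an error.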

Re: Project 6014 (R 2, C 145, G 59)

Posted: Tue Jul 06, 2010 1:51 am
by ThunderRd
I am appending this post for another WU in the same project.

After this incident I did not pick up another 6014, and processed all other WUs without a problem.

Today another 6014 came: R1 C78 G202. It hardlocked the machine in exactly the same fashion as the previous one. There are no clues in the error log; if there is a client-core comm error I can't see it, because the logging halts suddenly with the hard lock. This crash requires a cold boot to get going again. As before, I could reproduce the problem at will.

After several attempts I dumped the WU, and got a 6701, which runs fine.

Sorry I can't be more helpful with the log.

Re: Project 6014 (R 2, C 145, G 59) and (R1 C78 G202)

Posted: Tue Jul 06, 2010 5:49 am
by bruce
It's most likely overclocking or overheating. Windows is very good about protecting itself from a runaway application, even if that happened (which is unlikely).

Have you confirmed stability running stresscpu2?

We're getting scattered reports that 6014 runs just a tad hotter than other projects.

Re: Project 6014 (R 2, C 145, G 59) and (R1 C78 G202)

Posted: Tue Jul 06, 2010 7:23 am
by ThunderRd
Thanks for answering, bruce.

While I had the WU I was able to make the machine crash both at stock speed and overclocked. Nevertheless, I ran stressCPU as you suggested. It has now been running for almost three hours with no problems, and I do not expect that there will be any. It's a big, cool box, watercooled on the CPU, Northbridge, and GPU, and with stressCPU running, all cores are in the mid to high 50s.

Next time I get one of these [and if it's a problem], I'll keep a control copy to experiment, rather than deleting the WU. I would be interested in finding out why this happens on only this project number.

Re: Project 6014 (R 2, C 145, G 59) and (R1 C78 G202)

Posted: Wed Jul 07, 2010 2:34 pm
by ThunderRd
Well, it happened again, this time R2 C50 G81.

Same situation as before: the client starts, the core engages, and it craps out a minute or two into the run, leaving behind a locked computer requiring a cold boot.

I have never seen this behavior before. This particular quad box has folded thousands of SMP WUs with very few problems. I can't remember the last time it borked; months for sure, maybe a year of 24/7 folding. Yesterday I ran stressCPU for 6 hours without a grumble.

Just to make sure, I copied the requisite files into a temp directory, and downloaded a new WU, this time 6064. I ran the 6064 for several hours with no problem. I loaded up the 6014 and again it froze the machine. I removed the overclock on all components, and ran the SMP client alone, without the GPU client running on my GTX470. I rebooted the machine and tried to run it again. Same exact problem, overclocked or not. Log is identical to the one in my OP.

Clearly all 3 of these WUs can't be bad; but I'm not one to believe in coincidences. There is something different about the way this project runs on this specific machine, and I have yet to find out what it is.

Re: Project 6014 (R 2, C 145, G 59) and (R1 C78 G202)

Posted: Wed Jul 07, 2010 5:56 pm
by bruce
Can you run stresscpu2? Perhaps your heatsink or hs-fan has a problem. The CPU could be overheating as soon as you put a load on it.

Is it a coincidence that someone else has successfully completed the first of your WUs:
WU (P6014 R2 C145 G59) was added to the stats database on 2010-07-03 01:09:12 for 1792.25 points of credit.

Re: Project 6014 (R 2, C 145, G 59) and (R1 C78 G202)

Posted: Wed Jul 07, 2010 7:23 pm
by ThunderRd
As I said before, on your suggestion I ran stressCPU for 6 hours yesterday, with no problems. Ok, I suppose if I ran it longer it might still have crashed, but the crashes I have on 6014 are happening within a minute of running, so running stressCPU longer won't really tell me anything.

I'm sure it is some problem with my equipment, but I don't understand how stressCPU [as well as WUs other than 6014] will run but 6014 will not. I have to keep looking for the source of the problem. Since the machine is watercooled and the temps are quite low, I seriously doubt that heat is the problem. It might be time to check out some memory timings.

I'm glad that someone was able to finish, though.

Re: Project 6014 (R 2, C 145, G 59) and (R1 C78 G202)

Posted: Wed Jul 07, 2010 7:37 pm
by bruce
That's a good idea. Some projects are more sensitive to memory settings than to CPU settings, and nobody knows which until somebody figures it out, like (hopefully) you're about to do.

Re: Project 6014 (R 2, C 145, G 59) and (R1 C78 G202)

Posted: Wed Jul 07, 2010 10:47 pm
by Tynat
For testing the memory, I like MemTest86+. I just burn it to a CD and power up the computer with the CD in the drive. I usually run it for at least 8 hours, but memory issues can turn up sooner.

Re: Project 6014 (R 2, C 145, G 59) and (R1 C78 G202)

Posted: Thu Jul 08, 2010 2:03 pm
by ThunderRd
Tynat wrote: For testing the memory, I like MemTest86+.
The memory itself has been tested and is known good. I am going to experiment with some alternate timings in the primary memory settings and report back afterwards. This DDR3 can run at extremely tight settings; currently it's at DDR3-1750/5-5-7-12-1T. It has been stable since forever. I don't deny, though, that new projects will find ways to test our resolve ;)

Re: Project 6014 (R 2, C 145, G 59) and (R1 C78 G202)

Posted: Fri Jul 09, 2010 2:09 pm
by ThunderRd
It took quite a while to find, but I seem to have gotten this rig stable by loosening tRAS from 12 to 14, so my timings are now 5-5-7-14 at 1T. A 6014 is processing now and seems to be running OK.

IDK if this will help anyone else, but if you have a sudden stability problem on a previously stable machine, you might want to try loosening your memory timings.
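To put the tRAS change in perspective, here is a rough sketch that converts a timing given in clock cycles into nanoseconds. It assumes that "DDR3-1750" means a data rate of 1750 MT/s, so the memory clock is half of that (875 MHz); the helper name `cycles_to_ns` is made up for this example.

```python
def cycles_to_ns(cycles, data_rate_mts):
    """Convert a memory timing in clock cycles to nanoseconds.

    Assumption: data_rate_mts is the DDR data rate in MT/s. DDR moves
    data twice per clock, so the clock in MHz is half the data rate.
    """
    clock_mhz = data_rate_mts / 2
    return cycles * 1000.0 / clock_mhz


# The old and new tRAS values from the post above, at DDR3-1750.
for tras in (12, 14):
    print(f"tRAS {tras} cycles at DDR3-1750 = {cycles_to_ns(tras, 1750):.2f} ns")
```

Under that assumption, going from 12 to 14 cycles lengthens tRAS from about 13.71 ns to 16.00 ns. Whether that margin is what this project needs on other hardware is not something the arithmetic can tell you.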