
Project: 2633 (Run 8, Clone 24, Gen 6) - UNSTABLE_MACHINE

Posted: Sun Aug 29, 2010 10:55 pm
by Napoleon
Bad WU?


[21:44:48] Project: 2633 (Run 8, Clone 24, Gen 6)
[21:44:48] 
[21:44:48] Assembly optimizations on if available.
[21:44:48] Entering M.D.
[21:44:54] Completed 0 out of 625000 steps  (0%)
[21:56:55] Completed 6250 out of 625000 steps  (1%)
[21:59:24] - Autosending finished units... [August 29 21:59:24 UTC]
[21:59:24] Trying to send all finished work units
[21:59:24] + No unsent completed units remaining.
[21:59:24] - Autosend completed
[22:12:13] mdrun returned 255
[22:12:13] Going to send back what have done -- stepsTotalG=625000
[22:12:13] Work fraction=0.0188 steps=625000.
[22:12:17] logfile size=0 infoLength=0 edr=0 trr=25
[22:12:17] logfile size: 0 info=0 bed=0 hdr=25
[22:12:17] - Writing 642 bytes of core data to disk...
[22:12:17]   ... Done.
[22:12:17] 
[22:12:17] Folding@home Core Shutdown: UNSTABLE_MACHINE
[22:12:20] CoreStatus = 7A (122)
[22:12:20] Sending work to server
[22:12:20] Project: 2633 (Run 8, Clone 24, Gen 6)


[22:12:20] + Attempting to send results [August 29 22:12:20 UTC]
[22:12:20] - Reading file work/wuresults_08.dat from core
[22:12:20]   (Read 642 bytes from disk)
[22:12:20] Connecting to http://171.67.108.24:8080/
[22:12:21] Posted data.
[22:12:21] Initial: 0000; - Uploaded at ~1 kB/s
[22:12:21] - Averaged speed for that direction ~75 kB/s
[22:12:21] + Results successfully sent
[22:12:21] Thank you for your contribution to Folding@Home.
[22:12:25] Trying to send all finished work units
[22:12:25] + No unsent completed units remaining.
[22:12:25] - Preparing to get new work unit...
[22:12:25] Cleaning up work directory
[22:12:25] + Attempting to get work packet
[22:12:25] Passkey found
[22:12:25] - Will indicate memory of 1536 MB
[22:12:25] - Connecting to assignment server
[22:12:25] Connecting to http://assign.stanford.edu:8080/
[22:12:26] Posted data.
[22:12:26] Initial: 40AB; - Successful: assigned to (171.64.65.54).
[22:12:27] + News From Folding@Home: Welcome to Folding@Home
[22:12:27] Loaded queue successfully.
[22:12:27] Sent data
[22:12:27] Connecting to http://171.64.65.54:8080/
[22:12:28] Posted data.
[22:12:28] Initial: 0000; - Receiving payload (expected size: 1768007)
[22:12:33] - Downloaded at ~345 kB/s
[22:12:33] - Averaged speed for that direction ~608 kB/s
[22:12:33] + Received work.
[22:12:33] Trying to send all finished work units
[22:12:33] + No unsent completed units remaining.
[22:12:33] + Closed connections
[22:12:38] 
[22:12:38] + Processing work unit
[22:12:38] Core required: FahCore_a3.exe
[22:12:38] Core found.
[22:12:38] Working on queue slot 09 [August 29 22:12:38 UTC]
[22:12:38] + Working ...
[22:12:38] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 09 -np 2 -priority 96 -checkpoint 30 -forceasm -verbose -lifeline 1984 -version 630'

[22:12:38] 
[22:12:38] *------------------------------*
[22:12:38] Folding@Home Gromacs SMP Core
[22:12:38] Version 2.22 (Mar 12, 2010)
[22:12:38] 
[22:12:38] Preparing to commence simulation
[22:12:38] - Assembly optimizations manually forced on.
[22:12:38] - Not checking prior termination.
[22:12:38] - Expanded 1767495 -> 1971489 (decompressed 111.5 percent)
[22:12:38] Called DecompressByteArray: compressed_data_size=1767495 data_size=1971489, decompressed_data_size=1971489 diff=0
[22:12:38] - Digital signature verified
[22:12:38] 
[22:12:38] Project: 6020 (Run 0, Clone 114, Gen 278)
[22:12:38] 
[22:12:38] Assembly optimizations on if available.
[22:12:38] Entering M.D.
[22:12:46] Completed 0 out of 500000 steps  (0%)
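
Editor's sketch: the log above shows the pattern to look for when a core bails out early: a `Work fraction=` line, then a `Folding@home Core Shutdown:` line, then a `CoreStatus` line giving the code in hex and decimal. Below is a minimal Python sketch that pulls those three pieces out of a pasted log. The function name `summarize_shutdowns` and the dictionary keys are hypothetical, and the regular expressions are only assumed from the line shapes in this one excerpt, so other client or core versions may format these lines differently.

```python
import re

# Line shapes assumed from the log excerpt in this thread only.
STAMP = r"^\[(\d{2}:\d{2}:\d{2})\] "
FRACTION_RE = re.compile(STAMP + r"Work fraction=([\d.]+) steps=(\d+)")
SHUTDOWN_RE = re.compile(STAMP + r"Folding@home Core Shutdown: (\w+)")
STATUS_RE = re.compile(STAMP + r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def summarize_shutdowns(log_text):
    """Return one dict per core shutdown found in log_text (hypothetical helper)."""
    events = []
    steps_done = None  # estimate taken from the most recent "Work fraction" line
    for line in log_text.splitlines():
        m = FRACTION_RE.match(line)
        if m:
            # fraction * total steps gives an approximate step count
            steps_done = round(float(m.group(2)) * int(m.group(3)))
            continue
        m = SHUTDOWN_RE.match(line)
        if m:
            events.append({
                "time": m.group(1),
                "reason": m.group(2),
                "steps_done": steps_done,
                "status_dec": None,
            })
            steps_done = None
            continue
        m = STATUS_RE.match(line)
        if m and events and events[-1]["status_dec"] is None:
            # Only accept the code if the hex and decimal forms agree.
            if int(m.group(2), 16) == int(m.group(3)):
                events[-1]["status_dec"] = int(m.group(3))
    return events


if __name__ == "__main__":
    sample = "\n".join([
        "[22:12:13] mdrun returned 255",
        "[22:12:13] Work fraction=0.0188 steps=625000.",
        "[22:12:17] Folding@home Core Shutdown: UNSTABLE_MACHINE",
        "[22:12:20] CoreStatus = 7A (122)",
    ])
    for event in summarize_shutdowns(sample):
        print(event)
```

For the excerpt above this reports a single `UNSTABLE_MACHINE` shutdown at 22:12:17 with status 122 (hex 7A) after roughly 11750 of 625000 steps. Normal shutdowns match the same pattern, so filter on `reason` if only early exits are of interest.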

Re: Project: 2633 (Run 8, Clone 24, Gen 6) - UNSTABLE_MACHINE

Posted: Sun Aug 29, 2010 11:05 pm
by sortofageek
It is too soon to tell so far, sorry. The only results I can see at this time are yours.

Hi Napoleon (team 191980),
Your WU (P2633 R8 C24 G6) was added to the stats database on 2010-08-29 15:13:02 for 4.42 points of credit.

Re: Project: 2633 (Run 8, Clone 24, Gen 6) - UNSTABLE_MACHINE

Posted: Mon Aug 30, 2010 4:54 pm
by sortofageek
This work unit was a good one. Another folder was able to complete it for full credit.

Re: Project: 2633 (Run 8, Clone 24, Gen 6) - UNSTABLE_MACHINE

Posted: Mon Aug 30, 2010 7:32 pm
by Napoleon
Interesting... I'll keep an eye on my rig, then. It's old, but still, a workstation board, not overclocked, has ECC memory, temps are well within specifications. Anyway, I tightened the memory timings manually a while back. Have to check if something has appeared in the ECC logs at next reboot. 2633 was the first one to give me trouble, though. And looks like a few other fellow folders have had trouble with 2633, too.

Re: Project: 2633 (Run 8, Clone 24, Gen 6) - UNSTABLE_MACHINE

Posted: Mon Aug 30, 2010 7:37 pm
by sortofageek
Maybe just keep an eye on that one. If this doesn't continue to happen, it may be a glitch unrelated to your equipment.

Re: Project: 2633 (Run 8, Clone 24, Gen 6) - UNSTABLE_MACHINE

Posted: Mon Aug 30, 2010 8:34 pm
by Napoleon
Logs actually showed corrected memory errors happening quite frequently, but I didn't spot a single uncorrectable one, no BSODs either. Back to stock memory timings. At a glance, the constant error corrections are no longer taking place. Will still keep monitoring the memory just in case some mem chip is (going) bad. But certainly looks like the problem really was overly tight memory timings, not a bad WU. Sorry for the hassle, my bad.

I did run memtest86+ for quite a long time before deploying the timings to everyday use, but I guess synthetic tests aren't perfect at catching subtle real world problems. :|

Re: Project: 2633 (Run 8, Clone 24, Gen 6) - UNSTABLE_MACHINE

Posted: Mon Aug 30, 2010 8:45 pm
by sortofageek
Excellent troubleshooting. Always suspect the last change made before trouble began. Hope this resolves your issues.

Re: Project: 2633 (Run 8, Clone 24, Gen 6) - UNSTABLE_MACHINE

Posted: Tue Aug 31, 2010 12:16 am
by PantherX
Napoleon wrote: ...I did run memtest86+ for quite a long time before deploying the timings to everyday use, but I guess synthetic tests aren't perfect at catching subtle real world problems. :|
I recently ran v4.10 for 2 passes (that's the minimum) without any errors. However, the same machine would fail IBT @ 2048 MB. I did a little research and found that Memtest86+ is actually used for testing for hardware faults like bad modules, slots, etc. I also read that just because RAM passes Memtest86+, it doesn't necessarily mean that it is stable in Windows. Once I found this out, I loosened the RAM timings and it passed IBT @ 2048 MB. However, under some RAM timings, IBT @ 1024 MB and 2048 MB were stable for 5 or 10 runs, but when I went for Maximum (>3 GB), it would fail. Although I was able to fold a single bigadv WU while the machine was unstable at Maximum, I decided that the risk was too high and tweaked the timings so that my system is stable at IBT @ Maximum.

Re: Project: 2633 (Run 8, Clone 24, Gen 6) - UNSTABLE_MACHINE

Posted: Tue Aug 31, 2010 7:52 am
by bruce
Napoleon wrote: Interesting... I'll keep an eye on my rig, then. It's old, but still, a workstation board, not overclocked, has ECC memory, temps are well within specifications.
According to your profile this is a dual Opteron. Which model?

It has full support for both SSE and SSE2, right?

Re: Project: 2633 (Run 8, Clone 24, Gen 6) - UNSTABLE_MACHINE

Posted: Tue Aug 31, 2010 8:52 am
by Napoleon
Here's what CPU-Z says. I have 2 of these; the Tyan Thunder K8W is a dual-socket motherboard.

[Image: CPU-Z screenshot]

EDIT: I checked back through the logs. Apparently I tightened the memory timings on 9th August 2010; that's the earliest occurrence of memory error corrections. I haven't seen any since reverting to stock memory timings. Anyway, this was the first obvious sign of trouble after running FAH for 3 weeks, so I've completed WUs other than 2633 successfully with the tighter timings. Of course, there shouldn't be any ECC events happening at all, or at least they should be extremely rare.