
Project: 5757 (Run 10, Clone 600, Gen 3)

Posted: Sat Jun 27, 2009 1:22 pm
by bollix47
FYI

This particular RCG is giving me an UNSTABLE_MACHINE error. The GPU (GTX 295) has completed other P5757 WUs without problems and over 1000 WUs in total, so I'm fairly certain that the card is okay. The WU failed enough times in a row to trigger a 24-hour pause. After the pause the client ran a different WU successfully, but it then tried this WU again and got the same error. I had to delete the WU a few times and add -advmethods to the command line before I could get a different WU.

Unfortunately, the WU crashes before it has produced anything to send back, so the server gets no indication that it should stop sending out this WU.


Code:

[12:25:44] + Processing work unit
[12:25:44] Core required: FahCore_11.exe
[12:25:44] Core found.
[12:25:44] Working on queue slot 01 [June 27 12:25:44 UTC]
[12:25:44] + Working ...
[12:25:44] - Calling '.\FahCore_11.exe -dir work/ -suffix 01 -priority 96 -nocpulock -checkpoint 30 -verbose -lifeline 67 -version 623'

[12:25:44]
[12:25:44] *------------------------------*
[12:25:44] Folding@Home GPU Core - Beta
[12:25:44] Version 1.19 (Mon Nov 3 09:34:13 PST 2008)
[12:25:44]
[12:25:44] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[12:25:44] Build host: amoeba
[12:25:44] Board Type: Nvidia
[12:25:44] Core      :
[12:25:44] Preparing to commence simulation
[12:25:44] - Looking at optimizations...
[12:25:44] - Created dyn
[12:25:44] - Files status OK
[12:25:44] - Expanded 70666 -> 360060 (decompressed 509.5 percent)
[12:25:44] Called DecompressByteArray: compressed_data_size=70666 data_size=360060, decompressed_data_size=360060 diff=0
[12:25:44] - Digital signature verified
[12:25:44]
[12:25:44] Project: 5757 (Run 10, Clone 600, Gen 3)
[12:25:44]
[12:25:44] Assembly optimizations on if available.
[12:25:44] Entering M.D.
[12:25:51] Working on Protein
[12:25:52] Client config found, loading data.
[12:25:52] Starting GUI Server
[12:25:52] mdrun_gpu returned
[12:25:52] NANs detected on GPU
[12:25:52]
[12:25:52] Folding@home Core Shutdown: UNSTABLE_MACHINE
[12:25:54] CoreStatus = 7A (122)
[12:25:54] Sending work to server
[12:25:54] Project: 5757 (Run 10, Clone 600, Gen 3)
[12:25:54] - Read packet limit of 540015616... Set to 524286976.
[12:25:54] - Error: Could not get length of results file work/wuresults_01.dat
[12:25:54] - Error: Could not read unit 01 file. Removing from queue.
[12:25:54] Trying to send all finished work units
[12:25:54] + No unsent completed units remaining.
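
For anyone who wants to check their own log for the same failure, here is a rough Python sketch that scans log text for a "Core Shutdown" status and reports the Project/Run/Clone/Gen that preceded it. The regular expressions are based only on the lines quoted above, so other client or core versions may word things differently. The function name `find_failed_units` and the `sample` text are made up for illustration and are not part of the client.

```python
import re

# Patterns taken from the log lines quoted in this thread (assumption:
# other client/core versions may format these lines differently).
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")


def find_failed_units(log_text, status="UNSTABLE_MACHINE"):
    """Return unique (project, run, clone, gen) tuples whose core shut down with `status`."""
    failed = []
    current = None  # most recent PRCG seen before a shutdown line
    for line in log_text.splitlines():
        match = PRCG_RE.search(line)
        if match:
            current = tuple(int(group) for group in match.groups())
            continue
        match = SHUTDOWN_RE.search(line)
        if match and match.group(1) == status and current is not None:
            if current not in failed:
                failed.append(current)
    return failed


# Hypothetical sample built from a few of the lines above.
sample = (
    "[12:25:44] Project: 5757 (Run 10, Clone 600, Gen 3)\n"
    "[12:25:52] NANs detected on GPU\n"
    "[12:25:52] Folding@home Core Shutdown: UNSTABLE_MACHINE\n"
    "[12:25:54] CoreStatus = 7A (122)\n"
)

print(find_failed_units(sample))

# The CoreStatus line shows the same value twice: hex 7A is decimal 122.
print(int("7A", 16))
```

In practice you would read the text from your own log file instead of `sample`. If the same PRCG keeps showing up, that suggests the WU rather than the card is at fault.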


Re: Project: 5757 (Run 10, Clone 600, Gen 3)

Posted: Sat Jun 27, 2009 7:29 pm
by OldChap
I too have had this one (twice today), with the same outcome as bollix47.

Re: Project: 5757 (Run 10, Clone 600, Gen 3)

Posted: Mon Jun 29, 2009 3:29 am
by bruce
I've notified the appropriate person that the WU is apparently bad.

Re: Project: 5757 (Run 10, Clone 600, Gen 3)

Posted: Mon Jun 29, 2009 7:36 pm
by vvoelz
Bollix, bruce:

As you suspected, the WU was bad (the last frame returned had instability). I've stopped that clone from being released in the future, and will yank it from the job stack too.

Thanks,
Vince