Project: 5757 (Run 10, Clone 600, Gen 3)

Moderators: Site Moderators, FAHC Science Team

bollix47
Posts: 2953
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Post by bollix47 »

FYI

This particular RCG is giving me an UNSTABLE_MACHINE error. The GPU (GTX 295) has done other P5757 WUs without problems and has completed over 1000 WUs in total, so I'm fairly certain that the card is okay. The WU failed enough times in a row that the client paused for 24 hours. After the pause it ran a different WU successfully, but then it picked up this WU again and hit the same error. I had to delete the WU a few times and add -advmethods to the command line before I could get a different WU.

Unfortunately, the WU crashes before the core has anything to send back to the server, so the server gets no indication that it should stop sending out this WU.


Code:

[12:25:44] + Processing work unit
[12:25:44] Core required: FahCore_11.exe
[12:25:44] Core found.
[12:25:44] Working on queue slot 01 [June 27 12:25:44 UTC]
[12:25:44] + Working ...
[12:25:44] - Calling '.\FahCore_11.exe -dir work/ -suffix 01 -priority 96 -nocpulock -checkpoint 30 -verbose -lifeline 67 -version 623'
[12:25:44]
[12:25:44] *------------------------------*
[12:25:44] Folding@Home GPU Core - Beta
[12:25:44] Version 1.19 (Mon Nov 3 09:34:13 PST 2008)
[12:25:44]
[12:25:44] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[12:25:44] Build host: amoeba
[12:25:44] Board Type: Nvidia
[12:25:44] Core      :
[12:25:44] Preparing to commence simulation
[12:25:44] - Looking at optimizations...
[12:25:44] - Created dyn
[12:25:44] - Files status OK
[12:25:44] - Expanded 70666 -> 360060 (decompressed 509.5 percent)
[12:25:44] Called DecompressByteArray: compressed_data_size=70666 data_size=360060, decompressed_data_size=360060 diff=0
[12:25:44] - Digital signature verified
[12:25:44]
[12:25:44] Project: 5757 (Run 10, Clone 600, Gen 3)
[12:25:44]
[12:25:44] Assembly optimizations on if available.
[12:25:44] Entering M.D.
[12:25:51] Working on Protein
[12:25:52] Client config found, loading data.
[12:25:52] Starting GUI Server
[12:25:52] mdrun_gpu returned
[12:25:52] NANs detected on GPU
[12:25:52]
[12:25:52] Folding@home Core Shutdown: UNSTABLE_MACHINE
[12:25:54] CoreStatus = 7A (122)
[12:25:54] Sending work to server
[12:25:54] Project: 5757 (Run 10, Clone 600, Gen 3)
[12:25:54] - Read packet limit of 540015616... Set to 524286976.
[12:25:54] - Error: Could not get length of results file work/wuresults_01.dat
[12:25:54] - Error: Could not read unit 01 file. Removing from queue.
[12:25:54] Trying to send all finished work units
[12:25:54] + No unsent completed units remaining.
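
For anyone who wants to check whether their own log shows the same WU failing, here is a rough Python sketch that pulls the Project/Run/Clone/Gen, the core shutdown reason, and the CoreStatus code out of log text. The function name `summarize_log` and the regular expressions are my own assumptions based only on the lines in the excerpt above; other client or core versions may word these lines differently. Note that the log prints the status in hex followed by decimal, so 7A and 122 are the same value.

```python
import re

# Patterns assumed from the log excerpt above; other client versions may differ.
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def summarize_log(text):
    """Return the last PRCG seen, the core shutdown reason, and the status code."""
    summary = {"prcg": None, "shutdown": None, "status": None}
    for line in text.splitlines():
        m = PRCG_RE.search(line)
        if m:
            # (project, run, clone, gen) as integers
            summary["prcg"] = tuple(int(g) for g in m.groups())
        m = SHUTDOWN_RE.search(line)
        if m:
            summary["shutdown"] = m.group(1)
        m = STATUS_RE.search(line)
        if m:
            # The log prints the code in hex, then in decimal in parentheses.
            summary["status"] = int(m.group(1), 16)
    return summary


sample = """\
[12:25:44] Project: 5757 (Run 10, Clone 600, Gen 3)
[12:25:52] NANs detected on GPU
[12:25:52] Folding@home Core Shutdown: UNSTABLE_MACHINE
[12:25:54] CoreStatus = 7A (122)
"""

result = summarize_log(sample)
print(result)
```

Running this over a whole log file instead of `sample` would report only the most recent WU, since later matches overwrite earlier ones; collecting one summary per "Processing work unit" line would be the obvious extension.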

OldChap
Posts: 3
Joined: Thu Jan 01, 2009 10:27 am

Re: Project: 5757 (Run 10, Clone 600, Gen 3)

Post by OldChap »

I too have had this one (twice today), with the same outcome as bollix47.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 5757 (Run 10, Clone 600, Gen 3)

Post by bruce »

I've notified the appropriate person that the WU is apparently bad.
vvoelz
Pande Group Member
Posts: 543
Joined: Sun Dec 02, 2007 8:07 pm
Location: Temple University, Philadelphia PA

Re: Project: 5757 (Run 10, Clone 600, Gen 3)

Post by vvoelz »

Bollix, bruce:

As you suspected, the WU was bad (the last frame returned was unstable). I've stopped that clone from being released in the future, and will yank it from the job stack too.

Thanks,
Vince