Project: 5775 (Run 4, Clone 12, Gen 35) : UNSTABLE_MACHINE

Moderators: Site Moderators, FAHC Science Team

toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Post by toTOW »

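This WU failed five times in a row: four times with "NANs detected on GPU" and once with "Self-test failure". After the fifth failure the client hit the EUE limit and paused for 24 hours.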

Code:

[10:58:01] Project: 5775 (Run 4, Clone 12, Gen 35)
[10:58:01] 
[10:58:01] Assembly optimizations on if available.
[10:58:01] Entering M.D.
[10:58:07] Working on Protein
[10:58:09] Client config found, loading data.
[10:58:09] Starting GUI Server
[10:59:48] Completed 1%
[10:59:48] mdrun_gpu returned 
[10:59:48] NANs detected on GPU
[10:59:48] 
[10:59:48] Folding@home Core Shutdown: UNSTABLE_MACHINE
[10:59:51] CoreStatus = 7A (122)
[10:59:51] Sending work to server
[10:59:51] Project: 5775 (Run 4, Clone 12, Gen 35)
[10:59:51] - Read packet limit of 540015616... Set to 524286976.
[10:59:51] - Error: Could not get length of results file work/wuresults_05.dat
[10:59:51] - Error: Could not read unit 05 file. Removing from queue.
[...]
[10:59:59] Project: 5775 (Run 4, Clone 12, Gen 35)
[10:59:59] 
[10:59:59] Assembly optimizations on if available.
[10:59:59] Entering M.D.
[11:00:05] Working on Protein
[11:00:07] Client config found, loading data.
[11:00:07] Starting GUI Server
[11:01:46] Completed 1%
[11:01:46] mdrun_gpu returned 
[11:01:46] NANs detected on GPU
[11:01:46] 
[11:01:46] Folding@home Core Shutdown: UNSTABLE_MACHINE
[11:01:49] CoreStatus = 7A (122)
[11:01:49] Sending work to server
[11:01:49] Project: 5775 (Run 4, Clone 12, Gen 35)
[11:01:49] - Read packet limit of 540015616... Set to 524286976.
[11:01:49] - Error: Could not get length of results file work/wuresults_06.dat
[11:01:49] - Error: Could not read unit 06 file. Removing from queue.
[...]
[11:01:58] Project: 5775 (Run 4, Clone 12, Gen 35)
[11:01:58] 
[11:01:58] Assembly optimizations on if available.
[11:01:58] Entering M.D.
[11:02:04] Working on Protein
[11:02:06] Client config found, loading data.
[11:02:06] Starting GUI Server
[11:04:05] Completed 1%
[11:06:03] Completed 2%
[11:07:00] mdrun_gpu returned 
[11:07:00] NANs detected on GPU
[11:07:00] 
[11:07:00] Folding@home Core Shutdown: UNSTABLE_MACHINE
[11:07:04] CoreStatus = 7A (122)
[11:07:04] Sending work to server
[11:07:04] Project: 5775 (Run 4, Clone 12, Gen 35)
[11:07:04] - Read packet limit of 540015616... Set to 524286976.
[11:07:04] - Error: Could not get length of results file work/wuresults_07.dat
[11:07:04] - Error: Could not read unit 07 file. Removing from queue.
[...]
[11:07:12] Project: 5775 (Run 4, Clone 12, Gen 35)
[11:07:12] 
[11:07:12] Assembly optimizations on if available.
[11:07:12] Entering M.D.
[11:07:19] Working on Protein
[11:07:21] mdrun_gpu returned 
[11:07:21] Self-test failure
[11:07:21] 
[11:07:21] Folding@home Core Shutdown: UNSTABLE_MACHINE
[11:07:25] CoreStatus = 7A (122)
[11:07:25] Sending work to server
[11:07:25] Project: 5775 (Run 4, Clone 12, Gen 35)
[11:07:25] - Read packet limit of 540015616... Set to 524286976.
[11:07:25] - Error: Could not get length of results file work/wuresults_08.dat
[11:07:25] - Error: Could not read unit 08 file. Removing from queue.
[...]
[11:07:33] Project: 5775 (Run 4, Clone 12, Gen 35)
[11:07:33] 
[11:07:33] Assembly optimizations on if available.
[11:07:33] Entering M.D.
[11:07:40] Working on Protein
[11:07:42] Client config found, loading data.
[11:07:42] Starting GUI Server
[11:09:16] Completed 1%
[11:09:16] mdrun_gpu returned 
[11:09:16] NANs detected on GPU
[11:09:16] 
[11:09:16] Folding@home Core Shutdown: UNSTABLE_MACHINE
[11:09:19] CoreStatus = 7A (122)
[11:09:19] Sending work to server
[11:09:19] Project: 5775 (Run 4, Clone 12, Gen 35)
[11:09:19] - Read packet limit of 540015616... Set to 524286976.
[11:09:19] - Error: Could not get length of results file work/wuresults_09.dat
[11:09:19] - Error: Could not read unit 09 file. Removing from queue.
[11:09:19] EUE limit exceeded. Pausing 24 hours.
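
For anyone who wants to tally failures like these in their own log, here is a minimal Python sketch. It assumes the log uses the same "[HH:MM:SS] message" layout and the same error strings as the excerpt above; the function name summarize_failures and the sample text are made up for illustration and are not part of any Folding@home tool.

```python
import re
from collections import Counter

# Matches a timestamped line of the form "[HH:MM:SS] message".
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s?(.*)$")

# Error strings as they appear in the log excerpt above (assumed exhaustive
# only for this illustration).
KNOWN_ERRORS = ("NANs detected on GPU", "Self-test failure")


def summarize_failures(log_text):
    """Count UNSTABLE_MACHINE shutdowns and the error logged before each one."""
    reasons = Counter()
    shutdowns = 0
    last_error = None
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skips elisions such as "[...]" and untimestamped lines
        message = match.group(2).strip()
        if message in KNOWN_ERRORS:
            last_error = message
        elif message.endswith("Core Shutdown: UNSTABLE_MACHINE"):
            shutdowns += 1
            reasons[last_error or "unknown"] += 1
            last_error = None
    return shutdowns, dict(reasons)


# Hypothetical two-failure sample in the same layout as the log above.
sample = "\n".join([
    "[10:59:48] mdrun_gpu returned",
    "[10:59:48] NANs detected on GPU",
    "[10:59:48]",
    "[10:59:48] Folding@home Core Shutdown: UNSTABLE_MACHINE",
    "[...]",
    "[11:07:21] mdrun_gpu returned",
    "[11:07:21] Self-test failure",
    "[11:07:21]",
    "[11:07:21] Folding@home Core Shutdown: UNSTABLE_MACHINE",
])

if __name__ == "__main__":
    print(summarize_failures(sample))
```

Run over a full log file, this gives a quick count of how many shutdowns came from NaNs versus self-test failures, which makes it easier to compare notes across machines.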

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
toTOW

Re: Project: 5775 (Run 4, Clone 12, Gen 35) : UNSTABLE_MACHINE

Post by toTOW »

Well ... someone else was able to complete it successfully, so the issue is probably with my board :(
hrsetrdr
Posts: 112
Joined: Sun Dec 02, 2007 4:29 pm
Location: In the Fold somewhere in SoCal.

Re: Project: 5775 (Run 4, Clone 12, Gen 35) : UNSTABLE_MACHINE

Post by hrsetrdr »

toTOW wrote: Well ... someone else was able to complete it successfully, so the issue is probably with my board :(
You are not alone; I have a couple of machines NANing on numerous clones/gens of p5775. I'm starting to track time as I am off work today. Others on my home forum are experiencing NANs too, so I'll see if we can get some more specific information.
Folding rig: Supermicro X9DRD-7LN4F-JBOD | (2) Xeon E5-2670 | 128GB DDR3 ECC Registered

Install Folding@Home on Linux without Python dependency issues
Jolly-Swagman
Posts: 11
Joined: Tue Jul 01, 2008 9:18 am

Re: Project: 5775 (Run 4, Clone 12, Gen 35) : UNSTABLE_MACHINE

Post by Jolly-Swagman »

Yes, you're certainly not alone. Same here with a lot of these 5775 WUs, and others from the 57** series too. It's not just on one rig either: multiple rigs are now failing 57** series WUs.