57** series WU,s Nans Detected(possible Solution)

Moderators: Site Moderators, FAHC Science Team

Jolly-Swagman
Posts: 11
Joined: Tue Jul 01, 2008 9:18 am

57** series WU,s Nans Detected(possible Solution)

Post by Jolly-Swagman »

As posted earlier, a lot of the 57** series WUs are failing with "NANs detected on GPU", others with "Nonzero force sum on GPU", and I have also found some with GuardedRun errors.

Too damn many for my liking; it is a waste of resources.

Code: Select all

[04:23:45] Project: 5779 (Run 2, Clone 244, Gen 0)
[04:23:45] 
[04:23:45] Assembly optimizations on if available.
[04:23:45] Entering M.D.
[04:23:52] Working on Protein
[04:23:54] Client config found, loading data.
[04:23:54] Starting GUI Server
[04:26:18] Completed 1%
[04:28:42] Completed 2%
[04:31:06] Completed 3%
[04:33:30] Completed 4%
[04:35:54] Completed 5%
[04:38:18] Completed 6%
[04:40:42] Completed 7%
[04:43:06] Completed 8%
[04:45:30] Completed 9%
[04:47:54] Completed 10%
[04:50:18] Completed 11%
[04:52:08] Completed 12%
[04:52:08] mdrun_gpu returned 
[04:52:08] NANs detected on GPU

Another Client
[13:20:32] Project: 5780 (Run 9, Clone 271, Gen 0)
[13:20:32] 
[13:20:33] Assembly optimizations on if available.
[13:20:33] Entering M.D.
[13:20:39] Working on Protein
[13:20:41] Client config found, loading data.
[13:20:41] Starting GUI Server
[13:23:05] Completed 1%
[13:25:29] Completed 2%
[13:27:40] Completed 3%
[13:27:40] mdrun_gpu returned 
[13:27:40] NANs detected on GPU

[13:27:51] Project: 5780 (Run 9, Clone 271, Gen 0)
[13:27:51] 
[13:27:51] Assembly optimizations on if available.
[13:27:51] Entering M.D.
[13:27:58] Working on Protein
[13:28:00] Client config found, loading data.
[13:28:00] Starting GUI Server
[13:30:09] - Autosending finished units... [April 8 13:30:09 UTC]
[13:30:09] Trying to send all finished work units
[13:30:09] + No unsent completed units remaining.
[13:30:09] - Autosend completed
[13:30:09] + Working...
[13:30:24] Completed 1%
[13:32:48] Completed 2%
[13:35:12] Completed 3%
[13:37:11] Completed 4%
[13:37:11] mdrun_gpu returned 
[13:37:11] NANs detected on GPU

---------------
[15:07:02] Project: 5751 (Run 6, Clone 108, Gen 293)
[15:07:02] 
[15:07:03] Assembly optimizations on if available.
[15:07:03] Entering M.D.
[15:07:10] Working on Protein
[15:07:15] Client config found, loading data.
[15:07:15] Starting GUI Server
[15:09:19] Completed 1%
[15:11:23] Completed 2%
[15:13:26] Completed 3%
[15:15:30] Completed 4%
[15:17:34] Completed 5%
[15:19:38] Completed 6%
[15:21:42] Completed 7%
[15:23:46] Completed 8%
[15:25:49] Completed 9%
[15:27:45] Completed 10%
[15:27:45] mdrun_gpu returned 
[15:27:45] NANs detected on GPU

----------------------------------
[20:08:12] Project: 5777 (Run 7, Clone 286, Gen 0)
[20:08:12] 
[20:08:12] Assembly optimizations on if available.
[20:08:12] Entering M.D.
[20:08:19] Working on Protein
[20:08:21] Client config found, loading data.
[20:08:21] Starting GUI Server
[20:10:45] Completed 1%
[20:13:09] Completed 2%
[20:15:33] Completed 3%
[20:17:57] Completed 4%
[20:20:21] Completed 5%
[20:22:34] Completed 6%
[20:22:34] mdrun_gpu returned 
[20:22:34] NANs detected on GPU


Code: Select all

[15:10:41] Project: 5779 (Run 2, Clone 244, Gen 0)
[15:10:41] 
[15:10:41] Assembly optimizations on if available.
[15:10:41] Entering M.D.
[15:10:50] Working on Protein
[15:11:26] Client config found, loading data.
[15:11:27] Starting GUI Server
[15:19:36] Completed 1%
[15:27:40] Completed 2%
[15:35:42] Completed 3%
---------------------------
[09:36:45] Completed 68%
[09:39:09] Completed 69%
[09:41:34] Completed 70%
[09:41:34] mdrun_gpu returned 
[09:41:34] Nonzero force sum on GPU

 Another Machine same client

[03:23:10] Completed 42%
[03:25:34] Completed 43%
[03:27:58] Completed 44%
[03:30:22] Completed 45%
[03:32:46] Completed 46%
[03:32:47] mdrun_gpu returned 
[03:32:47] Nonzero force sum on GPU
Last edited by Jolly-Swagman on Sat Apr 18, 2009 7:30 am, edited 1 time in total.
toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

Re: 57** series WU,s Nans Detected

Post by toTOW »

Strange, you completed this one fine:

Hi JollySwagman (team 32),
Your WU (P5779 R2 C244 G0) was added to the stats database on 2009-04-08 02:14:31 for 768 points of credit.

Project: 5780 (Run 9, Clone 271, Gen 0) has been completed successfully by another donor.
Project: 5751 (Run 6, Clone 108, Gen 293) has been completed successfully by another donor.

There's no data for Project: 5777 (Run 7, Clone 286, Gen 0) in the DB yet ...

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
Jolly-Swagman
Posts: 11
Joined: Tue Jul 01, 2008 9:18 am

Re: 57** series WU,s Nans Detected

Post by Jolly-Swagman »

Hi toTOW,
Yes, my 9800GT 1GB GPU has completed P5779 and has also successfully completed P5777.

But on my 8800GT 512MB:

Code: Select all

[15:12:02] - Digital signature verified
[15:12:02] 
[15:12:02] Project: 5777 (Run 4, Clone 17, Gen 3)
[15:12:02] 
[15:12:02] Assembly optimizations on if available.
[15:12:02] Entering M.D.
[15:12:09] Working on Protein
[15:12:12] Client config found, loading data.
[15:12:12] Starting GUI Server
[15:14:46] Completed 1%
[15:17:20] Completed 2%
[15:19:54] Completed 3%
[15:22:29] Completed 4%
[15:25:03] Completed 5%
[15:27:37] Completed 6%
[15:30:11] Completed 7%
[15:32:45] Completed 8%
[15:35:20] Completed 9%
[15:37:54] Completed 10%
[15:40:28] Completed 11%
[15:43:02] Completed 12%
[15:45:36] Completed 13%
[15:48:10] Completed 14%
[15:50:44] Completed 15%
[15:53:19] Completed 16%
[15:55:53] Completed 17%
[15:58:27] Completed 18%
[16:01:01] Completed 19%
[16:03:35] Completed 20%
[16:06:10] Completed 21%
[16:08:44] Completed 22%
[16:11:18] Completed 23%
[16:13:52] Completed 24%
[16:16:26] Completed 25%
[16:19:00] Completed 26%
[16:21:35] Completed 27%
[16:24:09] Completed 28%
[16:26:43] Completed 29%
[16:29:17] Completed 30%
[16:31:51] Completed 31%
[16:34:25] Completed 32%
[16:37:00] Completed 33%
[16:39:34] Completed 34%
[16:42:08] Completed 35%
[16:44:42] Completed 36%
[16:47:16] Completed 37%
[16:49:50] Completed 38%
[16:52:25] Completed 39%
[16:54:59] Completed 40%
[16:57:33] Completed 41%
[17:00:07] Completed 42%
[17:02:41] Completed 43%
[17:05:15] Completed 44%
[17:07:50] Completed 45%
[17:10:24] Completed 46%
[17:12:58] Completed 47%
[17:15:32] Completed 48%
[17:18:06] Completed 49%
[17:20:41] Completed 50%
[17:23:15] Completed 51%
[17:25:49] Completed 52%
[17:28:23] Completed 53%
[17:30:57] Completed 54%
[17:33:31] Completed 55%
[17:36:05] Completed 56%
[17:38:40] Completed 57%
[17:41:14] Completed 58%
[17:43:48] Completed 59%
[17:46:22] Completed 60%
[17:48:56] Completed 61%
[17:51:30] Completed 62%
[17:54:04] Completed 63%
[17:56:39] Completed 64%
[17:59:13] Completed 65%
[18:01:47] Completed 66%
[18:04:21] Completed 67%
[18:06:55] Completed 68%
[18:09:29] Completed 69%
[18:12:03] Completed 70%
[18:14:38] Completed 71%
[18:17:12] Completed 72%
[18:19:46] Completed 73%
[18:22:20] Completed 74%
[18:24:54] Completed 75%
[18:27:28] Completed 76%
[18:30:03] Completed 77%
[18:32:37] Completed 78%
[18:35:11] Completed 79%
[18:37:45] Completed 80%
[18:40:19] Completed 81%
[18:42:53] Completed 82%
[18:45:28] Completed 83%
[18:48:02] Completed 84%
[18:50:36] Completed 85%
[18:53:10] Completed 86%
[18:55:44] Completed 87%
[18:58:18] Completed 88%
[19:00:53] Completed 89%
[19:03:27] Completed 90%
[19:06:01] Completed 91%
[19:08:35] Completed 92%
[19:11:09] Completed 93%
[19:13:44] Completed 94%
[19:16:18] Completed 95%
[19:18:52] Completed 96%
[19:19:03] Run: exception thrown during GuardedRun
[19:19:03] Run: exception thrown in GuardedRun -- Gromacs cannot continue further.
[19:19:03] Going to send back what have done -- stepsTotalG=20000000
[19:19:03] Work fraction=0.9607 steps=20000000.
[19:19:07] logfile size=30732 infoLength=30732 edr=0 trr=23
[19:19:07] - Writing 31268 bytes of core data to disk...
[19:19:07] Done: 30756 -> 5953 (compressed to 19.3 percent)
[19:19:07]   ... Done.
Current WU

Code: Select all

[01:47:04] Project: 5774 (Run 14, Clone 38, Gen 6)
[01:47:04] 
[01:47:04] Assembly optimizations on if available.
[01:47:04] Entering M.D.
[01:47:10] Working on Protein
[01:47:12] Client config found, loading data.
[01:47:13] Starting GUI Server
[01:49:47] Completed 1%
[01:52:21] Completed 2%
[01:54:55] Completed 3%
[01:57:29] Completed 4%
[02:00:04] Completed 5%
[02:02:38] Completed 6%
[02:05:12] Completed 7%
[02:07:46] Completed 8%
[02:10:20] Completed 9%
[02:12:55] Completed 10%
[02:15:29] Completed 11%
[02:18:03] Completed 12%
[02:20:37] Completed 13%
[02:23:12] Completed 14%
[02:25:46] Completed 15%
[02:28:20] Completed 16%
[02:30:54] Completed 17%
[02:33:28] Completed 18%
[02:36:03] Completed 19%
[02:38:37] Completed 20%
[02:41:11] Completed 21%
[02:43:45] Completed 22%
[02:46:20] Completed 23%
[02:48:54] Completed 24%
[02:51:28] Completed 25%
[02:54:02] Completed 26%
[02:56:37] Completed 27%
[02:59:11] Completed 28%
[03:01:45] Completed 29%
[03:04:19] Completed 30%
[03:06:53] Completed 31%
[03:09:28] Completed 32%
[03:12:02] Completed 33%
[03:14:36] Completed 34%
[03:17:10] Completed 35%
[03:19:45] Completed 36%
[03:22:19] Completed 37%
[03:24:53] Completed 38%
[03:27:27] Completed 39%
[03:30:01] Completed 40%
[03:32:36] Completed 41%
[03:35:10] Completed 42%
[03:37:44] Completed 43%
[03:40:18] Completed 44%
[03:42:53] Completed 45%
[03:45:27] Completed 46%
[03:48:01] Completed 47%
[03:50:35] Completed 48%
[03:53:10] Completed 49%
[03:55:44] Completed 50%
[03:58:18] Completed 51%
[04:00:52] Completed 52%
[04:03:27] Completed 53%
[04:06:01] Completed 54%
[04:08:35] Completed 55%
[04:11:09] Completed 56%
[04:13:43] Completed 57%
[04:16:18] Completed 58%
[04:18:52] Completed 59%
[04:21:26] Completed 60%
[04:24:00] Completed 61%
[04:26:34] Completed 62%
[04:29:09] Completed 63%
[04:31:43] Completed 64%
[04:34:17] Completed 65%
[04:36:51] Completed 66%
[04:39:26] Completed 67%
[04:42:00] Completed 68%
[04:44:34] Completed 69%
[04:47:08] Completed 70%
[04:49:43] Completed 71%
[04:52:17] Completed 72%
[04:54:51] Completed 73%
[04:57:25] Completed 74%
[04:59:59] Completed 75%
[05:02:34] Completed 76%
[05:05:08] Completed 77%
[05:07:42] Completed 78%
[05:10:16] Completed 79%
[05:12:50] Completed 80%
[05:15:25] Completed 81%
[05:17:59] Completed 82%
[05:20:33] Completed 83%
[05:23:07] Completed 84%
[05:25:42] Completed 85%
[05:28:16] Completed 86%
[05:30:50] Completed 87%
[05:33:25] Completed 88%
[05:35:59] Completed 89%
My 8800GT 256MB

Code: Select all

[10:30:15] Project: 5751 (Run 6, Clone 108, Gen 293)
[10:30:15] 
[10:30:15] Assembly optimizations on if available.
[10:30:15] Entering M.D.
[10:30:21] Working on Protein
[10:30:26] Client config found, loading data.
[10:30:26] Starting GUI Server
[10:32:31] Completed 1%
[10:32:31] mdrun_gpu returned 
[10:32:31] Nonzero force sum on GPU

next
[10:32:41] Project: 5751 (Run 6, Clone 108, Gen 293)
[10:32:41] 
[10:32:41] Assembly optimizations on if available.
[10:32:41] Entering M.D.
[10:32:48] Working on Protein
[10:32:53] Client config found, loading data.
[10:32:53] Starting GUI Server
[10:34:57] Completed 1%
[10:37:01] Completed 2%
[10:39:05] Completed 3%
[10:41:08] Completed 4%
[10:43:12] Completed 5%
[10:43:12] mdrun_gpu returned 
[10:43:12] Nonzero force sum on GPU

next
[10:43:25] Project: 5751 (Run 6, Clone 108, Gen 293)
[10:43:25] 
[10:43:25] Assembly optimizations on if available.
[10:43:25] Entering M.D.
[10:43:31] Working on Protein
[10:43:36] Client config found, loading data.
[10:43:36] Starting GUI Server
[10:45:40] Completed 1%
[10:47:44] Completed 2%
[10:49:48] Completed 3%
[10:51:51] Completed 4%
[10:53:55] Completed 5%
[10:55:59] Completed 6%
[10:58:03] Completed 7%
[11:00:07] Completed 8%
[11:02:10] Completed 9%
[11:04:14] Completed 10%
[11:06:18] Completed 11%
[11:08:22] Completed 12%
[11:10:26] Completed 13%
[11:12:30] Completed 14%
[11:14:33] Completed 15%
[11:16:37] Completed 16%
[11:18:41] Completed 17%
[11:20:17] Completed 18%
[11:20:17] mdrun_gpu returned 
[11:20:17] NANs detected on GPU
[11:20:17] 


Then I just shut it down; I got sick of seeing the same result.

So there is obviously a problem with some of the P57** series WUs.

And it sucks big time to get to 96% and then go belly up, so what is going on with these WUs?
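
To see at a glance whether the same WU keeps failing or the errors are spread across many WUs, the client log can be tallied by Project/Run/Clone/Gen. The following is only a minimal sketch: the function name `tally_failures` and the sample text are made up for illustration, and the three failure strings are simply the messages that appear in the logs quoted in this thread, not an exhaustive list of core errors.

```python
import re
from collections import Counter

# Matches the "Project: 5751 (Run 6, Clone 108, Gen 293)" lines seen in the logs.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")

# Failure messages copied from the logs in this thread (assumed, not exhaustive).
FAILURES = (
    "NANs detected on GPU",
    "Nonzero force sum on GPU",
    "exception thrown during GuardedRun",
)


def tally_failures(log_text):
    """Count failures per (project, run, clone, gen, message)."""
    counts = Counter()
    current = None  # the WU most recently announced in the log
    for line in log_text.splitlines():
        match = PROJECT_RE.search(line)
        if match:
            current = tuple(int(g) for g in match.groups())
            continue
        if current is None:
            continue
        for message in FAILURES:
            if message in line:
                counts[current + (message,)] += 1
    return counts


# Shortened sample based on the 8800GT log above.
SAMPLE = """\
[10:30:15] Project: 5751 (Run 6, Clone 108, Gen 293)
[10:32:31] Completed 1%
[10:32:31] Nonzero force sum on GPU
[10:32:41] Project: 5751 (Run 6, Clone 108, Gen 293)
[10:43:12] Completed 5%
[10:43:12] Nonzero force sum on GPU
[10:43:25] Project: 5751 (Run 6, Clone 108, Gen 293)
[11:20:17] Completed 18%
[11:20:17] NANs detected on GPU
"""

if __name__ == "__main__":
    for key, count in sorted(tally_failures(SAMPLE).items()):
        print(key, count)
```

A WU that fails repeatedly on one card but completes on another card, or for another donor, points more towards the machine than towards the WU itself.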
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 57** series WU,s Nans Detected

Post by bruce »

Jolly-Swagman wrote: So there is obviously a problem with some of the P57** series WUs
That is one possibility, but it's certainly not obvious. Perhaps these WUs are just slightly more efficient than the ones you were running when you last adjusted the overclocking of your GPU. If the better utilization of the GPU has pushed you into an unstable condition, the solution is to reduce the overclock, provide better cooling, or do whatever it takes to make the hardware stable.

That's just as "obvious" a possibility -- and when you got an error and somebody else completed the same WU, it's somewhat more likely than the series of WUs being the problem.
Jolly-Swagman
Posts: 11
Joined: Tue Jul 01, 2008 9:18 am

Re: 57** series WU,s Nans Detected

Post by Jolly-Swagman »

Well, the ironic thing is that the only GPU that is overclocked slightly has had NO problems; the others are at stock settings. Both Palit 9800GT 1GB GPUs have massive heat-pipe heatsinks, with RAM cooling on the rear as well, and run at 40-46C max. The Gigabyte 8800GT has copper heat-pipe cooling too and runs at 44-55C depending on the WU. The Inno 8800GT 512MB is at stock settings and cooling, gets to 65C max depending on the WU, and has a 120mm side fan blowing onto the GPU; it will have a TR HR-03 going on as soon as I get new thermal tape. All the GPUs are G92.

So, as I said, some WUs are having problems, and I'm not the only one affected, as posted here: viewtopic.php?f=19&t=9254
Some of my Team32 members have also had problems.
I don't really care too much that these don't produce points, as I mainly do it for the science, and if they're not completed or send back bad results, it's you guys at Stanford who are losing out.
I'm just the bunny (pensioner) who forks out for the power bill to run these in the name of science, to help the greater cause of finding cures for diseases.


Oh, I forgot: you already know that there was a problem with NaNs in the P57** series of WUs, posted here back in January: viewtopic.php?f=52&t=7965
Jolly-Swagman
Posts: 11
Joined: Tue Jul 01, 2008 9:18 am

Re: 57** series WU,s Nans Detected

Post by Jolly-Swagman »

I deleted the work folders and queue.dat on the problematic GPU machines, and they seem to be running OK for now.

It seems that P5777 (Run 7, Clone 286, Gen 0) was the main WU going belly up.
Jolly-Swagman
Posts: 11
Joined: Tue Jul 01, 2008 9:18 am

Re: 57** series WU,s Nans Detected

Post by Jolly-Swagman »

Well, most of the other P57** series WUs have run OK.

BUT I am still having problems with this one: Project: 5777 (Run 10, Clone 153, Gen 24)

It seems to be only the P5777 WUs.

Code: Select all

[16:07:30] Project: 5777 (Run 10, Clone 153, Gen 24)
[16:07:30] 
[16:07:30] Assembly optimizations on if available.
[16:07:30] Entering M.D.
[16:07:36] Working on Protein
[16:07:38] Client config found, loading data.
[16:07:38] Starting GUI Server
[16:10:03] Completed 1%
[16:12:26] Completed 2%
[16:14:51] Completed 3%
[16:17:15] Completed 4%
[16:19:39] Completed 5%
[16:22:03] Completed 6%
[16:24:27] Completed 7%
[16:26:52] Completed 8%
[16:29:15] Completed 9%
[16:31:39] Completed 10%
[16:34:04] Completed 11%
[16:36:28] Completed 12%
[16:38:52] Completed 13%
[16:41:25] Run: exception thrown during GuardedRun
[16:41:25] Run: exception thrown in GuardedRun -- Gromacs cannot continue further.
[16:41:25] Going to send back what have done -- stepsTotalG=20000000
[16:41:25] Work fraction=0.1385 steps=20000000.
[16:41:25] logfile size=13169 infoLength=13169 edr=0 trr=23
[16:41:25] - Writing 13705 bytes of core data to disk...
[16:41:25] Done: 13193 -> 4286 (compressed to 32.4 percent)
[16:41:25]   ... Done.
Jolly-Swagman
Posts: 11
Joined: Tue Jul 01, 2008 9:18 am

Re: 57** series WU,s Nans Detected(possible Solution)

Post by Jolly-Swagman »

I found out from some users, ChasR among them, that there seems to be a problem with some clients not clearing out the work files in the work folder. The client goes from WU_00 to WU_09 and then wraps around, so the next WU will be WU_00 again. If the old files have not been cleaned out, this new WU_00 runs into the orphaned files and goes belly up, and so does any new WU in slots 00-09; once there are no files left in the work folder, they run OK.

So after a reinstall of the client, and close observation that the redundant files are being cleared out, the client works OK.
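
The check described above, looking for files left behind in slots the client is no longer using, can be sketched roughly as follows. This is an illustration under assumptions, not something taken from the client itself: it assumes per-slot files carry a two-digit slot number at the end of the name (for example `wudata_03.dat`), and the names `files_by_slot` and `leftover_slots` are hypothetical.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# Assumption: per-slot files end in _NN before the extension, e.g. wudata_03.dat.
SLOT_RE = re.compile(r"_(\d{2})(?:\.[^.]*)?$")


def files_by_slot(work_dir):
    """Group the files in work_dir by the slot number embedded in their names."""
    slots = defaultdict(list)
    for path in sorted(Path(work_dir).iterdir()):
        if not path.is_file():
            continue
        match = SLOT_RE.search(path.name)
        if match:
            slots[int(match.group(1))].append(path.name)
    return dict(slots)


def leftover_slots(work_dir, active_slot=None):
    """Return slots that still hold files although they are not the active slot."""
    return {
        slot: names
        for slot, names in files_by_slot(work_dir).items()
        if slot != active_slot
    }


if __name__ == "__main__":
    # Demo on a throwaway directory; point work_dir at a real work folder instead.
    with tempfile.TemporaryDirectory() as work_dir:
        for name in ("wudata_00.dat", "wuresults_00.dat", "wudata_03.dat"):
            (Path(work_dir) / name).write_text("")
        for slot, names in sorted(leftover_slots(work_dir, active_slot=3).items()):
            print(f"slot {slot:02d} still has: {', '.join(names)}")
```

If the listing is not empty while the client is working on a single WU, those are the orphaned files that a later WU assigned to the same slot would run into.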
shdbcamping
Posts: 81
Joined: Mon Nov 10, 2008 7:57 am
Hardware configuration: XPS 720 Q6600 9800GX2 3gig RAM
750W primary PSU 650W Aux VGA PSU

Re: 57** series WU,s Nans Detected(possible Solution)

Post by shdbcamping »

I have an 8800GT 512 in a Vista32 system. It will run fine for a while and then do this as well. I also delete the work folder, as by then it has all slots populated. I always also delete the core_11 or Core_14 files to force a re-download of the core(s). I have resigned myself to just performing this procedure when it happens (always when I'm away), as it is not all that common in my case.

I believe the stray files are a result of the failure type for a WU. I've just not paid attention to what that common denominator is.
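
For anyone who wants to script that reset rather than do it by hand, here is a cautious sketch that first builds the list of files and only deletes them in a separate step. Everything specific in it is an assumption: the folder name `work`, the patterns `FahCore_11*` and `FahCore_14*` (adjust them to whatever the core files are actually called on your install), and the function names `plan_cleanup` and `apply_cleanup`. It should only be run with the client stopped.

```python
from pathlib import Path


def plan_cleanup(client_dir, work_dirname="work",
                 core_patterns=("FahCore_11*", "FahCore_14*")):
    """Return the files a manual reset would delete. Nothing is removed here.

    work_dirname and core_patterns are assumptions about the client layout.
    """
    client_dir = Path(client_dir)
    targets = []
    work = client_dir / work_dirname
    if work.is_dir():
        # Every file in the work folder, so no slot keeps orphaned data.
        targets.extend(sorted(p for p in work.iterdir() if p.is_file()))
    for pattern in core_patterns:
        # Core files in the client folder, to force a fresh download.
        targets.extend(sorted(p for p in client_dir.glob(pattern) if p.is_file()))
    return targets


def apply_cleanup(targets):
    """Delete the planned files. Call this only after reviewing the plan."""
    for path in targets:
        path.unlink()


if __name__ == "__main__":
    # Dry run against the current directory: print the plan, delete nothing.
    for path in plan_cleanup("."):
        print("would delete:", path)
```

Keeping the plan and the deletion separate means the list can be checked before anything is lost, including an unfinished WU that might still be worth keeping.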