7644 (510, 0, 5) NANs detected on GPU UNSTABLE_MACHINE

Posted: Mon Apr 30, 2012 2:33 pm
by metalmayhem
First of all, no, I'm not overclocked. Project: 7644 (Run 510, Clone 0, Gen 5)

My GPU has been trying to finish this WU for 4 days! I think it has failed more than 10 times.

The GTX 580 folded fine with driver 296.10 for a week at 888/2055/1100mV until this specific WU came up. After the second failure I dropped back to stock.

That didn't help, so I tried the 301.24 beta driver with stock clocks and voltage. It still fails.

Then I tried driver 285.62, since posts in this forum describe it as very reliable. I keep getting the same error, and I am now thinking of dropping this WU.

The most frustrating thing is that it fails at around 75-85% of WU completion, which takes about 8-9 hours to reach.

Any suggestions?

I should also mention that I've been getting about 28K less PPD combined between my CPU and GPU since I started using the V7 client. I'm going to drop back to GPU3 v6.41. I checked, and there's only one instance of the SMP client running.

Latest fail:

12:47:53:WU00:FS00:0x15:Completed   1825000 out of 2500000 steps (73%).
12:55:27:WU00:FS00:0x15:Completed   1850000 out of 2500000 steps (74%).
13:03:03:WU00:FS00:0x15:Completed   1875000 out of 2500000 steps (75%).
13:10:35:WU00:FS00:0x15:Completed   1900000 out of 2500000 steps (76%).
13:18:10:WU00:FS00:0x15:Completed   1925000 out of 2500000 steps (77%).
13:34:57:WU00:FS00:0x15:Completed   1950000 out of 2500000 steps (78%).
13:34:57:WU00:FS00:0x15:mdrun_gpu returned 52
13:34:57:WU00:FS00:0x15:NANs detected on GPU
13:34:57:WU00:FS00:0x15:
13:34:57:WU00:FS00:0x15:Folding@home Core Shutdown: UNSTABLE_MACHINE
13:34:57:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
13:34:57:WU00:FS00:Starting

Re: 7644 (510, 0, 5) NANs detected on GPU UNSTABLE_MACHINE

Posted: Mon Apr 30, 2012 4:38 pm
by sortofageek
Thanks for the report. As there is no info in the database thus far, I will mark Project: 7644 (Run 510, Clone 0, Gen 5) for follow-up.

It might help to know your folding name and team number when reviewing database results on this WU.

Re: 7644 (510, 0, 5) NANs detected on GPU UNSTABLE_MACHINE

Posted: Mon Apr 30, 2012 5:15 pm
by metalmayhem
Thanks for replying. Here's the requested info:

Folding name: metalmayhem1
Team no.: 37726

As for this WU, what do I do? I'm certainly not making any progress for either science or my points. Do I drop it?

Re: 7644 (510, 0, 5) NANs detected on GPU UNSTABLE_MACHINE

Posted: Mon Apr 30, 2012 5:17 pm
by sortofageek
Thanks for the info. I can't see enough of your log to tell where you are and where you've been with this WU. I'm not one to consider dumping a WU as long as I believe there might be a chance for success, though.

Re: 7644 (510, 0, 5) NANs detected on GPU UNSTABLE_MACHINE

Posted: Mon Apr 30, 2012 6:40 pm
by metalmayhem
sortofageek wrote: I can't see enough of your log to tell where you are and where you've been with this WU.
I posted only the last few steps before the unit failed. I don't think I can provide logs of the previous failures, since I restarted several times while troubleshooting.
sortofageek wrote: I'm not one to consider dumping a WU as long as I believe there might be a chance for success, though.
This I understand considering your position on the forum.

I'll give this WU maybe another day or so, with some overvolting and underclocking (I really don't know why), to finish. I can't sit on this one forever, given that the expiration date is 05/15, there is no explanation for why it fails repeatedly, and the "chance for success" is unknown. I also need to get my hardware ready, as the Chimp Challenge is coming up.

Re: 7644 (510, 0, 5) NANs detected on GPU UNSTABLE_MACHINE

Posted: Mon Apr 30, 2012 11:16 pm
by bruce
metalmayhem wrote: I posted only the last few steps before the unit failed. I don't think I can provide logs of the previous failures, since I restarted several times while troubleshooting.
For future reference, "several times" won't overwrite your log. By default, you should find logs from the past 16 restarts all neatly arranged by date and time in the "logs" directory.
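
If it helps with reviewing those older logs, here is a rough Python sketch that pulls out each UNSTABLE_MACHINE failure together with the last progress percentage logged before it. It assumes the line format shown in the log excerpt earlier in this thread; the function name find_failures is made up for illustration and is not part of the client.

```python
import re

# Patterns based on the log lines quoted in this thread (an assumption:
# other client or core versions may format these lines differently).
PROGRESS = re.compile(r"Completed\s+(\d+) out of (\d+) steps\s+\((\d+)%\)")
CORE_RETURN = re.compile(r"FahCore returned: (\w+) \((\d+) = 0x[0-9a-fA-F]+\)")


def find_failures(lines, status="UNSTABLE_MACHINE"):
    """Return (timestamp, status, last_percent) for each matching core return.

    last_percent is the most recent progress percentage seen before the
    failure, or None if no progress line preceded it.
    """
    failures = []
    last_percent = None
    for line in lines:
        progress = PROGRESS.search(line)
        if progress:
            last_percent = int(progress.group(3))
            continue
        returned = CORE_RETURN.search(line)
        if returned and returned.group(1) == status:
            # The first 8 characters of a log line are the HH:MM:SS timestamp.
            failures.append((line[:8], returned.group(1), last_percent))
            last_percent = None  # the next attempt starts a fresh count
    return failures


if __name__ == "__main__":
    sample = [
        "13:18:10:WU00:FS00:0x15:Completed   1925000 out of 2500000 steps (77%).",
        "13:34:57:WU00:FS00:0x15:Completed   1950000 out of 2500000 steps (78%).",
        "13:34:57:WU00:FS00:0x15:NANs detected on GPU",
        "13:34:57:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
    ]
    print(find_failures(sample))
```

To check a saved log, open the file and pass the file object to find_failures, since iterating over it yields one line at a time. Running that over each file in the "logs" directory would show whether the failures always land in the same percentage range.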

Re: 7644 (510, 0, 5) NANs detected on GPU UNSTABLE_MACHINE

Posted: Tue May 01, 2012 9:36 am
by metalmayhem
bruce wrote: For future reference, "several times" won't overwrite your log. By default, you should find logs from the past 16 restarts all neatly arranged by date and time in the "logs" directory.
Thanks for the tip. Yes, I found the logs in the designated folder. Do you guys want me to post all of them?

Even with a mild underclock and overvolt, the WU failed 3 more times in the last 17 hours.

Re: 7644 (510, 0, 5) NANs detected on GPU UNSTABLE_MACHINE

Posted: Wed Jun 06, 2012 2:41 pm
by sortofageek
Still nothing in the database. It must be a bad WU. :(