
Project: 7621 (Run 372, Clone 0, Gen 3) NANs detected on GPU

Posted: Mon Aug 22, 2011 4:53 am
by GreyWhiskers
One of the new 7621 WUs failed with Unstable Machine w/NANs, and I was reissued the same WU right afterwards. The second time around it completed successfully, one more has completed since, and I am at 95% on the next. It should be more stable at this backoff.

I have had the GTX 560Ti overclocked to a core clock of 953 MHz for months - after the NAN, I backed it off to 930 MHz. BTW, GPU temps have remained good - less than 70 deg C.
[03:28:41] Completed 36400000 out of 40000000 steps (91%).
[03:28:42] mdrun_gpu returned 52
[03:28:42] NANs detected on GPU
[03:28:42]
[03:28:42] Folding@home Core Shutdown: UNSTABLE_MACHINE
[03:28:45] CoreStatus = 7A (122)

Code:


--- Opening Log file [August 20 22:32:34 UTC] 


# Windows GPU Systray Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.41r2

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Users\Al\AppData\Roaming\Folding@home-gpu
Arguments: -verbosity 9 -advmethods 

[22:32:34] - Ask before connecting: No
[22:32:34] - User name: GreyWhiskers (Team 0)
[22:32:34] - User ID: 51EA5C9A7EF9D58E
[22:32:34] - Machine ID: 3
[22:32:34] 
[22:32:34] Gpu type=3 species=21.
[22:32:34] Loaded queue successfully.
[22:32:34] Initialization complete
[22:32:34] 
[22:32:34] + Processing work unit
[22:32:34] - Autosending finished units... [August 20 22:32:34 UTC]
[22:32:34] Trying to send all finished work units
[22:32:34] + No unsent completed units remaining.
[22:32:34] - Autosend completed
[22:32:35] Core required: FahCore_15.exe
[22:32:35] Core found.
[22:32:35] Working on queue slot 04 [August 20 22:32:35 UTC]
[22:32:35] + Working ...
[22:32:35] - Calling '.\FahCore_15.exe -dir work/ -suffix 04 -nice 19 -checkpoint 15 -verbose -lifeline 2608 -version 641'

[22:32:35] 
[22:32:35] *------------------------------*
[22:32:35] Folding@Home GPU Core
[22:32:35] Version                2.20 (Tue Aug 2 12:06:37 PDT 2011)
[22:32:35] Build host             SimbiosNvdWin7
[22:32:35] Board Type             NVIDIA/CUDA
[22:32:35] Core                   15
[22:32:35] 
[22:32:35] Window's signal control handler registered.
[22:32:35] Preparing to commence simulation
[22:32:35] - Ensuring status. Please wait.
[22:32:45] - Looking at optimizations...
[22:32:45] - Working with standard loops on this execution.
[22:32:45] - Previous termination of core was improper.
[22:32:45] - Files status OK
[22:32:45] sizeof(CORE_PACKET_HDR) = 512 file=<>
[22:32:45] - Expanded 124817 -> 501826 (decompressed 402.0 percent)
[22:32:45] Called DecompressByteArray: compressed_data_size=124817 data_size=501826, decompressed_data_size=501826 diff=0
[22:32:45] - Digital signature verified
[22:32:45] 
[22:32:45] Project: 7621 (Run 372, Clone 0, Gen 3)
[22:32:45] 
[22:32:45] Entering M.D.
[22:32:47] Will resume from checkpoint file work/wudata_04.ckp
[22:32:47] Tpr hash work/wudata_04.tpr:  135446047 2298027851 2561396342 2124944768 3566292881
[22:32:47] calling fah_main gpuDeviceId=0
[22:32:47] Working on Protein
[22:32:47] Client config found, loading data.
[22:32:47] Starting GUI Server
[22:33:58] Resuming from checkpoint
[22:33:58] fcCheckPointResume: retreived and current tpr file hash:
[22:33:58]    0    135446047    135446047
[22:33:58]    1   2298027851   2298027851
[22:33:58]    2   2561396342   2561396342
[22:33:58]    3   2124944768   2124944768
[22:33:58]    4   3566292881   3566292881
[22:33:58] fcCheckPointResume: file hashes same.
[22:33:58] fcCheckPointResume: state restored.
[22:33:58] fcCheckPointResume: name work/wudata_04.log Verified work/wudata_04.log
[22:33:58] fcCheckPointResume: name work/wudata_04.trr Verified work/wudata_04.trr
[22:33:58] fcCheckPointResume: name work/wudata_04.xtc Verified work/wudata_04.xtc
[22:33:58] fcCheckPointResume: name work/wudata_04.edr Verified work/wudata_04.edr
[22:33:58] fcCheckPointResume: state restored 2

[22:33:58] Resumed from checkpoint
[22:33:58] Setting checkpoint frequency: 400000
[22:33:58] Completed  12800001 out of 40000000 steps (32%).
[22:38:58] Completed  13200000 out of 40000000 steps (33%).
[22:43:57] Completed  13600000 out of 40000000 steps (34%).
[22:48:57] Completed  14000000 out of 40000000 steps (35%).
[22:53:57] Completed  14400000 out of 40000000 steps (36%).
[22:58:56] Completed  14800000 out of 40000000 steps (37%).
[23:03:56] Completed  15200000 out of 40000000 steps (38%).
[23:08:56] Completed  15600000 out of 40000000 steps (39%).
[23:13:56] Completed  16000000 out of 40000000 steps (40%).
[23:18:55] Completed  16400000 out of 40000000 steps (41%).
[23:23:55] Completed  16800000 out of 40000000 steps (42%).
[23:28:54] Completed  17200000 out of 40000000 steps (43%).
[23:33:55] Completed  17600000 out of 40000000 steps (44%).
[23:38:54] Completed  18000000 out of 40000000 steps (45%).
[23:43:54] Completed  18400000 out of 40000000 steps (46%).
[23:48:53] Completed  18800000 out of 40000000 steps (47%).
[23:53:53] Completed  19200000 out of 40000000 steps (48%).
[23:58:53] Completed  19600000 out of 40000000 steps (49%).
[00:03:53] Completed  20000000 out of 40000000 steps (50%).
[00:08:52] Completed  20400000 out of 40000000 steps (51%).
[00:13:52] Completed  20800000 out of 40000000 steps (52%).
[00:18:52] Completed  21200000 out of 40000000 steps (53%).
[00:23:52] Completed  21600000 out of 40000000 steps (54%).
[00:28:51] Completed  22000000 out of 40000000 steps (55%).
[00:33:51] Completed  22400000 out of 40000000 steps (56%).
[00:38:51] Completed  22800000 out of 40000000 steps (57%).
[00:43:50] Completed  23200000 out of 40000000 steps (58%).
[00:48:50] Completed  23600000 out of 40000000 steps (59%).
[00:53:50] Completed  24000000 out of 40000000 steps (60%).
[00:58:50] Completed  24400000 out of 40000000 steps (61%).
[01:03:49] Completed  24800000 out of 40000000 steps (62%).
[01:08:49] Completed  25200000 out of 40000000 steps (63%).
[01:13:49] Completed  25600000 out of 40000000 steps (64%).
[01:18:49] Completed  26000000 out of 40000000 steps (65%).
[01:23:48] Completed  26400000 out of 40000000 steps (66%).
[01:28:48] Completed  26800000 out of 40000000 steps (67%).
[01:33:48] Completed  27200000 out of 40000000 steps (68%).
[01:38:48] Completed  27600000 out of 40000000 steps (69%).
[01:43:47] Completed  28000000 out of 40000000 steps (70%).
[01:48:47] Completed  28400000 out of 40000000 steps (71%).
[01:53:46] Completed  28800000 out of 40000000 steps (72%).
[01:58:46] Completed  29200000 out of 40000000 steps (73%).
[02:03:46] Completed  29600000 out of 40000000 steps (74%).
[02:08:46] Completed  30000000 out of 40000000 steps (75%).
[02:13:45] Completed  30400000 out of 40000000 steps (76%).
[02:18:45] Completed  30800000 out of 40000000 steps (77%).
[02:23:45] Completed  31200000 out of 40000000 steps (78%).
[02:28:45] Completed  31600000 out of 40000000 steps (79%).
[02:33:44] Completed  32000000 out of 40000000 steps (80%).
[02:38:44] Completed  32400000 out of 40000000 steps (81%).
[02:43:44] Completed  32800000 out of 40000000 steps (82%).
[02:48:43] Completed  33200000 out of 40000000 steps (83%).
[02:53:43] Completed  33600000 out of 40000000 steps (84%).
[02:58:43] Completed  34000000 out of 40000000 steps (85%).
[03:03:42] Completed  34400000 out of 40000000 steps (86%).
[03:08:42] Completed  34800000 out of 40000000 steps (87%).
[03:13:42] Completed  35200000 out of 40000000 steps (88%).
[03:18:42] Completed  35600000 out of 40000000 steps (89%).
[03:23:41] Completed  36000000 out of 40000000 steps (90%).
[03:28:41] Completed  36400000 out of 40000000 steps (91%).
[03:28:42] mdrun_gpu returned 52
[03:28:42] NANs detected on GPU
[03:28:42] 
[03:28:42] Folding@home Core Shutdown: UNSTABLE_MACHINE
[03:28:45] CoreStatus = 7A (122)
[03:28:45] Sending work to server
[03:28:45] Project: 7621 (Run 372, Clone 0, Gen 3)
[03:28:45] - Read packet limit of 540015616... Set to 524286976.
[03:28:45] - Error: Could not get length of results file work/wuresults_04.dat
[03:28:45] - Error: Could not read unit 04 file. Removing from queue.

[03:28:51] 
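For anyone who wants to check their own FAHlog for the same failure, here is a minimal sketch in Python. It assumes the log uses the same `[HH:MM:SS] message` layout as the one above; the function name `summarize` and the returned dictionary keys are made up for illustration and are not part of the Folding@home client. It reports whether a NAN line appeared, which shutdown code was logged, the last completed step, and the average wall-clock time between progress lines (handling the rollover at midnight).

```python
import re

# Matches "[HH:MM:SS] message" lines as they appear in the client log.
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s?(.*)$")
STEP_RE = re.compile(r"Completed\s+(\d+) out of (\d+) steps")
SHUTDOWN_RE = re.compile(r"Core Shutdown: (\w+)")

# A few lines copied from the log above, used as sample input.
SAMPLE = """\
[23:58:53] Completed  19600000 out of 40000000 steps (49%).
[00:03:53] Completed  20000000 out of 40000000 steps (50%).
[00:08:52] Completed  20400000 out of 40000000 steps (51%).
[03:28:42] NANs detected on GPU
[03:28:42] Folding@home Core Shutdown: UNSTABLE_MACHINE
"""


def summarize(log_text):
    """Summarize progress and failure markers in a client log (hypothetical helper)."""
    progress = []  # (seconds since midnight, completed steps)
    nan_detected = False
    shutdown = None

    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # banner lines, blank lines, etc.
        hours, minutes, seconds, message = match.groups()
        stamp = int(hours) * 3600 + int(minutes) * 60 + int(seconds)

        step = STEP_RE.search(message)
        if step:
            progress.append((stamp, int(step.group(1))))
        elif "NANs detected" in message:
            nan_detected = True
        else:
            code = SHUTDOWN_RE.search(message)
            if code:
                shutdown = code.group(1)

    # Time between consecutive progress lines; "% 86400" handles midnight.
    deltas = [
        (later[0] - earlier[0]) % 86400
        for earlier, later in zip(progress, progress[1:])
    ]
    average = sum(deltas) / len(deltas) if deltas else None

    return {
        "nan_detected": nan_detected,
        "shutdown": shutdown,
        "last_step": progress[-1][1] if progress else None,
        "avg_seconds_per_interval": average,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it against a real log, read the file into a string and pass it to `summarize`. The timestamps carry no date, so an interval longer than 24 hours would be misreported.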

Re: Project: 7621 (Run 372, Clone 0, Gen 3) NANs detected on

Posted: Mon Aug 22, 2011 5:06 am
by geokilla
The new WUs are a lot more stressful than the old GPU WUs, so crashes are not uncommon. My GTX 460 used to fold fine at 62C with 860 core. Now I have to back it down to 840 core and the temps are at 70C. Fan speed is auto.

Re: Project: 7621 (Run 372, Clone 0, Gen 3) NANs detected on

Posted: Mon Aug 22, 2011 12:26 pm
by uncle fuzzy
The higher the number on your card, the less likely you will be to see a temperature rise. However, all cards seem to be equally open to the OC NAN. I have less capable cards, but they fold these fine at lower clocks.

GTX460- dropped from 850 to 825 (72C, max fan)
GTS450- dropped from 950 to 875 (75C, max fan)
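To put the clock reductions reported so far in this thread side by side, here is a small sketch that works out each backoff as a percentage of the original overclock. Only the 560 Ti figures were posted with units; treating the other numbers as core clocks in MHz is an assumption, and `backoff_percent` is just an illustrative helper name.

```python
# (card, old core clock, new core clock) as reported in the posts above.
# Assumption: all values are core clocks in MHz.
REPORTS = [
    ("GTX 560 Ti", 953, 930),
    ("GTX 460", 860, 840),
    ("GTX 460", 850, 825),
    ("GTS 450", 950, 875),
]


def backoff_percent(old_mhz, new_mhz):
    """Return the clock reduction as a percentage of the old clock."""
    return 100.0 * (old_mhz - new_mhz) / old_mhz


for card, old, new in REPORTS:
    print(f"{card}: {old} -> {new} MHz ({backoff_percent(old, new):.1f}% lower)")
```

These are four self-reported data points, so they only give a rough idea of how much headroom the new WUs take away.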

Re: Project: 7621 (Run 372, Clone 0, Gen 3) NANs detected on

Posted: Mon Aug 22, 2011 2:52 pm
by schwancr
Thanks for your input here, everyone. It seems that turning down the clock rate may be necessary to finish these WUs.

-Christian

Re: Project: 7621 (Run 372, Clone 0, Gen 3) NANs detected on

Posted: Mon Aug 22, 2011 4:10 pm
by GreyWhiskers
BTW, another quick stat on folding the 7620/7621s. Both MSI Afterburner and GPU-Z show memory usage of 513 MB (out of the 1024 MB). This is significantly higher than I remember getting on the p680x WUs. I don't have the exact mem usage numbers for the p680x, but I do remember noting that it seemed very low the last time I looked.

One other observation, similar to others. The system seems pretty sluggish with the SMP and GPU together compared with before. It may be the video, it may be the CPU, it may be the wireless keyboard and mouse - but with SMP 8 and the 7620/7621s, the system is getting a real workout.

Re: Project: 7621 (Run 372, Clone 0, Gen 3) NANs detected on

Posted: Tue Aug 23, 2011 9:28 am
by ra_alfaomega
With a 460 Hawk on 68xx projects, my card was overclocked to 925 and the max temp was 77. Now I have a 7621 project, and at 912 MHz my temps are around 88C. I am disappointed about the PPD, because it is about the same as with 68xx projects. Considering the size of the project and the heat it produces, I think the points are not high enough. So I will not fold these projects anymore until the points system is reconsidered. Does anyone else have the same opinion about the points for these projects?

Re: Project: 7621 (Run 372, Clone 0, Gen 3) NANs detected on

Posted: Tue Aug 23, 2011 3:55 pm
by bruce
The points are determined by the official benchmarking policy, which has not changed (and probably won't). The fact that FAH is now able to make more effective use of the GPU resources is considered a good thing, but it's unrelated to the benchmarking process. Others have discovered that although they were able to overclock their GPU while it was running only inefficient projects, they now know why the manufacturer established the standard clock rate.

If you choose to stop folding, that's your own personal decision, but please don't attempt to recruit others to do the same.

Re: Project: 7621 (Run 372, Clone 0, Gen 3) NANs detected on

Posted: Wed Aug 24, 2011 10:12 am
by ra_alfaomega
Everybody has a choice, and I don't want anybody to follow my opinion. I just wanted to know if I am the only one who thinks that a much bigger project deserves more points. I don't want to quit folding, and I have been folding for 2 years (almost 4 million points). Since yesterday I have had better airflow in my case, and my GPU temps on the 7621 project dropped almost 10C. Maybe I will reconsider folding 762X projects again :) all the more so as it is so important for researchers at this time. Thank you Bruce for your answer!