Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Tue Nov 16, 2010 9:54 pm
by shunter
I have run some 380 units (including 6800s) on this GTX 460 with no problems, but I got my first 6811 yesterday and so far it has failed about 10 times due to NaNs being detected. Normally this happens between 9% and 11%, but once it got as far as 55%, and other times it has failed as low as 6%. In the past I have deleted the unit and reloaded a few times to get a new unit, then reported the old one as possibly faulty. This time I have been unable to do this. I am not cherry-picking, but if I cannot complete this unit and cannot get another, there's no point in folding on this card. Any ideas or suggestions, please?
Thanks
Shunter
PS: I can provide logs if required.

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Tue Nov 16, 2010 10:17 pm
by sswilson
Is your machine overclocked at all? If so, try running it at stock speeds.

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Tue Nov 16, 2010 11:09 pm
by shunter
The PC is definitely overclocked, as it's from OverclockersUK and has an i7 930 running at 4011 MHz. The card may be overclocked too, but as I'm not a real techie I'm not sure. Is there anything I can run that will tell me?
Thanks

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Wed Nov 17, 2010 1:35 am
by HenryW
Yup, download and run GPU-Z and it will give you all the info you need on the graphics card. While you're at it, you can download MSI Afterburner to adjust the GPU core clock if needed.

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Wed Nov 17, 2010 4:30 pm
by sortofageek
There is no data back on Project: 6811 (Run 0 Clone 139 Gen 1) yet, but I'll mark it for followup checks.

Thanks for your report. It's always a good idea to use the -verbosity 9 flag and to include your relevant logs when posting a trouble report, as well as your machine specs and info about overclocking and/or other special configurations.

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Wed Nov 17, 2010 5:44 pm
by shunter
Thanks for the replies.
The card is a GTX 460 running at default per GPU-Z, i.e. clock 763 MHz, memory 950 MHz and shader 1526 MHz. I've never had much luck overclocking my own hardware, so if the components are overclocked, I bought them that way. This is the first project on this card where I have noticed the card's fan speed varying.

I have started the unit again and it has now failed at less than 1%. An extract of the log file is below.

--- Opening Log file [November 17 17:17:35 UTC] 


# Windows GPU Systray Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.30r2

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Users\David\AppData\Roaming\Folding@home-gpu
Arguments: -advmethods -verbosity 9 

[17:17:35] - Ask before connecting: No
[17:17:35] - User name: shunter (Team 46590)
[17:17:35] - User ID: 1C43E3D90E93A5B1
[17:17:35] - Machine ID: 2
[17:17:35] 
[17:17:35] Gpu type=3 species=30.
[17:17:35] Work directory not found. Creating...
[17:17:35] Could not open work queue, generating new queue...
[17:17:36] Initialization complete
[17:17:36] - Preparing to get new work unit...
[17:17:36] Cleaning up work directory
[17:17:36] - Autosending finished units... [November 17 17:17:36 UTC]
[17:17:36] Trying to send all finished work units
[17:17:36] + No unsent completed units remaining.
[17:17:36] - Autosend completed
[17:17:36] + Attempting to get work packet
[17:17:36] Passkey found
[17:17:36] - Will indicate memory of 6142 MB
[17:17:36] Gpu type=3 species=30.
[17:17:36] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 10, Stepping: 5
[17:17:36] - Connecting to assignment server
[17:17:36] Connecting to http://assign-GPU.stanford.edu:8080/
[17:17:37] Posted data.
[17:17:37] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[17:17:37] + News From Folding@Home: Welcome to Folding@Home
[17:17:37] Loaded queue successfully.
[17:17:37] Gpu type=3 species=30.
[17:17:37] Sent data
[17:17:37] Connecting to http://171.64.65.64:8080/
[17:17:38] Posted data.
[17:17:38] Initial: 0000; - Receiving payload (expected size: 131502)
[17:17:39] - Downloaded at ~128 kB/s
[17:17:39] - Averaged speed for that direction ~128 kB/s
[17:17:39] + Received work.
[17:17:39] + Closed connections
[17:17:39] 
[17:17:39] + Processing work unit
[17:17:39] Core required: FahCore_15.exe
[17:17:39] Core found.
[17:17:39] Working on queue slot 01 [November 17 17:17:39 UTC]
[17:17:39] + Working ...
[17:17:39] - Calling '.\FahCore_15.exe -dir work/ -suffix 01 -nice 19 -checkpoint 15 -verbose -lifeline 4272 -version 630'

[17:17:40] 
[17:17:40] *------------------------------*
[17:17:40] Folding@Home GPU Core
[17:17:40] Version 2.14 (Thu Nov 11 10:05:53 PST 2010)
[17:17:40] 
[17:17:40] Build host: SimbiosNvdWin7
[17:17:40] Board Type: NVIDIA/CUDA
[17:17:40] Core      : x=15
[17:17:40]  Window's signal control handler registered.
[17:17:40] Preparing to commence simulation
[17:17:40] - Looking at optimizations...
[17:17:40] DeleteFrameFiles: successfully deleted file=work/wudata_01.ckp
[17:17:40] - Created dyn
[17:17:40] - Files status OK
[17:17:40] sizeof(CORE_PACKET_HDR) = 512 file=<>
[17:17:40] - Expanded 130990 -> 541491 (decompressed 413.3 percent)
[17:17:40] Called DecompressByteArray: compressed_data_size=130990 data_size=541491, decompressed_data_size=541491 diff=0
[17:17:40] - Digital signature verified
[17:17:40] 
[17:17:40] Project: 6811 (Run 0, Clone 139, Gen 1)
[17:17:40] 
[17:17:40] Assembly optimizations on if available.
[17:17:40] Entering M.D.
[17:17:42] Tpr hash work/wudata_01.tpr:  1713532816 1773889973 3265068694 1042100616 1966209275
[17:17:42] Working on Protein
[17:17:42] Client config found, loading data.
[17:17:42] Starting GUI Server
[17:19:19] ***** Got a SIGTERM signal (2)
[17:19:19] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [November 17 17:21:16 UTC] 


# Windows GPU Systray Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.30r2

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Users\David\AppData\Roaming\Folding@home-gpu
Arguments: -advmethods -verbosity 9 

[17:21:16] - Ask before connecting: No
[17:21:16] - User name: shunter (Team 46590)
[17:21:16] - User ID: 1C43E3D90E93A5B1
[17:21:16] - Machine ID: 2
[17:21:16] 
[17:21:16] Gpu type=3 species=30.
[17:21:16] Loaded queue successfully.
[17:21:16] Initialization complete
[17:21:16] 
[17:21:16] + Processing work unit
[17:21:16] - Autosending finished units... [November 17 17:21:16 UTC]
[17:21:16] Trying to send all finished work units
[17:21:16] + No unsent completed units remaining.
[17:21:16] - Autosend completed
[17:21:16] Core required: FahCore_15.exe
[17:21:16] Core found.
[17:21:16] Working on queue slot 01 [November 17 17:21:16 UTC]
[17:21:16] + Working ...
[17:21:16] - Calling '.\FahCore_15.exe -dir work/ -suffix 01 -nice 19 -checkpoint 15 -verbose -lifeline 1244 -version 630'

[17:21:16] 
[17:21:16] *------------------------------*
[17:21:16] Folding@Home GPU Core
[17:21:16] Version 2.14 (Thu Nov 11 10:05:53 PST 2010)
[17:21:16] 
[17:21:16] Build host: SimbiosNvdWin7
[17:21:16] Board Type: NVIDIA/CUDA
[17:21:16] Core      : x=15
[17:21:16]  Window's signal control handler registered.
[17:21:16] Preparing to commence simulation
[17:21:16] - Looking at optimizations...
[17:21:16] - Files status OK
[17:21:16] sizeof(CORE_PACKET_HDR) = 512 file=<>
[17:21:16] - Expanded 130990 -> 541491 (decompressed 413.3 percent)
[17:21:16] Called DecompressByteArray: compressed_data_size=130990 data_size=541491, decompressed_data_size=541491 diff=0
[17:21:16] - Digital signature verified
[17:21:16] 
[17:21:16] Project: 6811 (Run 0, Clone 139, Gen 1)
[17:21:16] 
[17:21:16] Assembly optimizations on if available.
[17:21:16] Entering M.D.
[17:21:18] Will resume from checkpoint file work/wudata_01.ckp
[17:21:18] Tpr hash work/wudata_01.tpr:  1713532816 1773889973 3265068694 1042100616 1966209275
[17:21:18] Working on Protein
[17:21:18] Client config found, loading data.
[17:21:18] Starting GUI Server
[17:21:20] Resuming from checkpoint
[17:21:20] fcCheckPointResume: retreived and current tpr file hash:
[17:21:20]    0   1713532816   1713532816
[17:21:20]    1   1773889973   1773889973
[17:21:20]    2   3265068694   3265068694
[17:21:20]    3   1042100616   1042100616
[17:21:20]    4   1966209275   1966209275
[17:21:20] fcCheckPointResume: file hashes same.
[17:21:20] fcCheckPointResume: state restored.
[17:21:20] fcCheckPointResume: name work/wudata_01.log Verified work/wudata_01.log
[17:21:20] fcCheckPointResume: name work/wudata_01.trr Verified work/wudata_01.trr
[17:21:20] fcCheckPointResume: name work/wudata_01.xtc Verified work/wudata_01.xtc
[17:21:20] fcCheckPointResume: name work/wudata_01.edr Verified work/wudata_01.edr
[17:21:20] fcCheckPointResume: state restored 2
[17:21:20] Resumed from checkpoint
[17:22:07] Gpu type=3 species=30.
[17:31:44] Completed    500000 out of 50000000 steps (1%).
[17:31:44] mdrun_gpu returned 52
[17:31:44] NANs detected on GPU
[17:31:44] 
[17:31:44] Folding@home Core Shutdown: UNSTABLE_MACHINE
[17:31:49] CoreStatus = 7A (122)
[17:31:49] Sending work to server
[17:31:49] Project: 6811 (Run 0, Clone 139, Gen 1)
[17:31:49] - Error: Could not get length of results file work/wuresults_01.dat
[17:31:49] - Error: Could not read unit 01 file. Removing from queue.
[17:31:49] Trying to send all finished work units
[17:31:49] + No unsent completed units remaining.
[17:31:49] - Preparing to get new work unit...
[17:31:49] Cleaning up work directory
[17:31:49] + Attempting to get work packet
[17:31:49] Passkey found
[17:31:49] - Will indicate memory of 6142 MB
[17:31:49] Gpu type=3 species=30.
[17:31:49] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 10, Stepping: 5
[17:31:49] - Connecting to assignment server
[17:31:49] Connecting to http://assign-GPU.stanford.edu:8080/
[17:31:50] Posted data.
[17:31:50] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[17:31:50] + News From Folding@Home: Welcome to Folding@Home
[17:31:50] Loaded queue successfully.
[17:31:50] Gpu type=3 species=30.
[17:31:50] Sent data
[17:31:50] Connecting to http://171.64.65.64:8080/
[17:31:50] Posted data.
[17:31:50] Initial: 0000; - Receiving payload (expected size: 131502)
[17:31:52] - Downloaded at ~64 kB/s
[17:31:52] - Averaged speed for that direction ~96 kB/s
[17:31:52] + Received work.
[17:31:52] Trying to send all finished work units
[17:31:52] + No unsent completed units remaining.
[17:31:52] + Closed connections
[17:31:57] 
[17:31:57] + Processing work unit
[17:31:57] Core required: FahCore_15.exe
[17:31:57] Core found.
[17:31:57] Working on queue slot 02 [November 17 17:31:57 UTC]
[17:31:57] + Working ...
[17:31:57] - Calling '.\FahCore_15.exe -dir work/ -suffix 02 -nice 19 -checkpoint 15 -verbose -lifeline 1244 -version 630'

[17:31:57] 
[17:31:57] *------------------------------*
[17:31:57] Folding@Home GPU Core
[17:31:57] Version 2.14 (Thu Nov 11 10:05:53 PST 2010)
[17:31:57] 
[17:31:57] Build host: SimbiosNvdWin7
[17:31:57] Board Type: NVIDIA/CUDA
[17:31:57] Core      : x=15
[17:31:57]  Window's signal control handler registered.
[17:31:57] Preparing to commence simulation
[17:31:57] - Looking at optimizations...
[17:31:57] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[17:31:57] - Created dyn
[17:31:57] - Files status OK
[17:31:57] sizeof(CORE_PACKET_HDR) = 512 file=<>
[17:31:57] - Expanded 130990 -> 541491 (decompressed 413.3 percent)
[17:31:57] Called DecompressByteArray: compressed_data_size=130990 data_size=541491, decompressed_data_size=541491 diff=0
[17:31:57] - Digital signature verified
[17:31:57] 
[17:31:57] Project: 6811 (Run 0, Clone 139, Gen 1)
[17:31:57] 
[17:31:57] Assembly optimizations on if available.
[17:31:57] Entering M.D.
[17:31:59] Tpr hash work/wudata_02.tpr:  1713532816 1773889973 3265068694 1042100616 1966209275
[17:31:59] Working on Protein
[17:31:59] Client config found, loading data.
[17:31:59] Starting GUI Server

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Wed Nov 17, 2010 6:00 pm
by shunter
Please find below the logfile from the last successful unit completion and the details of this unit from when first downloaded until I realised it was crashing and closed it down.

[16:31:30] Completed  49999999 out of 50000000 steps (100%).
[16:31:30] Finished fah_main
[16:31:30] 
[16:31:30] Successful run
[16:31:30] DynamicWrapper: Finished Work Unit: sleep=10000
[16:31:40] Reserved 2443524 bytes for xtc file; Cosm status=0
[16:31:40] Allocated 2443524 bytes for xtc file
[16:31:40] - Reading up to 2443524 from "work/wudata_01.xtc": Read 2443524
[16:31:40] Read 2443524 bytes from xtc file; available packet space=783986940
[16:31:40] xtc file hash check passed.
[16:31:40] Reserved 75840 75840 783986940 bytes for arc file=<work/wudata_01.trr> Cosm status=0
[16:31:40] Allocated 75840 bytes for arc file
[16:31:40] - Reading up to 75840 from "work/wudata_01.trr": Read 75840
[16:31:40] Read 75840 bytes from arc file; available packet space=783911100
[16:31:40] trr file hash check passed.
[16:31:40] Allocated 544 bytes for edr file
[16:31:40] Read bedfile
[16:31:40] edr file hash check passed.
[16:31:40] Allocated 120131 bytes for logfile
[16:31:40] Read logfile
[16:31:40] GuardedRun: success in DynamicWrapper
[16:31:40] GuardedRun: done
[16:31:40] Run: GuardedRun completed.
[16:31:41] + Opened results file
[16:31:41] - Writing 2640551 bytes of core data to disk...
[16:31:42] Done: 2640039 -> 2481396 (compressed to 93.9 percent)
[16:31:42]   ... Done.
[16:31:42] DeleteFrameFiles: successfully deleted file=work/wudata_01.ckp
[16:31:42] Shutting down core 
[16:31:42] 
[16:31:42] Folding@home Core Shutdown: FINISHED_UNIT
[16:31:47] CoreStatus = 64 (100)
[16:31:47] Unit 1 finished with 99 percent of time to deadline remaining.
[16:31:47] Updated performance fraction: 0.987384
[16:31:47] Sending work to server
[16:31:47] Project: 6800 (Run 3455, Clone 1, Gen 1)


[16:31:47] + Attempting to send results [November 15 16:31:47 UTC]
[16:31:47] - Reading file work/wuresults_01.dat from core
[16:31:47]   (Read 2481908 bytes from disk)
[16:31:47] Gpu type=3 species=30.
[16:31:47] Connecting to http://171.64.65.64:8080/
[16:32:01] Posted data.
[16:32:01] Initial: 0000; - Uploaded at ~173 kB/s
[16:32:01] - Averaged speed for that direction ~160 kB/s
[16:32:01] + Results successfully sent
[16:32:01] Thank you for your contribution to Folding@Home.
[16:32:01] + Number of Units Completed: 380

[16:32:05] Trying to send all finished work units
[16:32:05] + No unsent completed units remaining.
[16:32:05] - Preparing to get new work unit...
[16:32:05] Cleaning up work directory
[16:32:05] + Attempting to get work packet
[16:32:05] Passkey found
[16:32:05] - Will indicate memory of 6142 MB
[16:32:05] Gpu type=3 species=30.
[16:32:05] - Connecting to assignment server
[16:32:05] Connecting to http://assign-GPU.stanford.edu:8080/
[16:32:06] Posted data.
[16:32:06] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[16:32:06] + News From Folding@Home: Welcome to Folding@Home
[16:32:06] Loaded queue successfully.
[16:32:06] Gpu type=3 species=30.
[16:32:06] Sent data
[16:32:06] Connecting to http://171.64.65.64:8080/
[16:32:07] Posted data.
[16:32:07] Initial: 0000; - Receiving payload (expected size: 131502)
[16:32:08] - Downloaded at ~128 kB/s
[16:32:08] - Averaged speed for that direction ~62 kB/s
[16:32:08] + Received work.
[16:32:08] Trying to send all finished work units
[16:32:08] + No unsent completed units remaining.
[16:32:08] + Closed connections
[16:32:08] 
[16:32:08] + Processing work unit
[16:32:08] Core required: FahCore_15.exe
[16:32:08] Core found.
[16:32:08] Working on queue slot 02 [November 15 16:32:08 UTC]
[16:32:08] + Working ...
[16:32:08] - Calling '.\FahCore_15.exe -dir work/ -suffix 02 -nice 19 -checkpoint 15 -verbose -lifeline 2968 -version 630'

[16:32:09] 
[16:32:09] *------------------------------*
[16:32:09] Folding@Home GPU Core
[16:32:09] Version 2.14 (Thu Nov 11 10:05:53 PST 2010)
[16:32:09] 
[16:32:09] Build host: SimbiosNvdWin7
[16:32:09] Board Type: NVIDIA/CUDA
[16:32:09] Core      : x=15
[16:32:09]  Window's signal control handler registered.
[16:32:09] Preparing to commence simulation
[16:32:09] - Looking at optimizations...
[16:32:09] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[16:32:09] - Created dyn
[16:32:09] - Files status OK
[16:32:09] sizeof(CORE_PACKET_HDR) = 512 file=<>
[16:32:09] - Expanded 130990 -> 541491 (decompressed 413.3 percent)
[16:32:09] Called DecompressByteArray: compressed_data_size=130990 data_size=541491, decompressed_data_size=541491 diff=0
[16:32:09] - Digital signature verified
[16:32:09] 
[16:32:09] Project: 6811 (Run 0, Clone 139, Gen 1)
[16:32:09] 
[16:32:09] Assembly optimizations on if available.
[16:32:09] Entering M.D.
[16:32:11] Tpr hash work/wudata_02.tpr:  1713532816 1773889973 3265068694 1042100616 1966209275
[16:32:11] Working on Protein
[16:32:11] Client config found, loading data.
[16:32:11] Starting GUI Server
[16:42:52] Completed    500000 out of 50000000 steps (1%).
[16:53:35] Completed   1000000 out of 50000000 steps (2%).
[17:04:15] Completed   1500000 out of 50000000 steps (3%).
[17:08:30] - Autosending finished units... [November 15 17:08:30 UTC]
[17:08:30] Trying to send all finished work units
[17:08:30] + No unsent completed units remaining.
[17:08:30] - Autosend completed
[17:08:30] + Working...
[17:14:53] Completed   2000000 out of 50000000 steps (4%).
[17:25:43] Completed   2500000 out of 50000000 steps (5%).
[17:36:30] Completed   3000000 out of 50000000 steps (6%).
[17:47:15] Completed   3500000 out of 50000000 steps (7%).
[17:57:55] Completed   4000000 out of 50000000 steps (8%).
[18:08:05] Completed   4500000 out of 50000000 steps (9%).
[18:08:05] mdrun_gpu returned 52
[18:08:05] NANs detected on GPU
[18:08:05] 
[18:08:05] Folding@home Core Shutdown: UNSTABLE_MACHINE
[18:08:09] CoreStatus = 7A (122)
[18:08:09] Sending work to server
[18:08:09] Project: 6811 (Run 0, Clone 139, Gen 1)
[18:08:09] - Error: Could not get length of results file work/wuresults_02.dat
[18:08:09] - Error: Could not read unit 02 file. Removing from queue.
[18:08:09] Trying to send all finished work units
[18:08:09] + No unsent completed units remaining.
[18:08:09] - Preparing to get new work unit...
[18:08:09] Cleaning up work directory
[18:08:09] + Attempting to get work packet
[18:08:09] Passkey found
[18:08:09] - Will indicate memory of 6142 MB
[18:08:09] Gpu type=3 species=30.
[18:08:09] - Connecting to assignment server
[18:08:09] Connecting to http://assign-GPU.stanford.edu:8080/
[18:08:10] Posted data.
[18:08:10] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[18:08:10] + News From Folding@Home: Welcome to Folding@Home
[18:08:10] Loaded queue successfully.
[18:08:10] Gpu type=3 species=30.
[18:08:10] Sent data
[18:08:10] Connecting to http://171.64.65.64:8080/
[18:08:11] Posted data.
[18:08:11] Initial: 0000; - Receiving payload (expected size: 131502)
[18:08:12] - Downloaded at ~128 kB/s
[18:08:12] - Averaged speed for that direction ~75 kB/s
[18:08:12] + Received work.
[18:08:12] Trying to send all finished work units
[18:08:12] + No unsent completed units remaining.
[18:08:12] + Closed connections
[18:08:17] 
[18:08:17] + Processing work unit
[18:08:17] Core required: FahCore_15.exe
[18:08:17] Core found.
[18:08:17] Working on queue slot 03 [November 15 18:08:17 UTC]
[18:08:17] + Working ...
[18:08:17] - Calling '.\FahCore_15.exe -dir work/ -suffix 03 -nice 19 -checkpoint 15 -verbose -lifeline 2968 -version 630'

[18:08:18] 
[18:08:18] *------------------------------*
[18:08:18] Folding@Home GPU Core
[18:08:18] Version 2.14 (Thu Nov 11 10:05:53 PST 2010)
[18:08:18] 
[18:08:18] Build host: SimbiosNvdWin7
[18:08:18] Board Type: NVIDIA/CUDA
[18:08:18] Core      : x=15
[18:08:18]  Window's signal control handler registered.
[18:08:18] Preparing to commence simulation
[18:08:18] - Looking at optimizations...
[18:08:18] DeleteFrameFiles: successfully deleted file=work/wudata_03.ckp
[18:08:18] - Created dyn
[18:08:18] - Files status OK
[18:08:18] sizeof(CORE_PACKET_HDR) = 512 file=<>
[18:08:18] - Expanded 130990 -> 541491 (decompressed 413.3 percent)
[18:08:18] Called DecompressByteArray: compressed_data_size=130990 data_size=541491, decompressed_data_size=541491 diff=0
[18:08:18] - Digital signature verified
[18:08:18] 
[18:08:18] Project: 6811 (Run 0, Clone 139, Gen 1)
[18:08:18] 
[18:08:18] Assembly optimizations on if available.
[18:08:18] Entering M.D.
[18:08:20] Tpr hash work/wudata_03.tpr:  1713532816 1773889973 3265068694 1042100616 1966209275
[18:08:20] Working on Protein
[18:08:20] Client config found, loading data.
[18:08:20] Starting GUI Server
[18:18:57] Completed    500000 out of 50000000 steps (1%).
[18:29:40] Completed   1000000 out of 50000000 steps (2%).
[18:40:02] Completed   1500000 out of 50000000 steps (3%).
[18:40:02] mdrun_gpu returned 52
[18:40:02] NANs detected on GPU
[18:40:02] 
[18:40:02] Folding@home Core Shutdown: UNSTABLE_MACHINE
[18:40:06] CoreStatus = 7A (122)
[18:40:06] Sending work to server
[18:40:06] Project: 6811 (Run 0, Clone 139, Gen 1)
[18:40:06] - Error: Could not get length of results file work/wuresults_03.dat
[18:40:06] - Error: Could not read unit 03 file. Removing from queue.
[18:40:06] Trying to send all finished work units
[18:40:06] + No unsent completed units remaining.
[18:40:06] - Preparing to get new work unit...
[18:40:06] Cleaning up work directory
[18:40:06] + Attempting to get work packet
[18:40:06] Passkey found
[18:40:06] - Will indicate memory of 6142 MB
[18:40:06] Gpu type=3 species=30.
[18:40:06] - Connecting to assignment server
[18:40:06] Connecting to http://assign-GPU.stanford.edu:8080/
[18:40:07] Posted data.
[18:40:07] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[18:40:07] + News From Folding@Home: Welcome to Folding@Home
[18:40:07] Loaded queue successfully.
[18:40:07] Gpu type=3 species=30.
[18:40:07] Sent data
[18:40:07] Connecting to http://171.64.65.64:8080/
[18:40:08] Posted data.
[18:40:08] Initial: 0000; - Receiving payload (expected size: 131502)
[18:40:09] - Downloaded at ~128 kB/s
[18:40:09] - Averaged speed for that direction ~86 kB/s
[18:40:09] + Received work.
[18:40:09] Trying to send all finished work units
[18:40:09] + No unsent completed units remaining.
[18:40:09] + Closed connections
[18:40:14] 
[18:40:14] + Processing work unit
[18:40:14] Core required: FahCore_15.exe
[18:40:14] Core found.
[18:40:14] Working on queue slot 04 [November 15 18:40:14 UTC]
[18:40:14] + Working ...
[18:40:14] - Calling '.\FahCore_15.exe -dir work/ -suffix 04 -nice 19 -checkpoint 15 -verbose -lifeline 2968 -version 630'

[18:40:14] 
[18:40:14] *------------------------------*
[18:40:14] Folding@Home GPU Core
[18:40:14] Version 2.14 (Thu Nov 11 10:05:53 PST 2010)
[18:40:14] 
[18:40:14] Build host: SimbiosNvdWin7
[18:40:14] Board Type: NVIDIA/CUDA
[18:40:14] Core      : x=15
[18:40:14]  Window's signal control handler registered.
[18:40:14] Preparing to commence simulation
[18:40:14] - Looking at optimizations...
[18:40:14] DeleteFrameFiles: successfully deleted file=work/wudata_04.ckp
[18:40:14] - Created dyn
[18:40:14] - Files status OK
[18:40:14] sizeof(CORE_PACKET_HDR) = 512 file=<>
[18:40:14] - Expanded 130990 -> 541491 (decompressed 413.3 percent)
[18:40:14] Called DecompressByteArray: compressed_data_size=130990 data_size=541491, decompressed_data_size=541491 diff=0
[18:40:14] - Digital signature verified
[18:40:14] 
[18:40:14] Project: 6811 (Run 0, Clone 139, Gen 1)
[18:40:14] 
[18:40:14] Assembly optimizations on if available.
[18:40:14] Entering M.D.
[18:40:16] Tpr hash work/wudata_04.tpr:  1713532816 1773889973 3265068694 1042100616 1966209275
[18:40:16] Working on Protein
[18:40:16] Client config found, loading data.
[18:40:17] Starting GUI Server
[18:45:55] Gpu type=3 species=30.
[18:50:55] Completed    500000 out of 50000000 steps (1%).
[19:01:20] Completed   1000000 out of 50000000 steps (2%).
[19:11:43] Completed   1500000 out of 50000000 steps (3%).
[19:22:11] Completed   2000000 out of 50000000 steps (4%).
[19:32:40] Completed   2500000 out of 50000000 steps (5%).
[19:43:10] Completed   3000000 out of 50000000 steps (6%).
[19:53:19] Completed   3500000 out of 50000000 steps (7%).
[19:53:19] mdrun_gpu returned 52
[19:53:19] NANs detected on GPU
[19:53:19] 
[19:53:19] Folding@home Core Shutdown: UNSTABLE_MACHINE
[19:53:22] CoreStatus = 7A (122)
[19:53:22] Sending work to server
[19:53:22] Project: 6811 (Run 0, Clone 139, Gen 1)
[19:53:22] - Error: Could not get length of results file work/wuresults_04.dat
[19:53:22] - Error: Could not read unit 04 file. Removing from queue.
[19:53:22] Trying to send all finished work units
[19:53:22] + No unsent completed units remaining.
[19:53:22] - Preparing to get new work unit...
[19:53:22] Cleaning up work directory
[19:53:22] + Attempting to get work packet
[19:53:22] Passkey found
[19:53:22] - Will indicate memory of 6142 MB
[19:53:22] Gpu type=3 species=30.
[19:53:22] - Connecting to assignment server
[19:53:22] Connecting to http://assign-GPU.stanford.edu:8080/
[19:53:23] Posted data.
[19:53:23] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[19:53:23] + News From Folding@Home: Welcome to Folding@Home
[19:53:23] Loaded queue successfully.
[19:53:23] Gpu type=3 species=30.
[19:53:23] Sent data
[19:53:23] Connecting to http://171.64.65.64:8080/
[19:53:23] Posted data.
[19:53:23] Initial: 0000; - Receiving payload (expected size: 131502)
[19:53:25] - Downloaded at ~64 kB/s
[19:53:25] - Averaged speed for that direction ~81 kB/s
[19:53:25] + Received work.
[19:53:25] Trying to send all finished work units
[19:53:25] + No unsent completed units remaining.
[19:53:25] + Closed connections
[19:53:30] 
[19:53:30] + Processing work unit
[19:53:30] Core required: FahCore_15.exe
[19:53:30] Core found.
[19:53:30] Working on queue slot 05 [November 15 19:53:30 UTC]
[19:53:30] + Working ...
[19:53:30] - Calling '.\FahCore_15.exe -dir work/ -suffix 05 -nice 19 -checkpoint 15 -verbose -lifeline 2968 -version 630'

[19:53:30] 
[19:53:30] *------------------------------*
[19:53:30] Folding@Home GPU Core
[19:53:30] Version 2.14 (Thu Nov 11 10:05:53 PST 2010)
[19:53:30] 
[19:53:30] Build host: SimbiosNvdWin7
[19:53:30] Board Type: NVIDIA/CUDA
[19:53:30] Core      : x=15
[19:53:30]  Window's signal control handler registered.
[19:53:30] Preparing to commence simulation
[19:53:30] - Looking at optimizations...
[19:53:30] DeleteFrameFiles: successfully deleted file=work/wudata_05.ckp
[19:53:30] - Created dyn
[19:53:30] - Files status OK
[19:53:30] sizeof(CORE_PACKET_HDR) = 512 file=<>
[19:53:30] - Expanded 130990 -> 541491 (decompressed 413.3 percent)
[19:53:30] Called DecompressByteArray: compressed_data_size=130990 data_size=541491, decompressed_data_size=541491 diff=0
[19:53:30] - Digital signature verified
[19:53:30] 
[19:53:30] Project: 6811 (Run 0, Clone 139, Gen 1)
[19:53:30] 
[19:53:30] Assembly optimizations on if available.
[19:53:30] Entering M.D.
[19:53:32] Tpr hash work/wudata_05.tpr:  1713532816 1773889973 3265068694 1042100616 1966209275
[19:53:32] Working on Protein
[19:53:32] Client config found, loading data.
[19:53:32] Starting GUI Server
[20:03:56] Completed    500000 out of 50000000 steps (1%).
[20:14:15] Completed   1000000 out of 50000000 steps (2%).
[20:24:36] Completed   1500000 out of 50000000 steps (3%).
[20:34:57] Completed   2000000 out of 50000000 steps (4%).
[20:45:32] Completed   2500000 out of 50000000 steps (5%).
[20:56:00] Completed   3000000 out of 50000000 steps (6%).
[21:06:34] Completed   3500000 out of 50000000 steps (7%).
[21:17:10] Completed   4000000 out of 50000000 steps (8%).
[21:27:22] Completed   4500000 out of 50000000 steps (9%).
[21:27:22] mdrun_gpu returned 52
[21:27:22] NANs detected on GPU
[21:27:22] 
[21:27:22] Folding@home Core Shutdown: UNSTABLE_MACHINE
[21:27:25] CoreStatus = 7A (122)
[21:27:25] Sending work to server
[21:27:25] Project: 6811 (Run 0, Clone 139, Gen 1)
[21:27:25] - Error: Could not get length of results file work/wuresults_05.dat
[21:27:25] - Error: Could not read unit 05 file. Removing from queue.
[21:27:25] Trying to send all finished work units
[21:27:25] + No unsent completed units remaining.
[21:27:25] - Preparing to get new work unit...
[21:27:25] Cleaning up work directory
[21:27:25] + Attempting to get work packet
[21:27:25] Passkey found
[21:27:25] - Will indicate memory of 6142 MB
[21:27:25] Gpu type=3 species=30.
[21:27:25] - Connecting to assignment server
[21:27:25] Connecting to http://assign-GPU.stanford.edu:8080/
[21:27:26] Posted data.
[21:27:26] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[21:27:26] + News From Folding@Home: Welcome to Folding@Home
[21:27:27] Loaded queue successfully.
[21:27:27] Gpu type=3 species=30.
[21:27:27] Sent data
[21:27:27] Connecting to http://171.64.65.64:8080/
[21:27:28] Posted data.
[21:27:28] Initial: 0000; - Receiving payload (expected size: 131502)
[21:27:32] - Downloaded at ~32 kB/s
[21:27:32] - Averaged speed for that direction ~71 kB/s
[21:27:32] + Received work.
[21:27:32] Trying to send all finished work units
[21:27:32] + No unsent completed units remaining.
[21:27:32] + Closed connections
[21:27:37] 
[21:27:37] + Processing work unit
[21:27:37] Core required: FahCore_15.exe
[21:27:37] Core found.
[21:27:37] Working on queue slot 06 [November 15 21:27:37 UTC]
[21:27:37] + Working ...
[21:27:37] - Calling '.\FahCore_15.exe -dir work/ -suffix 06 -nice 19 -checkpoint 15 -verbose -lifeline 2968 -version 630'

[21:27:37] 
[21:27:37] *------------------------------*
[21:27:37] Folding@Home GPU Core
[21:27:37] Version 2.14 (Thu Nov 11 10:05:53 PST 2010)
[21:27:37] 
[21:27:37] Build host: SimbiosNvdWin7
[21:27:37] Board Type: NVIDIA/CUDA
[21:27:37] Core      : x=15
[21:27:37]  Window's signal control handler registered.
[21:27:37] Preparing to commence simulation
[21:27:37] - Looking at optimizations...
[21:27:37] DeleteFrameFiles: successfully deleted file=work/wudata_06.ckp
[21:27:37] - Created dyn
[21:27:37] - Files status OK
[21:27:37] sizeof(CORE_PACKET_HDR) = 512 file=<>
[21:27:37] - Expanded 130990 -> 541491 (decompressed 413.3 percent)
[21:27:37] Called DecompressByteArray: compressed_data_size=130990 data_size=541491, decompressed_data_size=541491 diff=0
[21:27:37] - Digital signature verified
[21:27:37] 
[21:27:37] Project: 6811 (Run 0, Clone 139, Gen 1)
[21:27:37] 
[21:27:37] Assembly optimizations on if available.
[21:27:37] Entering M.D.
[21:27:39] Tpr hash work/wudata_06.tpr:  1713532816 1773889973 3265068694 1042100616 1966209275
[21:27:39] Working on Protein
[21:27:39] Client config found, loading data.
[21:27:39] Starting GUI Server
[21:38:17] Completed    500000 out of 50000000 steps (1%).
[21:48:51] Completed   1000000 out of 50000000 steps (2%).
[21:59:19] Completed   1500000 out of 50000000 steps (3%).
[22:09:45] Completed   2000000 out of 50000000 steps (4%).
[22:20:15] Completed   2500000 out of 50000000 steps (5%).
[22:30:51] Completed   3000000 out of 50000000 steps (6%).
[22:41:27] Completed   3500000 out of 50000000 steps (7%).
[22:51:56] Completed   4000000 out of 50000000 steps (8%).
[23:00:35] Gpu type=3 species=30.
[23:02:20] Completed   4500000 out of 50000000 steps (9%).
[23:08:30] - Autosending finished units... [November 15 23:08:30 UTC]
[23:08:30] Trying to send all finished work units
[23:08:30] + No unsent completed units remaining.
[23:08:30] - Autosend completed
[23:08:30] + Working...
[23:10:20] Gpu type=3 species=30.
[23:12:41] Completed   5000000 out of 50000000 steps (10%).
[23:20:33] Gpu type=3 species=30.
[23:22:56] Completed   5500000 out of 50000000 steps (11%).
[23:22:56] mdrun_gpu returned 52
[23:22:56] NANs detected on GPU
[23:22:56] 
[23:22:56] Folding@home Core Shutdown: UNSTABLE_MACHINE
[23:23:00] CoreStatus = 7A (122)
[23:23:00] Sending work to server
[23:23:00] Project: 6811 (Run 0, Clone 139, Gen 1)
[23:23:00] - Error: Could not get length of results file work/wuresults_06.dat
[23:23:00] - Error: Could not read unit 06 file. Removing from queue.
[23:23:00] EUE limit exceeded. Pausing 24 hours.
[05:08:30] - Autosending finished units... [November 16 05:08:30 UTC]
[05:08:30] Trying to send all finished work units
[05:08:30] + No unsent completed units remaining.
[05:08:30] - Autosend completed
[05:08:30] + Working...
[07:25:37] Gpu type=3 species=30.
[07:27:53] ***** Got a SIGTERM signal (2)
[07:27:53] Killing all core threads

Folding@Home Client Shutdown.

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Wed Nov 17, 2010 6:39 pm
by bruce
NaNs can be caused by several different conditions. The fact that it's happening at different points in the same WU makes it unlikely to be a problem with the WU itself. That leaves "hardware", which may mean (A) too much heat, (B) too much overclock, or (C) defective hardware.
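
For readers unfamiliar with the term: a NaN ("not a number") is the floating-point value produced by an invalid operation, and once one appears it contaminates everything computed from it. The Python sketch below only illustrates that behaviour. It is not how FahCore_15 performs its check, and the function name `has_nan` is made up for the example.

```python
import math

def has_nan(values):
    """Return True if any value in the iterable is NaN."""
    return any(math.isnan(v) for v in values)

# An invalid operation such as inf - inf yields NaN.
bad = float("inf") - float("inf")

# NaN propagates: arithmetic involving NaN gives NaN again.
positions = [1.0, 2.5, bad + 1.0]

print(has_nan([1.0, 2.5, 3.0]))  # False
print(has_nan(positions))        # True
print(bad == bad)                # False: NaN is not equal to itself
```

A single bad result from marginal hardware is therefore enough to spoil the rest of a run, which fits failures showing up at different points in the same WU.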

The most likely seems to be heat. First, I'll assume that the manufacturer who assembled the GPU on a board used the reference design -- a 2-slot board that blows the heat out the back of the computer. Is the air coming out there quite hot? Is the inside of the case quite hot? Are all of the case fans operational?

You said the GPU fan was changing speed. I've seen a number of cases where the fan profile didn't set the fan speed high enough. You may need to do that to keep the board cool enough.

I'm using MSI Afterburner. It's a tool used primarily for overclocking but it also gives you the ability to monitor the temperature and the fan speed and, if necessary, increase the fan speed. There are other tools which provide similar options.

There's also an environment variable FAH_GPU_IDLE which can be set to 5 or 10 which can reduce the percentage of time that FAH folds, thereby reducing the heat, but you probably don't want to use that option permanently if there's a better choice.

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Wed Nov 17, 2010 9:49 pm
by shunter
Thanks Bruce.
From what you say it appears to be hardware. Per GPU-Z it's an EVGA (3482) and it's the standard design running at default. Air exhausted is hot and, again per GPU-Z, it's running at 98 C with the fan at 100%. The system is also running a 6701 on the CPU core, so I assume that's adding to the heat issue. This will finish at 7.30am tomorrow, so I will set parameters to include oneunit here to give more scope if the GPU is still running later tonight. In the meantime I have used your FAH-GPU-IDLE suggestion and severely restricted the running time to see if I can get through this unit.

I'll look into MSI Afterburner. Probably a stupid question, but can it be used to tune down a card so that I can run it at a lower speed / temp and complete the unit that way?

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Wed Nov 17, 2010 11:26 pm
by bruce
I just checked and FAH_GPU_IDLE doesn't work yet with FahCore_15, though it will in some future version of the core. (And it will never work if you call it FAH-GPU-IDLE.)

Yes, MSI Afterburner can be used to underclock a GPU. 98C is too hot. It shouldn't melt the silicon, but you will get errors.

I'm surprised that EVGA has such poor cooling at a 100% fan setting.

My 460 is currently running a 6800 and it's different enough from the 6811 that you can't really compare performance, but it's at 82C with the default fan profile (currently at 56% fan). We can see if I snag a 6811 in a couple of hours.

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Thu Nov 18, 2010 10:22 am
by shunter
Sorry Bruce, that's my bad typing - I'm sure I set it up correctly but if it's not working then it's no use. Funnily enough after that the fan settled down to a steady rhythm and folding ran perfectly overnight. At 7.07 am the CPU core completed and closed down and then at 8.51 am the GPU crashed again (see log below) so am back to square 1.

I'll have to look into the EVGA running too hot, but as I've only been monitoring temp for the last day or so I don't know how hot it's been running on the 6800s, and I need to get inside the box. Just recovering from an eye op so that will have to wait until I can see more clearly - probably that and dyslexic fingers also explain the bad typing :D :D .

Thanks again
Dave

[08:51:23] Completed  33499999 out of 50000000 steps (67%).
[08:51:23] mdrun_gpu returned 52
[08:51:23] NANs detected on GPU
[08:51:23] 
[08:51:23] Folding@home Core Shutdown: UNSTABLE_MACHINE
[08:51:27] CoreStatus = 7A (122)
[08:51:27] Sending work to server
[08:51:27] Project: 6811 (Run 0, Clone 139, Gen 1)
[08:51:27] - Error: Could not get length of results file work/wuresults_02.dat
[08:51:27] - Error: Could not read unit 02 file. Removing from queue.
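
When comparing several failed attempts, it can help to pull the last reported progress and the failure marker out of each log automatically. The following is a minimal sketch in Python, assuming the log lines look like the excerpts in this thread. The function name `summarize` and the regular expression are illustrative and not part of any FAH tool.

```python
import re

# Matches lines such as:
# [08:51:23] Completed  33499999 out of 50000000 steps (67%).
PROGRESS = re.compile(r"Completed\s+(\d+) out of (\d+) steps\s+\((\d+)%\)")

def summarize(log_text):
    """Return (last_percent, failed) for a client log excerpt.

    last_percent is the last percentage reported, or None if none was seen.
    failed is True if the core reported NaNs on the GPU.
    """
    last_percent = None
    failed = False
    for line in log_text.splitlines():
        match = PROGRESS.search(line)
        if match:
            last_percent = int(match.group(3))
        if "NANs detected on GPU" in line:
            failed = True
    return last_percent, failed

sample = """\
[08:51:23] Completed  33499999 out of 50000000 steps (67%).
[08:51:23] mdrun_gpu returned 52
[08:51:23] NANs detected on GPU
"""

print(summarize(sample))  # (67, True)
```

Running this over each attempt's log gives a quick list of how far the unit got before failing, which can then be set against the temperature readings taken at the time.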

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Fri Nov 19, 2010 12:07 am
by HenryW
Yup, that's way too hot and more than likely the cause of failure on that WU. Downclocking will reduce temperature but folding time will also increase. I had a 250 GTS last year that was doing exactly the same as yours is and an added side case fan fixed it.
Does your case have a left side panel fan blowing on the VGA? If there's a place to install a fan in the side and none is there I would recommend you put one in blowing on the VGA. You should currently have at least a front 120mm intake and a rear 120mm exhaust in the case in addition to the PSU fan(s). My MSI 460 HAWK Twin Frozr II @ 875 core clock folding 6800 is 58C/58% fan speed. I just installed this card 2 hrs ago, and had an MSI 465 prior to this. Beginning with an MSI 260 OC Twin Frozr, then the 465 Twin Frozr II and now the HAWK Twin Frozr, none of them have gone over 60C when f@h. As far as I'm concerned that's the best cooler on a card for f@h and I'll never use anything else. There are replacement coolers you can get that look to be similar but not what one would call inexpensive.

Anyway try everything you can to get the temperature down and that should fix your stability problem. Another thing that could be a contributing factor is a weak PSU. Do you have any idea what brand, model and amps on the +12V rail(s) that it has?

Edit: I forgot to add you can get Afterburner here, http://downloads.guru3d.com/Videocards- ... g_c13.html

or here: http://www.msi.com/index.php?func=downl ... pe=utility

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Fri Nov 19, 2010 12:56 am
by bruce
CUDA allows developers to create code that is extremely efficient. If you're asking them to slow down everybody's processing (say by setting the default value of FAH_GPU_IDLE to 3% once they make it work correctly), I don't think that's a realistic request.

Many people are able to run effectively with the current settings. Currently my GTX460 is running a p6811 at stock settings. The default fan profile setting is 70% at 73C. Clock rates are at the default 675/1350/1800 and it shows 100% utilization. (In the past I've had it up to 825/1650/1800 without any errors.)

Perhaps there's a systematic difference between the cooling solutions provided by Asus/eVGA/Galaxy/Gigabyte/MSI/Palit/Sparkle/Zotac (etc?). I do know that the 768MB GDDR5 cards use less power than the 1GB GDDR5 cards. Perhaps it's the VRAM that's generating more heat or the fan that's not moving enough air or the heat-sink that's not doing its job.

If everybody is having the same problem, then maybe FAH should make their code less efficient. (They'd have to tell NVidia that CUDA is too efficient and that it's a generalized problem.) As long as it's a problem that you have and I don't, though, it's going to be something that the company that assembled your GPU board (or maybe just you) is going to have to address, not FAH.

I'd start with a RMA request to the board manufacturer saying the fan isn't providing adequate cooling when running FAH and point them to this topic for documentation.

We should probably start collecting data about brand X compared to brand Y and see if we can identify a systematic cooling problem, but there are enough other variables in any cooling discussion that the data probably won't show particularly clear trends.

EDIT:
I have been reminded of another factor that needs to be taken into account. There must be a good thermal fit between the heatsink and the GPU. Anybody who has worked with computers very much knows that the thermal grease must be high-quality, must have a very thin coat, and if you really want a good fit, maybe you need to lap the CPU and the heat-sink. Then the heatsink mounting must maintain a good fit so the heat gets transferred to the heatsink.

Some manufacturers use really cheap thermal grease, do not apply it sparingly, and certainly do no lapping. If you attempt to fix this sort of problem yourself, you will void your warranty . . . and you may or may not be successful. If you return the board to the manufacturer, they may do it for you or they may replace it with a different board that dissipates heat more effectively.

I have personally re-seated a factory installed CPU-heatsink that was originally installed with thermal compound more than 1mm / 0.04" thick. It's amazing how much cooler it ran after that.

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Fri Nov 19, 2010 10:23 am
by shunter
Henry / Bruce,
Thanks for all the info and advice given which I will be looking into when recovered from op enough to see what I'm doing in the box.

In the end I stopped CPU folding and ran a household fan through the side vents to keep temps down, but it still ran at between 92 & 96 C, which is far too high for future units, so I have temporarily removed advmethods from the flags until I can monitor temps and make changes. The unit eventually finished a few minutes ago and was submitted, so I am now running a 6800, and without any side fan the system is cruising along at 77 - 78 C even with the CPU folding a 6701, which seems ok to me.

Thanks again
Dave

Re: Project: 6811 (Run 0 Clone 139 Gen 1) fails due to nans

Posted: Fri Nov 19, 2010 3:16 pm
by kg4icg
I have a simple solution: manually adjust your fan speed to 60 percent or faster and watch the NaN problem fade away. Works for me.
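
As a rough illustration of what a custom fan profile with a minimum speed amounts to, here is a small Python sketch that interpolates a fan duty from temperature and never drops below a floor. Only the 70% at 73C point and the 60% floor come from this thread. The other curve points, the function name `fan_percent`, and the linear interpolation are assumptions made for the example, and this is not how MSI Afterburner or any driver is implemented.

```python
def fan_percent(temp_c, curve, floor=60):
    """Interpolate fan duty (%) from (temp_c, percent) points, never below floor.

    Assumes the curve has at least two points with distinct temperatures.
    """
    points = sorted(curve)
    if temp_c <= points[0][0]:
        duty = points[0][1]
    elif temp_c >= points[-1][0]:
        duty = points[-1][1]
    else:
        # Find the segment containing temp_c and interpolate linearly on it.
        for (t0, p0), (t1, p1) in zip(points, points[1:]):
            if t0 <= temp_c <= t1:
                duty = p0 + (p1 - p0) * (temp_c - t0) / (t1 - t0)
                break
    return max(floor, duty)

# Hypothetical curve; only the (73, 70) point is taken from the thread.
CURVE = [(40, 30), (73, 70), (90, 100)]

for temp in (50, 73, 95):
    print(temp, fan_percent(temp, CURVE))
```

With the floor in place the fan stays at 60% while the card is cool and only follows the curve once the curve asks for more than that.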