Project: 2652 (Run 0, Clone 573, Gen 37) [NaN error]

Posted: Wed Jan 02, 2008 11:07 pm
by Nonymoussurfer
Greetings all.

I got three consecutive EUEs on this one, though not always at the same spot. This machine is a Q6600 G0 running @ 3.2GHz on Abit IP35-V w/Hyper TX2 cpu cooler. Temps are < 60C on all cores. After the first EUE, I lowered from 3.3 to 3.2 GHz & bumped up Vcore a bit. Memory now @ 700MHz (5% OC from 667). This machine has run 2653s 24/7 for several weeks with no EUEs. I got greedy & loaded a 2nd SMP instance w/affinity changer & wouldn't you know it: the first WU it gets EUE's. FWIW, This also may be the first 2652 WU I've gotten since I started with SMP (on 3 machines) a few weeks ago. From what I've read on the forums, 2652 really stresses the CPU & memory. Perhaps, this is a hardware problem? Now it has two 2653s & seems happy. Here's the log. Let me know if there's anything you'd like me to check.

Regards

Code:

--- Opening Log file [January 2 14:33:43] 
# SMP Client ##################################################################
###############################################################################
                       Folding@Home Client Version 5.91beta5
                          http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: C:\Program Files\Folding@Home\SMP Client V1.01_2
Executable: C:\Program Files\Folding@Home\SMP Client V1.01_2\fah.exe
Arguments: -local -advmethods -verbosity 9 -oneunit 
[14:33:43] - Ask before connecting: No
[14:33:43] - User name: Christopher_Schweizer (Team 67221)
[14:33:43] - User ID: 57F28CD166E33145
[14:33:43] - Machine ID: 2
[14:33:43] 
[14:33:43] Work directory not found. Creating...
[14:33:43] Could not open work queue, generating new queue...
[14:33:43] - Autosending finished units...
[14:33:43] - Preparing to get new work unit...
[14:33:43] Trying to send all finished work units
[14:33:43] + Attempting to get work packet
[14:33:43] + No unsent completed units remaining.
[14:33:43] - Will indicate memory of 1024 MB
[14:33:43] - Autosend completed
[14:33:43] - Connecting to assignment server
[14:33:43] Connecting to http://assign.stanford.edu:8080/
[14:33:44] Posted data.
[14:33:44] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[14:33:44] + News From Folding@Home: Welcome to Folding@Home
[14:33:44] Loaded queue successfully.
[14:33:44] Connecting to http://171.64.65.64:8080/
[14:33:45] Posted data.
[14:33:45] Initial: 0000; - Receiving payload (expected size: 1147187)
[14:33:49] - Downloaded at ~280 kB/s
[14:33:49] - Averaged speed for that direction ~280 kB/s
[14:33:49] + Received work.
[14:33:49] + Closed connections
[14:33:49] 
[14:33:49] + Processing work unit
[14:33:49] Core required: FahCore_a1.exe
[14:33:49] Core not found.
[14:33:49] - Core is not present or corrupted.
[14:33:49] - Attempting to download new core...
[14:33:49] + Downloading new core: FahCore_a1.exe
[14:33:49] Downloading core (/~pande/Win32/x86//Core_a1.fah from http://www.stanford.edu)
[14:33:50] Initial: AFDE; + 10240 bytes downloaded
~~~
[14:33:53] Initial: D2E9; + 789667 bytes downloaded
[14:33:53] Verifying core Core_a1.fah...
[14:33:53] Signature is VALID
[14:33:53] 
[14:33:53] Trying to unzip core FahCore_a1.exe
[14:33:53] Decompressed FahCore_a1.exe (2035712 bytes) successfully
[14:33:53] + Core successfully engaged
[14:33:58] 
[14:33:58] + Processing work unit
[14:33:58] Core required: FahCore_a1.exe
[14:33:58] Core found.
[14:33:58] Working on Unit 01 [January 2 14:33:58]
[14:33:58] + Working ...
[14:33:58] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 5 -verbose -lifeline 4036 -version 591'
[14:33:58] 
[14:33:58] *------------------------------*
[14:33:58] Folding@Home Gromacs SMP Core
[14:33:58] Version 1.74 (March 10, 2007)
[14:33:58] 
[14:33:58] Preparing to commence simulation
[14:33:58] - Looking at optimizations...
[14:33:58] .
[14:33:58] - Starting from initial work packet
[14:33:58] 
[14:33:58] Project: 2652 (Run 0, Clone 573, Gen 37)
[14:33:58] 
[14:33:58] Assembly optimizations on if available.
[14:33:58] Entering M.D.
[14:34:16] al work pa- Starting from initial work packet
[14:34:16] 
[14:34:16] Project: 26Entering M.D.
[14:34:16] ne 573, Gen 37)
[14:34:16] 
[14:34:16] Entering M.D.
[14:34:22] Rejecting checkpoint
[14:34:22]  OK.
[14:34:22] in: Protein
[14:34:22] Writing local files
[14:34:23] Extra SSE boost OK.
[14:34:23] Writing local files
[14:34:23] Completed 0 out of 1000000 steps  (0 percent)
[14:40:16] Timered checkpoint triggered.
[14:46:16] Timered checkpoint triggered.
[14:49:45] Writing local files
[14:49:45] Completed 10000 out of 1000000 steps  (1 percent)
[14:55:45] Timered checkpoint triggered.
[15:01:45] Timered checkpoint triggered.
[15:01:49] Writing local files
[15:01:49] Completed 20000 out of 1000000 steps  (2 percent)
[15:07:45] Timered checkpoint triggered.
[15:13:45] Timered checkpoint triggered.
[15:13:54] Writing local files
[15:13:54] Completed 30000 out of 1000000 steps  (3 percent)
[15:19:45] Timered checkpoint triggered.
[15:25:45] Timered checkpoint triggered.
[15:25:58] Writing local files
[15:25:58] Completed 40000 out of 1000000 steps  (4 percent)
[15:31:45] Timered checkpoint triggered.
[15:37:45] Timered checkpoint triggered.
[15:38:00] Writing local files
[15:38:00] Completed 50000 out of 1000000 steps  (5 percent)
[15:43:45] Timered checkpoint triggered.
[15:49:45] Timered checkpoint triggered.
[15:50:05] Writing local files
[15:50:05] Completed 60000 out of 1000000 steps  (6 percent)
[15:55:45] Timered checkpoint triggered.
[16:01:22] Warning:  long 1-4 interactions
[16:01:23] Quit 101 - NaN detected: (ener[0])
[16:01:23] 
[16:01:23] Simulation instability has been encountered. The run has entered a
[16:01:23]   state from which no further progress can be made.
[16:01:23] This may be the correct result of the simulation, however if you
[16:01:23]   often see other project units terminating early like this
[16:01:23]   too, you may wish to check the stability of your computer (issues
[16:01:23]   such as high temperature, overclocking, etc.).
[16:01:23] Going to send back what have done.
[16:01:23] logfile size: 48806
[16:01:23] - Writing 49355 bytes of core data to disk...
[16:01:23]   ... Done.
[16:01:23] No C.P. to delete.
[16:01:23] - Failed to delete work/wudata_01.dyn
[16:01:23] - Failed to delete work/wudata_01.chk
[16:01:23] - Failed to delete work/wudata_01.xvg
[16:01:23] Warning:  check for stray files
[16:03:23] 
[16:03:23] Folding@home Core Shutdown: EARLY_UNIT_END
[16:03:23] 
[16:03:23] Folding@home Core Shutdown: EARLY_UNIT_END
[16:03:26] CoreStatus = 7B (123)
[16:03:26] Client-core communications error: ERROR 0x7b
[16:03:26] Deleting current work unit & continuing...
[16:05:30] - Warning: Could not delete all work unit files (1): Core returned invalid code
[16:05:30] Trying to send all finished work units
[16:05:30] + No unsent completed units remaining.
[16:05:30] - Preparing to get new work unit...
[16:05:30] + Attempting to get work packet
[16:05:30] - Will indicate memory of 1024 MB
[16:05:30] - Connecting to assignment server
[16:05:30] Connecting to http://assign.stanford.edu:8080/
[16:05:31] Posted data.
[16:05:31] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[16:05:31] + News From Folding@Home: Welcome to Folding@Home
[16:05:31] Loaded queue successfully.
[16:05:31] Connecting to http://171.64.65.64:8080/
[16:05:32] Posted data.
[16:05:32] Initial: 0000; - Receiving payload (expected size: 1147187)
[16:05:37] - Downloaded at ~224 kB/s
[16:05:37] - Averaged speed for that direction ~252 kB/s
[16:05:37] + Received work.
[16:05:37] + Closed connections
[16:05:42] 
[16:05:42] + Processing work unit
[16:05:42] Core required: FahCore_a1.exe
[16:05:42] Core found.
[16:05:42] Working on Unit 02 [January 2 16:05:42]
[16:05:42] + Working ...
[16:05:42] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 02 -checkpoint 5 -verbose -lifeline 4036 -version 591'
[16:05:43] 
[16:05:43] *------------------------------*
[16:05:43] Folding@Home Gromacs SMP Core
[16:05:43] Version 1.74 (March 10, 2007)
[16:05:43] 
[16:05:43] Preparing to commence simulation
[16:05:43] - Ensuring status. Please wait.
[16:05:44] - Starting from initial work packet
[16:05:44] 
[16:05:44] Project: 2652 (Run 0, Clone 573, Gen 37)
[16:05:44] 
[16:05:44] Assembly optimizations on if available.
[16:05:44] Entering M.D.
[16:06:01] al work packet
[16:06:01] 
[16:06:01] Project: 2652 (Run 0, Clone 573, Gen 37)
[16:06:01] 
[16:06:01] Entering M.D.
[16:06:02] ne 573, Gen 37)
[16:06:02] 
[16:06:02] g from initial work packet
[16:06:02] 
[16:06:02] Project: 2652 (Run 0, Clone 573, Gen 37)
[16:06:02] 
[16:06:02] Entering M.D.
[16:06:09] Protein: Protein
[16:06:09] Writing local files
[16:06:09] Extra SSE boost OK.
[16:12:01] oint triggered.
[16:17:54] Writing local files
[16:17:54] Completed 10000 out of 1000000 steps  (1 percent)
[16:23:53] Timered checkpoint triggered.
[16:27:05] Killing all core threads
[16:27:05] Killing SMP core threads
[16:27:05] Killing 3 cores
[16:27:05] Killing core 0
[16:27:05] Killing core 1
[16:27:05] Killing core 2
Folding@Home Client Shutdown at user request.
[16:27:05] ***** Got a SIGTERM signal (2)
[16:27:05] Killing all core threads
[16:27:05] Killing SMP core threads
[16:27:05] Killing 3 cores
[16:27:05] Killing core 0
[16:27:05] Killing core 1
[16:27:05] Killing core 2
Folding@Home Client Shutdown.
--- Opening Log file [January 2 16:34:42] 
# SMP Client ##################################################################
###############################################################################
                       Folding@Home Client Version 5.91beta5
                          http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: C:\Program Files\Folding@Home\SMP Client V1.01_2
Executable: C:\Program Files\Folding@Home\SMP Client V1.01_2\fah.exe
Arguments: -local -advmethods -verbosity 9 
[16:34:42] - Ask before connecting: No
[16:34:42] - User name: Christopher_Schweizer (Team 67221)
[16:34:42] - User ID: 57F28CD166E33145
[16:34:42] - Machine ID: 2
[16:34:42] 
[16:34:42] Loaded queue successfully.
[16:34:42] 
[16:34:42] - Autosending finished units...
[16:34:42] + Processing work unit
[16:34:42] Trying to send all finished work units
[16:34:42] Core required: FahCore_a1.exe
[16:34:42] + No unsent completed units remaining.
[16:34:42] - Autosend completed
[16:34:42] Core found.
[16:34:42] Working on Unit 02 [January 2 16:34:42]
[16:34:42] + Working ...
[16:34:42] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 02 -checkpoint 5 -verbose -lifeline 1180 -version 591'
[16:34:42] 
[16:34:42] *------------------------------*
[16:34:42] Folding@Home Gromacs SMP Core
[16:34:42] Version 1.74 (March 10, 2007)
[16:34:42] 
[16:34:42] Preparing to commence simulation
[16:34:42] - Ensuring status. Please wait.
[16:34:59] - Looking at optimizations...
[16:34:59] - Working with standard loops on this execution.
[16:35:00] - Previous termination of core was improper.
[16:35:00] - Going to use standard loops.
[16:35:00] - Files status OK
[16:35:02] - Expanded 1146675 -> 5811661 (decompressed 506.8 percent)
[16:35:02] 
[16:35:02] Project: 2652 (Run 0, Clone 573, Gen 37)
[16:35:02] 
[16:35:05] Entering M.D.
[16:35:12] Protein: Protein
[16:35:12] Writing local files
[16:35:12] ing from checkpoint)
[16:35:12] Read checkpoint
[16:35:12] Protein: Protein
[16:35:12] a SSE boost OK.
[16:35:12] les
[16:35:12] Completed 14936 out of 1000000 steps  (1 percent)
[16:35:12] Extra SSE boost OK.
[16:41:13] Timered checkpoint triggered.
[16:43:51] Writing local files
[16:43:51] Completed 20000 out of 1000000 steps  (2 percent)
[16:49:52] Timered checkpoint triggered.
[16:55:52] Timered checkpoint triggered.
[16:56:17] Writing local files
[16:56:17] Completed 30000 out of 1000000 steps  (3 percent)
[17:01:52] Timered checkpoint triggered.
[17:07:52] Timered checkpoint triggered.
[17:08:45] Writing local files
[17:08:45] Completed 40000 out of 1000000 steps  (4 percent)
[17:13:52] Timered checkpoint triggered.
[17:19:52] Timered checkpoint triggered.
[17:21:13] Writing local files
[17:21:13] Completed 50000 out of 1000000 steps  (5 percent)
[17:27:14] Timered checkpoint triggered.
[17:33:14] Timered checkpoint triggered.
[17:33:40] Writing local files
[17:33:40] Completed 60000 out of 1000000 steps  (6 percent)
[17:39:14] Timered checkpoint triggered.
[17:45:14] Timered checkpoint triggered.
[17:46:07] Writing local files
[17:46:07] Completed 70000 out of 1000000 steps  (7 percent)
[17:51:14] Timered checkpoint triggered.
[17:57:14] Timered checkpoint triggered.
[17:58:35] Writing local files
[17:58:35] Completed 80000 out of 1000000 steps  (8 percent)
[18:04:36] Timered checkpoint triggered.
[18:10:36] Timered checkpoint triggered.
[18:11:01] Writing local files
[18:11:01] Completed 90000 out of 1000000 steps  (9 percent)
[18:16:36] Timered checkpoint triggered.
[18:22:36] Timered checkpoint triggered.
[18:23:28] Writing local files
[18:23:28] Completed 100000 out of 1000000 steps  (10 percent)
[18:28:36] Timered checkpoint triggered.
[18:34:36] Timered checkpoint triggered.
[18:35:56] Writing local files
[18:35:56] Completed 110000 out of 1000000 steps  (11 percent)
[18:41:57] Timered checkpoint triggered.
[18:47:57] Timered checkpoint triggered.
[18:48:22] Writing local files
[18:48:22] Completed 120000 out of 1000000 steps  (12 percent)
[18:53:57] Timered checkpoint triggered.
[18:59:56] Timered checkpoint triggered.
[19:00:50] Writing local files
[19:00:50] Completed 130000 out of 1000000 steps  (13 percent)
[19:05:57] Timered checkpoint triggered.
[19:11:57] Timered checkpoint triggered.
[19:13:17] Writing local files
[19:13:17] Completed 140000 out of 1000000 steps  (14 percent)
[19:19:18] Timered checkpoint triggered.
[19:25:18] Timered checkpoint triggered.
[19:25:45] Writing local files
[19:25:45] Completed 150000 out of 1000000 steps  (15 percent)
[19:31:18] Timered checkpoint triggered.
[19:37:18] Timered checkpoint triggered.
[19:38:13] Writing local files
[19:38:13] Completed 160000 out of 1000000 steps  (16 percent)
[19:43:18] Timered checkpoint triggered.
[19:49:18] Timered checkpoint triggered.
[19:50:41] Writing local files
[19:50:41] Completed 170000 out of 1000000 steps  (17 percent)
[19:56:41] Timered checkpoint triggered.
[20:02:42] Timered checkpoint triggered.
[20:03:08] Writing local files
[20:03:08] Completed 180000 out of 1000000 steps  (18 percent)
[20:04:53] Warning:  long 1-4 interactions
[20:04:53] Quit 101 - NaN detected: (ener[20])
[20:04:53] 
[20:04:53] Simulation instability has been encountered. The run has entered a
[20:04:53]   state from which no further progress can be made.
[20:04:53] This may be the correct result of the simulation, however if you
[20:04:53]   often see other project units terminating early like this
[20:04:53]   too, you may wish to check the stability of your computer (issues
[20:04:53]   such as high temperature, overclocking, etc.).
[20:04:53] Going to send back what have done.
[20:04:53] logfile size: 124064
[20:04:53] - Writing 124614 bytes of core data to disk...
[20:04:53]   ... Done.
[20:06:53] 
[20:06:53] Folding@home Core Shutdown: EARLY_UNIT_END
[20:06:53] 
[20:06:53] Folding@home Core Shutdown: EARLY_UNIT_END
[20:06:57] CoreStatus = 7B (123)
[20:06:57] Client-core communications error: ERROR 0x7b
[20:06:57] Deleting current work unit & continuing...
[20:09:01] - Warning: Could not delete all work unit files (2): Core returned invalid code
[20:09:01] Trying to send all finished work units
[20:09:01] + No unsent completed units remaining.
[20:09:01] - Preparing to get new work unit...
[20:09:01] + Attempting to get work packet
[20:09:01] - Will indicate memory of 1024 MB
[20:09:01] - Connecting to assignment server
[20:09:01] Connecting to http://assign.stanford.edu:8080/
[20:09:01] Posted data.
[20:09:01] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[20:09:01] + News From Folding@Home: Welcome to Folding@Home
[20:09:01] Loaded queue successfully.
[20:09:01] Connecting to http://171.64.65.64:8080/
[20:09:03] Posted data.
[20:09:03] Initial: 0000; - Receiving payload (expected size: 1147187)
[20:09:07] - Downloaded at ~280 kB/s
[20:09:07] - Averaged speed for that direction ~261 kB/s
[20:09:07] + Received work.
[20:09:07] + Closed connections
[20:09:12] 
[20:09:12] + Processing work unit
[20:09:12] Core required: FahCore_a1.exe
[20:09:12] Core found.
[20:09:12] Working on Unit 03 [January 2 20:09:12]
[20:09:12] + Working ...
[20:09:12] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 03 -checkpoint 5 -verbose -lifeline 1180 -version 591'
[20:09:12] 
[20:09:12] *------------------------------*
[20:09:12] Folding@Home Gromacs SMP Core
[20:09:12] Version 1.74 (March 10, 2007)
[20:09:12] 
[20:09:12] Preparing to commence simulation
[20:09:12] - Ensuring status. Please wait.
[20:09:12] - Starting from initial work packet
[20:09:12] 
[20:09:12] Project: 2652 (Run 0, Clone 573, Gen 37)
[20:09:12] 
[20:09:12] Assembly optimizations on if available.
[20:09:12] Entering M.D.
[20:09:30] al work pa- Starting from initial work packet
[20:09:30] 
[20:09:30] Project: 26Entering M.D.
[20:09:30] ne 573, Gen 37)
[20:09:30] 
[20:09:30] Entering M.D.
[20:09:36] Rejecting checkpoint
[20:09:36]  OK.
[20:09:36] in: Protein
[20:09:36] Writing local files
[20:09:36] Extra SSE boost OK.
[20:09:37] Writing local files
[20:09:37] Completed 0 out of 1000000 steps  (0 percent)
[20:15:30] Timered checkpoint triggered.
[20:21:30] Timered checkpoint triggered.
[20:21:57] Writing local files
[20:21:57] Completed 10000 out of 1000000 steps  (1 percent)
[20:27:30] Timered checkpoint triggered.
[20:33:30] Timered checkpoint triggered.
[20:34:18] Writing local files
[20:34:18] Completed 20000 out of 1000000 steps  (2 percent)
[20:39:30] Timered checkpoint triggered.
[20:45:30] Timered checkpoint triggered.
[20:46:40] Writing local files
[20:46:40] Completed 30000 out of 1000000 steps  (3 percent)
[20:52:40] Timered checkpoint triggered.
[20:58:40] Timered checkpoint triggered.
[20:59:01] Writing local files
[20:59:01] Completed 40000 out of 1000000 steps  (4 percent)
[21:04:40] Timered checkpoint triggered.
[21:10:40] Timered checkpoint triggered.
[21:11:21] Writing local files
[21:11:21] Completed 50000 out of 1000000 steps  (5 percent)
[21:16:40] Timered checkpoint triggered.
[21:22:40] Timered checkpoint triggered.
[21:23:43] Writing local files
[21:23:43] Completed 60000 out of 1000000 steps  (6 percent)
[21:29:42] Timered checkpoint triggered.
[21:35:04] Warning:  long 1-4 interactions
[21:35:04] Quit 101 - NaN detected: (ener[0])
[21:35:04] 
[21:35:04] Simulation instability has been encountered. The run has entered a
[21:35:04]   state from which no further progress can be made.
[21:35:04] This may be the correct result of the simulation, however if you
[21:35:04]   often see other project units terminating early like this
[21:35:04]   too, you may wish to check the stability of your computer (issues
[21:35:04]   such as high temperature, overclocking, etc.).
[21:35:04] Going to send back what have done.
[21:35:04] logfile size: 48806
[21:35:04] - Writing 49355 bytes of core data to disk...
[21:35:04]   ... Done.
[21:35:04] - Failed to delete work/wudata_03.sas
[21:35:04] - Failed to delete work/wudata_03.goe
[21:35:04] Warning:  check for stray files
[21:35:04] 
[21:35:04] Folding@home Core Shutdown: EARLY_UNIT_END
[21:35:04] 
[21:35:04] Folding@home Core Shutdown: EARLY_UNIT_END
[21:35:08] CoreStatus = 7B (123)
[21:35:08] Client-core communications error: ERROR 0x7b
[21:35:08] Deleting current work unit & continuing...
[21:37:12] - Warning: Could not delete all work unit files (3): Core returned invalid code
[21:37:12] Trying to send all finished work units
[21:37:12] + No unsent completed units remaining.
[21:37:12] - Preparing to get new work unit...
[21:37:12] + Attempting to get work packet
[21:37:12] - Will indicate memory of 1024 MB
[21:37:12] - Connecting to assignment server
[21:37:12] Connecting to http://assign.stanford.edu:8080/
[21:37:13] Posted data.
[21:37:13] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[21:37:13] + News From Folding@Home: Welcome to Folding@Home
[21:37:13] Loaded queue successfully.
[21:37:13] Connecting to http://171.64.65.64:8080/
[21:37:13] Posted data.
[21:37:13] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[21:37:13] - Error: Attempt #1  to get work failed, and no other work to do.
             Waiting before retry.
[21:37:31] + Attempting to get work packet
[21:37:31] - Will indicate memory of 1024 MB
[21:37:31] - Connecting to assignment server
[21:37:31] Connecting to http://assign.stanford.edu:8080/
[21:37:31] Posted data.
[21:37:31] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[21:37:31] + News From Folding@Home: Welcome to Folding@Home
[21:37:32] Loaded queue successfully.
[21:37:32] Connecting to http://171.64.65.64:8080/
[21:37:35] Posted data.
[21:37:35] Initial: 0000; - Receiving payload (expected size: 2963890)
[21:37:45] - Downloaded at ~289 kB/s
[21:37:45] - Averaged speed for that direction ~268 kB/s
[21:37:45] + Received work.
[21:37:45] + Closed connections
[21:37:50] 
[21:37:50] + Processing work unit
[21:37:50] Core required: FahCore_a1.exe
[21:37:50] Core found.
[21:37:50] Working on Unit 04 [January 2 21:37:50]
[21:37:50] + Working ...
[21:37:50] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 04 -checkpoint 5 -verbose -lifeline 1180 -version 591'
[21:37:50] 
[21:37:50] *------------------------------*
[21:37:50] Folding@Home Gromacs SMP Core
[21:37:50] Version 1.74 (March 10, 2007)
[21:37:50] 
[21:37:50] Preparing to commence simulation
[21:37:50] - Ensuring status. Please wait.
[21:37:52] - Starting from initial work packet
[21:37:52] 
[21:37:52] Project: 2653 (Run 20, Clone 137, Gen 34)
[21:37:52] 
[21:37:52] Assembly optimizations on if available.
[21:37:52] Entering M.D.
[21:38:11] l work packet
[21:38:11] 
[21:38:11] Project: 2653 (Run 20, Clone 137, Gen 34)
[21:38:11] 
[21:38:12] 3 (Run 20, Clone 137, Gen 34)
[21:38:12] 
[21:38:12] Entering M.D.
[21:38:18] Rejecting checkpoint
[21:38:19] Protein: Protein in POPC
[21:38:19] Writing local files
[21:38:20] Extra SSE boost OK.
[21:38:20] Writing local files
[21:38:20] Completed 0 out of 500000 steps  (0 percent)
[21:43:20] Timered checkpoint triggered.
[21:48:20] Timered checkpoint triggered.
[21:50:41] Writing local files
[21:50:41] Completed 5000 out of 500000 steps  (1 percent)
[21:55:41] Timered checkpoint triggered.
[22:00:41] Timered checkpoint triggered.
[22:03:08] Writing local files
[22:03:08] Completed 10000 out of 500000 steps  (2 percent)
[22:08:07] Timered checkpoint triggered.
[22:13:07] Timered checkpoint triggered.
[22:15:33] Writing local files
[22:15:33] Completed 15000 out of 500000 steps  (3 percent)
[22:20:33] Timered checkpoint triggered.
[22:25:33] Timered checkpoint triggered.
[22:27:58] Writing local files
[22:27:58] Completed 20000 out of 500000 steps  (4 percent)
[22:32:57] Timered checkpoint triggered.
[22:34:42] - Autosending finished units...
[22:34:42] Trying to send all finished work units
[22:34:42] + No unsent completed units remaining.
[22:34:42] - Autosend completed
[22:37:57] Timered checkpoint triggered.
[22:40:23] Writing local files
[22:40:24] Completed 25000 out of 500000 steps  (5 percent)

Re: 2652 (Run 0, Clone 573, Gen 37)

Posted: Thu Jan 03, 2008 5:46 am
by 7im
Repetitive NaN errors have always turned out to be hardware related in the past: too much overclocking, temps that are too high, bad memory or memory timings set too aggressively, a hard drive going bad, etc. The fact that the WU didn't error out in the same place also points to a cause other than a bad WU.
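As a rough way to check whether the failures landed at the same point, the sketch below scans a client log for each `Quit 101 - NaN detected` line and records the last `Completed N out of M steps` line seen before it. The function name `failure_points` and the `sample` list are illustrative only; the sample lines are copied from the log posted above. Progress is only logged every 1 percent, so this is a coarse comparison at best.

```python
import re

# Patterns taken from the lines that appear in the posted client log.
STEP_RE = re.compile(r"Completed (\d+) out of (\d+) steps")
NAN_RE = re.compile(r"Quit 101 - NaN detected")


def failure_points(log_lines):
    """Return the last reported step count before each NaN quit.

    An entry is None if no progress line was seen before that quit.
    """
    last_step = None
    points = []
    for line in log_lines:
        match = STEP_RE.search(line)
        if match:
            last_step = int(match.group(1))
        elif NAN_RE.search(line):
            points.append(last_step)
            last_step = None  # start fresh for the next attempt
    return points


# Lines copied from the log in the first post (hypothetical variable name).
sample = [
    "[15:50:05] Completed 60000 out of 1000000 steps  (6 percent)",
    "[16:01:23] Quit 101 - NaN detected: (ener[0])",
    "[20:03:08] Completed 180000 out of 1000000 steps  (18 percent)",
    "[20:04:53] Quit 101 - NaN detected: (ener[20])",
    "[21:23:43] Completed 60000 out of 1000000 steps  (6 percent)",
    "[21:35:04] Quit 101 - NaN detected: (ener[0])",
]

print(failure_points(sample))  # [60000, 180000, 60000]
```

To run it against a real log, pass the lines of the client's log file instead of `sample`, for example `failure_points(open(path))` with `path` pointing at your own log.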

Re: Project: 2652 (Run 0, Clone 573, Gen 37) [NaN error]

Posted: Fri Jan 04, 2008 2:54 am
by Cajun_Don
I got greedy & loaded a 2nd SMP instance w/affinity changer & wouldn't you know it: the first WU it gets EUE's.


You answered your own problem. You got greedy and loaded two instances of the SMP client. You are hurting the science by doing that, and it is not helping you or the Folding Project.

Re: Project: 2652 (Run 0, Clone 573, Gen 37) [NaN error]

Posted: Sun Jan 06, 2008 6:42 pm
by Nonymoussurfer
Thanks Don. I appreciate your comments although I respectfully disagree.

There are really two separate issues here. I can accept that my PC config (i.e. overclocked RAM or CPU) may have caused this problem. I don't accept, however, that it's because of running two SMP clients. I can (and did) back off the overclock, and I continue to monitor to verify stability. I would like to point out that there are tons of reports of 2652 EUEs, many on OC'd dual/quad core machines, many on non-OC'd dual/quad machines. My machine has been running two SMP instances 24/7 (getting strictly 2653s) without incident since I posted about the 2652 EUE. Also I should note that I have not received another 2652 either. I checked my stats & only one 2652 has been reported. I believe my one 2652 is the partial result returned from the EUE'd unit.

I understand Stanford's policy is that you should run only one SMP client per 4 processor cores. However, I fail to see how it hurts the science if the preferred deadlines are met. Running two SMP clients, this machine finishes one 2653 approx every 19-20 hours. My X3210 w/two SMPs finishes one 2653 approx every 24 hours (or if one SMP every 17 hrs). Given the preferred deadline is 3 days, how is it better for science to run one SMP client that returns one 2653 approx every 15 hours, than two SMP clients that return two 2653s every 20 hours? Seems to me that if you can meet the preferred deadline with 2 days to spare, the argument is irrelevant. My E6300 dual core machine also gets 2653s & finishes in 19-20 hours. If you run Affinity Changer on a Q6600, you essentially turn it into two E6600s for Folding purposes (at least the frame times are comparable), excepting a few hiccups such as starting/stopping the client. So what's the difference?
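The comparison in the paragraph above mixes two different measures, which a short sketch can separate: how many work units come back per day (throughput) and how long each individual unit takes to come back (latency). The function name `throughput_and_latency` is made up for illustration, and the 15-hour and 20-hour figures are the approximate ones quoted in this post, not measurements.

```python
def throughput_and_latency(units_per_batch, hours_per_batch):
    """Return (work units returned per day, hours until each unit is returned)."""
    per_day = units_per_batch * 24 / hours_per_batch
    return per_day, hours_per_batch


# One SMP client: one 2653 roughly every 15 hours (figure from the post).
one_client = throughput_and_latency(1, 15)
# Two SMP clients: two 2653s roughly every 20 hours (figure from the post).
two_clients = throughput_and_latency(2, 20)

print(one_client)   # (1.6, 15)
print(two_clients)  # (2.4, 20)
```

On these numbers, two clients return more units per day, but each unit comes back about five hours later than it would from a single client.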
You are hurting the science by doing that, and it is not helping you, or the Folding Project.
I've seen this particular statement many times in various threads on this forum, but I have not seen a good explanation of why that is true if your hardware can meet the preferred deadlines (with a little pad).

Policy is typically written with a broad swath (it has to be). If Stanford were to prohibit dual core machines from running SMP so that only quad core CPUs were running SMP, then I agree that my quad running dual SMP clients would then be the slowest of the bunch & would delay the use of the completed dataset until the slowpoke returns its work. Since there are Pentium Ds & Core 2 Duos running SMP, that is simply not the case.

I suspect the reason for Stanford's position on this subject is to try & help the project by discouraging us from running SMP on old machines with insufficient hardware that would run too close to the deadline (i.e. P4 w/HT). If the machine takes 60 hours to finish the WU & a server glitch prevents it from returning the work by a half day, you are likely to miss the preferred deadline. In this case I agree it hurts the science.

Perhaps there is some unintended interaction between two SMP clients that increases the likelihood of failure? This I could accept. Can someone kindly point me to the discussion thread on this subject since it has probably already been discussed ad nauseam. Probably on the old forum...

I'm not saying there isn't a good basis for this policy, but I haven't seen anyone challenge it for an explanation (besides the 10,000 other people running two SMP clients)... Is anyone able to provide a detailed explanation of why it is "bad for the science" to run two SMP instances on a quad? I want to do the right thing, but I'm not convinced I'm wrong. Provide a valid explanation and I will happily shut down the 2nd SMP instance on my quad rigs.

Re: Project: 2652 (Run 0, Clone 573, Gen 37) [NaN error]

Posted: Sun Jan 06, 2008 11:21 pm
by bruce
First, a suggestion. If you'd like help in finding out what happened with the 2652, it's helpful to know what name you use for folding. You might want to explain that in your signature since "Nonymoussurfer" does not appear in the stats and there's nothing else mentioned in your post.
Nonymoussurfer wrote:Thanks Don. I appreciate your comments although I respectfully disagree.
. . .
I understand Stanford's policy is that you should run only one SMP client per 4 processor cores. However, I fail to see how it hurts the science if the preferred deadlines are met. . . .
I wouldn't call it "respectfully disagreeing" to claim that you know more about the needs of Science than Vijay Pande, the Director of the FAH project, who originally expressed his concerns about running two clients on a HyperThreaded P4 computer. Yes, this topic has been repeatedly discussed at length in the old forum and we won't be repeating all of that discussion over and over and over. Please do not reply, but consider the following:

Yes, getting too close to the deadline is something to be concerned about, but when returning WUs, faster is always better than slower. If a project requires 300 generations and the preferred deadline is 3 days, using your philosophy, it's fine if the project takes about 30 months to finish (considering that some WUs will be lost and need to be reassigned). If 90% of the people actually return the results in 1 day, the project will be finished in about 12 months. Would you rather pay a scientist to sit and wait for 30 months before starting to write up his results, or pay him to write his conclusions after only a 12-month wait?
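The arithmetic behind those two figures can be sketched as follows, assuming (as the argument implies) that the 300 generations run one after another, so the project length is the number of generations times the average return time. The 30-day month and the 90%/10% weighting of 1-day and 3-day returns are assumptions made here to show one reading that lands near the quoted 30 and 12 months; they are not a calculation stated in the post.

```python
DAYS_PER_MONTH = 30  # rough figure, an assumption for this sketch


def project_months(generations, days_per_generation):
    """Months needed when each generation must be returned before the next starts."""
    return generations * days_per_generation / DAYS_PER_MONTH


# Everyone takes the full 3-day preferred deadline.
slow = project_months(300, 3)

# Assumed mix: 90% of results return in 1 day, the remaining 10% in 3 days.
mixed_days = 0.9 * 1 + 0.1 * 3
fast = project_months(300, mixed_days)

print(round(slow, 1))  # 30.0
print(round(fast, 1))  # 12.0
```

Lost and reassigned WUs would push both numbers up, which this sketch leaves out.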

I suspect that this may be the WU you were asking about. It received full credit and was not an EUE.
Hi Christopher_Schweizer (team 67221),
Your WU (P2652 R0 C251 G34) was added to the stats database on 2007-11-21 12:21:17 for 1148 points of credit.

Re: Project: 2652 (Run 0, Clone 573, Gen 37) [NaN error]

Posted: Mon Jan 07, 2008 2:57 pm
by Nonymoussurfer
I apologize if I came off the wrong way, but the forum does not have a thread addressing this subject. I prefer trying to find answers for myself to avoid looking like a complete noob but didn't find a good answer to this. It's all well and good that this was discussed to death on the old forum, but that info is no longer available. Since I couldn't find a thread on the new forum, I thought I'd stick my neck out and raise the topic for discussion. I figure that since I am questioning this (being fairly new to SMP), there are probably other people with the same concerns. It feels like I am getting berated for asking a question.

Folks closer to the project (and the science behind it) probably have a lot of knowledge that is obvious to them, but may not be obvious to others. I have made no pretense of offering anything new or unique to the cause and certainly don't claim to be a folding expert, far from it... I absolutely do not want to act against the interest of the project or the wishes of Dr. Pande. Like many others, I have been personally affected by Alzheimer's. When struggling with the realities of the disease, Folding was the only thing I found that I could do about it. To me Folding is an investment in my future, since I may not know who the hell I am in 20 years unless science/modern medicine can come up with an effective way to treat it. My goal is to maximize my folding contribution. You provided a reasonable explanation, and I will not spend any more time trying to find fault with it.

You got me. I am Christopher Schweizer. Please don't ruin my credit.

Re: Project: 2652 (Run 0, Clone 573, Gen 37) [NaN error]

Posted: Mon Jan 07, 2008 11:17 pm
by bruce
Nonymoussurfer wrote:It feels like I am getting berated for asking a question.
I'm sorry if I came on so strong. You're certainly entitled to an answer. I'm a bit sensitive to this issue because we had a number of discussions of the same issues and many people brought a bit of "attitude" that must have rubbed off on me. That doesn't change the facts, so accept my apology for berating you.
My goal is to maximize my folding contribution. You provided a reasonable explanation, and I will not spend any more time trying to find fault with it.
If there's fault to be found, it's with the points system which doesn't have a way to align the number of points that are awarded with the scientific value.
You got me. I am Christopher Schweizer. Please don't ruin my credit.
I couldn't -- and wouldn't. I was just giving you the information you asked for.

Re: Project: 2652 (Run 0, Clone 573, Gen 37) [NaN error]

Posted: Tue Jan 08, 2008 3:03 am
by Nonymoussurfer
No apology necessary. You folks no doubt answer the same questions over and over and over again... I can see how that could tend to put one on edge. Thanks for taking time to try & answer everyone's random questions every day.
Please don't ruin my credit.
I couldn't -- and wouldn't. I was just giving you the information you asked for.
And I was just kidding.