Page 1 of 1

Project: 3867 (Run 1685, Clone 0, Gen 3) Early Unit End

Posted: Sat Jun 18, 2011 10:59 pm
by karlinhigh
Project: 3867 (Run 1685, Clone 0, Gen 3)

Early unit end at 66%, after host machine restarted for Windows updates; I've read that core A6 doesn't like that. But it looks like it did run for 13 seconds after restarting. The host machine is a Dell Latitude D620 laptop, Core 2 Duo T5500 1.66 GHz, 1 GB RAM, Windows XP Pro SP3. It's finished A6 work units before, so I know it can handle them.

PS - I noticed my machine also ran Gen 1 and Gen 2 of this project/run/clone; I was feeling good about it doing the whole thing by itself. But then this happened...

Code: Select all

# Windows CPU Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.23

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Documents and Settings\user\FAH\1
Executable: fah6
Arguments: -verbosity 9 

[18:08:21] - Ask before connecting: No
[18:08:21] - User name: desktopsvcs.com (Team 35054)
[18:08:21] - User ID: 7ACBEEAB1C1F35DA
[18:08:21] - Machine ID: 1

### Skip to relevant section ###

[18:25:41] - Preparing to get new work unit...
[18:25:41] + Attempting to get work packet
[18:25:41] - Will indicate memory of 1014 MB
[18:25:41] - Connecting to assignment server
[18:25:41] Connecting to http://assign.stanford.edu:8080/
[18:25:42] Posted data.
[18:25:42] Initial: 8F80; - Successful: assigned to (128.143.48.226).
[18:25:42] + News From Folding@Home: Welcome to Folding@Home
[18:25:42] Loaded queue successfully.
[18:25:42] Connecting to http://128.143.48.226:8080/
[18:25:42] Posted data.
[18:25:42] Initial: 0000; - Receiving payload (expected size: 1489009)
[18:25:45] - Downloaded at ~484 kB/s
[18:25:45] - Averaged speed for that direction ~253 kB/s
[18:25:45] + Received work.
[18:25:45] Trying to send all finished work units
[18:25:45] + No unsent completed units remaining.
[18:25:45] + Closed connections
[18:25:45] 
[18:25:45] + Processing work unit
[18:25:45] Core required: FahCore_a6.exe
[18:25:45] Core found.
[18:25:45] Working on queue slot 00 [June 15 18:25:45 UTC]
[18:25:45] + Working ...
[18:25:45] - Calling '.\FahCore_a6.exe -dir work/ -suffix 00 -nocpulock -checkpoint 15 -verbose -lifeline 3800 -version 623'

[18:25:45] 
[18:25:45] *------------------------------*
[18:25:45] Folding@Home Gromacs Core
[18:25:45] Version 2.28 (Wed Mar 23 13:51:17 PDT 2011)
[18:25:45] 
[18:25:45] Preparing to commence simulation
[18:25:45] - Looking at optimizations...
[18:25:45] - Created dyn
[18:25:45] - Files status OK
[18:25:45] - Expanded 1488497 -> 2412340 (decompressed 162.0 percent)
[18:25:45] Called DecompressByteArray: compressed_data_size=1488497 data_size=2412340, decompressed_data_size=2412340 diff=0
[18:25:45] - Digital signature verified
[18:25:45] 
[18:25:45] Project: 3867 (Run 1685, Clone 0, Gen 3)
[18:25:45] 
[18:25:45] Assembly optimizations on if available.
[18:25:45] Entering M.D.
[18:25:51] Mapping NT from 1 to 1 
[18:25:53] Completed 0 out of 250000 steps  (0%)
[18:58:57] Completed 2500 out of 250000 steps  (1%)
[19:32:00] Completed 5000 out of 250000 steps  (2%)
[20:05:04] Completed 7500 out of 250000 steps  (3%)
[20:17:59] - Autosending finished units... [June 15 20:17:59 UTC]
[20:17:59] Trying to send all finished work units
[20:17:59] + No unsent completed units remaining.
[20:17:59] - Autosend completed
[20:38:07] Completed 10000 out of 250000 steps  (4%)
[21:11:15] Completed 12500 out of 250000 steps  (5%)
[21:44:38] Completed 15000 out of 250000 steps  (6%)
[22:18:58] Completed 17500 out of 250000 steps  (7%)
[22:54:05] Completed 20000 out of 250000 steps  (8%)
[23:28:08] Completed 22500 out of 250000 steps  (9%)
[00:02:53] Completed 25000 out of 250000 steps  (10%)
[00:38:03] Completed 27500 out of 250000 steps  (11%)
[01:11:05] Completed 30000 out of 250000 steps  (12%)
[01:44:15] Completed 32500 out of 250000 steps  (13%)
[02:17:59] - Autosending finished units... [June 16 02:17:59 UTC]
[02:17:59] Trying to send all finished work units
[02:17:59] + No unsent completed units remaining.
[02:17:59] - Autosend completed
[02:19:16] Completed 35000 out of 250000 steps  (14%)
[02:53:25] Completed 37500 out of 250000 steps  (15%)
[03:26:39] Completed 40000 out of 250000 steps  (16%)
[03:59:40] Completed 42500 out of 250000 steps  (17%)
[04:32:37] Completed 45000 out of 250000 steps  (18%)
[05:05:37] Completed 47500 out of 250000 steps  (19%)
[05:38:37] Completed 50000 out of 250000 steps  (20%)
[06:11:38] Completed 52500 out of 250000 steps  (21%)
[06:45:25] Completed 55000 out of 250000 steps  (22%)
[07:20:43] Completed 57500 out of 250000 steps  (23%)
[07:55:48] Completed 60000 out of 250000 steps  (24%)
[08:17:59] - Autosending finished units... [June 16 08:17:59 UTC]
[08:17:59] Trying to send all finished work units
[08:17:59] + No unsent completed units remaining.
[08:17:59] - Autosend completed
[08:30:43] Completed 62500 out of 250000 steps  (25%)
[09:05:42] Completed 65000 out of 250000 steps  (26%)
[09:39:24] Completed 67500 out of 250000 steps  (27%)
[10:13:36] Completed 70000 out of 250000 steps  (28%)
[10:47:02] Completed 72500 out of 250000 steps  (29%)
[11:20:03] Completed 75000 out of 250000 steps  (30%)
[11:53:05] Completed 77500 out of 250000 steps  (31%)
[12:26:20] Completed 80000 out of 250000 steps  (32%)
[13:01:13] Completed 82500 out of 250000 steps  (33%)
[13:36:20] Completed 85000 out of 250000 steps  (34%)
[14:10:45] Completed 87500 out of 250000 steps  (35%)
[14:17:59] - Autosending finished units... [June 16 14:17:59 UTC]
[14:17:59] Trying to send all finished work units
[14:17:59] + No unsent completed units remaining.
[14:17:59] - Autosend completed
[14:43:47] Completed 90000 out of 250000 steps  (36%)
[15:16:57] Completed 92500 out of 250000 steps  (37%)
[15:50:26] Completed 95000 out of 250000 steps  (38%)
[16:23:34] Completed 97500 out of 250000 steps  (39%)
[16:56:40] Completed 100000 out of 250000 steps  (40%)
[17:29:45] Completed 102500 out of 250000 steps  (41%)
[18:02:52] Completed 105000 out of 250000 steps  (42%)
[18:35:59] Completed 107500 out of 250000 steps  (43%)
[19:09:10] Completed 110000 out of 250000 steps  (44%)
[19:43:26] Completed 112500 out of 250000 steps  (45%)
[20:17:59] - Autosending finished units... [June 16 20:17:59 UTC]
[20:17:59] Trying to send all finished work units
[20:17:59] + No unsent completed units remaining.
[20:17:59] - Autosend completed
[20:18:17] Completed 115000 out of 250000 steps  (46%)
[20:52:34] Completed 117500 out of 250000 steps  (47%)
[21:27:29] Completed 120000 out of 250000 steps  (48%)
[22:01:42] Completed 122500 out of 250000 steps  (49%)
[22:36:56] Completed 125000 out of 250000 steps  (50%)
[23:11:45] Completed 127500 out of 250000 steps  (51%)
[23:46:44] Completed 130000 out of 250000 steps  (52%)
[00:21:48] Completed 132500 out of 250000 steps  (53%)
[00:56:32] Completed 135000 out of 250000 steps  (54%)
[01:31:29] Completed 137500 out of 250000 steps  (55%)
[02:06:25] Completed 140000 out of 250000 steps  (56%)
[02:18:13] - Autosending finished units... [June 17 02:18:13 UTC]
[02:18:13] Trying to send all finished work units
[02:18:13] + No unsent completed units remaining.
[02:18:14] - Autosend completed
[02:41:30] Completed 142500 out of 250000 steps  (57%)
[03:16:52] Completed 145000 out of 250000 steps  (58%)
[03:52:14] Completed 147500 out of 250000 steps  (59%)
[04:27:29] Completed 150000 out of 250000 steps  (60%)
[05:02:22] Completed 152500 out of 250000 steps  (61%)
[05:37:21] Completed 155000 out of 250000 steps  (62%)
[06:12:16] Completed 157500 out of 250000 steps  (63%)
[06:46:54] Completed 160000 out of 250000 steps  (64%)
[07:21:43] Completed 162500 out of 250000 steps  (65%)
[07:56:30] Completed 165000 out of 250000 steps  (66%)
[08:18:14] - Autosending finished units... [June 17 08:18:14 UTC]
[08:18:14] Trying to send all finished work units
[08:18:14] + No unsent completed units remaining.
[08:18:14] - Autosend completed
[08:19:02] ***** Windows logoff
[08:19:02] ***** Got a SIGTERM signal (2)
[08:19:02] Killing all core threads

Folding@Home Client Shutdown.



--- Opening Log file [June 17 18:08:21 UTC] 


# Windows CPU Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.23

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Documents and Settings\user\FAH\1
Executable: fah6
Arguments: -verbosity 9 

[18:08:21] - Ask before connecting: No
[18:08:21] - User name: desktopsvcs.com (Team 35054)
[18:08:21] - User ID: 7ACBEEAB1C1F35DA
[18:08:21] - Machine ID: 1
[18:08:21] 
[18:08:22] Loaded queue successfully.
[18:08:22] 
[18:08:22] - Autosending finished units... [June 17 18:08:22 UTC]
[18:08:22] + Processing work unit
[18:08:22] Trying to send all finished work units
[18:08:22] Core required: FahCore_a6.exe
[18:08:22] + No unsent completed units remaining.
[18:08:22] Core found.
[18:08:22] - Autosend completed
[18:08:22] Working on queue slot 00 [June 17 18:08:22 UTC]
[18:08:22] + Working ...
[18:08:22] - Calling '.\FahCore_a6.exe -dir work/ -suffix 00 -nocpulock -checkpoint 15 -verbose -lifeline 2892 -version 623'

[18:08:22] 
[18:08:22] *------------------------------*
[18:08:22] Folding@Home Gromacs Core
[18:08:22] Version 2.28 (Wed Mar 23 13:51:17 PDT 2011)
[18:08:22] 
[18:08:22] Preparing to commence simulation
[18:08:22] - Ensuring status. Please wait.
[18:08:31] - Looking at optimizations...
[18:08:31] - Working with standard loops on this execution.
[18:08:31] - Previous termination of core was improper.
[18:08:31] - Files status OK
[18:08:32] - Expanded 1488497 -> 2412340 (decompressed 162.0 percent)
[18:08:32] Called DecompressByteArray: compressed_data_size=1488497 data_size=2412340, decompressed_data_size=2412340 diff=0
[18:08:32] - Digital signature verified
[18:08:32] 
[18:08:32] Project: 3867 (Run 1685, Clone 0, Gen 3)
[18:08:32] 
[18:08:32] Entering M.D.
[18:08:38] Using Gromacs checkpoints
[18:08:39] Mapping NT from 1 to 1 
[18:08:40] Resuming from checkpoint
[18:08:40] Verified work/wudata_00.log
[18:08:40] Verified work/wudata_00.trr
[18:08:40] Verified work/wudata_00.xtc
[18:08:40] Verified work/wudata_00.edr
[18:08:42] Completed 165940 out of 250000 steps  (66%)
[18:08:55] mdrun returned 255
[18:08:55] Going to send back what have done -- stepsTotalG=250000
[18:08:55] Work fraction=0.6638 steps=250000.
[18:08:59] logfile size=106517 infoLength=106517 edr=0 trr=25
[18:08:59] logfile size: 106517 info=106517 bed=0 hdr=25
[18:08:59] - Writing 107055 bytes of core data to disk...
[18:09:00] Done: 106543 -> 15037 (compressed to 14.1 percent)
[18:09:00]   ... Done.
[18:09:01] 
[18:09:01] Folding@home Core Shutdown: EARLY_UNIT_END
[18:09:04] CoreStatus = 72 (114)
[18:09:04] Sending work to server
[18:09:04] Project: 3867 (Run 1685, Clone 0, Gen 3)


[18:09:04] + Attempting to send results [June 17 18:09:04 UTC]
[18:09:04] - Reading file work/wuresults_00.dat from core
[18:09:04]   (Read 15549 bytes from disk)
[18:09:04] Connecting to http://128.143.48.226:8080/
[18:09:04] Posted data.
[18:09:04] Initial: 0000; Conversation time very short, giving reduced weight in bandwidth avg
[18:09:04] - Uploaded at ~32 kB/s
[18:09:04] - Averaged speed for that direction ~153 kB/s
[18:09:04] + Results successfully sent
[18:09:04] Thank you for your contribution to Folding@Home.

Re: Project: 3867 (Run 1685, Clone 0, Gen 3) Early Unit End

Posted: Sun Jun 19, 2011 12:12 am
by GreyWhiskers
How did you shut the v7 FAHControl GUI? Did you "carefully" pause the slots, then exit the GUI?

Anecdotally, I've been reporting on an instance where I got similar results when I "carefully" paused the core, then exited the GUI on a Project: 3866 prior to reboot. But, I had two reboots of the WU that I picked up next, Project: 3865 (Run 9337, Clone 0, Gen 1); I just let Windows do its crude termination - and the client/core a6 picked up the WU from its checkpoint and proceeded normally.

This may or may not be the issue, but "pulling the plug" seemed to be better than careful shutdown. :eo

See thread Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Re: Project: 3867 (Run 1685, Clone 0, Gen 3) Early Unit End

Posted: Sun Jun 19, 2011 2:19 am
by karlinhigh
Well, this is version 6.23, not 7. I just haven't set one of those up yet. But thanks for the advice; I'll keep that in mind once I upgrade.

Re: Project: 3867 (Run 1685, Clone 0, Gen 3) Early Unit End

Posted: Sun Jun 19, 2011 3:06 am
by bruce
GreyWhiskers wrote:Anecdotally, I've been reporting on an instance where I got similar results when I "carefully" paused the core, then exited the GUI on a Project: 3866 prior to reboot. But, I had two reboots of the WU that I picked up next, Project: 3865 (Run 9337, Clone 0, Gen 1); I just let Windows do its crude termination - and the client/core a6 picked up the WU from its checkpoint and proceeded normally.

This may or may not be the issue, but "pulling the plug" seemed to be better than careful shutdown. :eo
Shutting down a couple of times can lead to anecdotal conclusions that may be right and may be wrong (based on the extremely small sample size). I think it's fair to say that nobody yet knows why core A6 restarts sometimes and fails other times. Development is working on isolating and fixing the bug so I'm confident that we'll see a new version of Core_a6 one of these days.
karlinhigh wrote:Well, this is version 6.23, not 7. I just haven't set one of those up yet. But thanks for the advice; I'll keep that in mind once I upgrade.
Both clients are running the same Core A6 so there's no obvious reason for the problem to be different.

Re: Project: 3867 (Run 1685, Clone 0, Gen 3) Early Unit End

Posted: Mon Jun 20, 2011 1:59 am
by mrshirts
Thanks very much for all this information. Most of the WU are returning just fine, but there are a small but nonnegligible fraction that are having these unexplained crashes. We're still in the process of trying to replicate and fix this bug -- it's one of those horrible bugs where when you even put add a breakpoint in the debugger, the location of the crash changes -- this suggest a rare memory error, but it's not being picked up by Valgrind. So it's been very trial and error! These extra reports are very useful!

The biggest reason we are not just pulling everything back to at least advanced is that is project is for the F@H entry in a drug design challenge competition (http://sampl.eyesopen.com/). The deadline was recently extended to June 30th, so we're trying to push through what we can by then; at which case, we will be pulling back to advanced until we are sure we have the bug licked. It's been a really big challenge to try to decide whether to pull back because of the faulty WU, and sacrifice having enough data to show for the challenge, or to get the user base worried! I'll be monitoring the forums to try to strike the best balance! Thanks for all your patience with this, and please post any other details you can manage to find.

Re: Project: 3867 (Run 1685, Clone 0, Gen 3) Early Unit End

Posted: Mon Jun 20, 2011 2:33 am
by karlinhigh
Fold on, mrshirts. I can live with these issues, and I'm running -advmethods anyway.