Project: 6904 (Run 0, Clone 49, Gen 34)

Posted: Wed Feb 08, 2012 5:36 pm
by WitchDoctorB
I don't post here very often, but I got this work unit on my 48-core @2.5 GHz...

04:57:41:Unit 01:Project: 6904 (Run 0, Clone 49, Gen 34)
04:57:41:Unit 01:
04:57:41:Unit 01:Assembly optimizations on if available.
04:57:41:Unit 01:Entering M.D.
04:57:47:Unit 00: 2.06%
04:57:50:Unit 01:Mapping NT from 48 to 48 
04:57:53:Unit 00: 3.21%
04:57:58:Unit 01:Completed 0 out of 8750000 steps  (0%)
04:57:59:Unit 00: 4.25%
04:58:05:Unit 00: 5.32%
04:58:11:Unit 00: 6.36%
..."Edit out upload"
05:06:47:Unit 00: 98.79%
05:06:53:Unit 00: 99.87%
05:07:18:Unit 00: Upload complete
05:07:18:Server responded WORK_ACK (400)
05:07:18:Final credit estimate, 589810.00 points
05:07:18:Cleaning up Unit 00
13:37:53:Slot 00 paused
13:37:53:Slot 00: shutting core down
13:38:54:WARNING: Killing Unit 01
13:38:54:FahCore, running Unit 01, terminated.
13:39:15:Slot 00 unpaused
13:39:15:Starting Unit 01
13:39:15:Running core: /var/lib/fahclient/cores/www.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 01 -suffix 01 -lifeline 1825 -version 701 -checkpoint 15 -np 48
13:39:15:Started core on PID 7181
13:39:15:FahCore 0xa5 started
13:39:15:Started thread 107 on PID 1825
13:39:16:Unit 01:
13:39:16:Unit 01:*------------------------------*
13:39:16:Unit 01:Folding@Home Gromacs SMP Core
13:39:16:Unit 01:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
13:39:16:Unit 01:
13:39:16:Unit 01:Preparing to commence simulation
13:39:16:Unit 01:- Ensuring status. Please wait.
13:39:25:Unit 01:- Looking at optimizations...
13:39:25:Unit 01:- Working with standard loops on this execution.
13:39:25:Unit 01:- Previous termination of core was improper.
13:39:25:Unit 01:- Files status OK
13:39:30:Unit 01:- Expanded 46502931 -> 71843392 (decompressed 62.1 percent)
13:39:30:Unit 01:Called DecompressByteArray: compressed_data_size=46502931 data_size=71843392, decompressed_data_size=71843392 diff=0
13:39:31:Unit 01:- Digital signature verified
13:39:31:Unit 01:
13:39:31:Unit 01:Project: 6904 (Run 0, Clone 49, Gen 34)
13:39:31:Unit 01:
13:39:31:Unit 01:Entering M.D.
13:39:37:Unit 01:Using Gromacs checkpoints
13:39:44:Unit 01:Mapping NT from 48 to 48 
13:41:35:Unit 01:Resuming from checkpoint
13:41:44:Unit 01:Verified 01/wudata_01.log
13:41:45:Unit 01:Verified 01/wudata_01.trr
13:41:45:Unit 01:Verified 01/wudata_01.xtc
13:41:45:Unit 01:Verified 01/wudata_01.edr
13:41:46:Unit 01:Completed 49590 out of 8750000 steps  (0%)
16:59:34:Slot 00 paused
16:59:34:Slot 00: shutting core down
16:59:34:WARNING: FahCore is known to not shutdown cleanly, killing
16:59:34:WARNING: Killing Unit 01
16:59:35:FahCore, running Unit 01, terminated.
16:59:46:Slot 00 unpaused
16:59:46:Starting Unit 01
16:59:46:Running core: /var/lib/fahclient/cores/www.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 01 -suffix 01 -lifeline 1825 -version 701 -checkpoint 15 -np 48
16:59:46:Started core on PID 7787
16:59:46:FahCore 0xa5 started
16:59:46:Started thread 108 on PID 1825
16:59:47:Unit 01:
16:59:47:Unit 01:*------------------------------*
16:59:47:Unit 01:Folding@Home Gromacs SMP Core
16:59:47:Unit 01:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
16:59:47:Unit 01:
16:59:47:Unit 01:Preparing to commence simulation
16:59:47:Unit 01:- Ensuring status. Please wait.
16:59:56:Unit 01:- Looking at optimizations...
16:59:56:Unit 01:- Working with standard loops on this execution.
16:59:56:Unit 01:- Previous termination of core was improper.
16:59:56:Unit 01:- Going to use standard loops.
16:59:56:Unit 01:- Files status OK
17:00:01:Unit 01:- Expanded 46502931 -> 71843392 (decompressed 62.1 percent)
17:00:01:Unit 01:Called DecompressByteArray: compressed_data_size=46502931 data_size=71843392, decompressed_data_size=71843392 diff=0
17:00:02:Unit 01:- Digital signature verified
17:00:02:Unit 01:
17:00:02:Unit 01:Project: 6904 (Run 0, Clone 49, Gen 34)
17:00:02:Unit 01:
17:00:02:Unit 01:Entering M.D.
17:00:08:Unit 01:Using Gromacs checkpoints
17:00:15:Unit 01:Mapping NT from 48 to 48 
17:01:50:Unit 01:Resuming from checkpoint
17:02:06:Unit 01:Verified 01/wudata_01.log
17:02:07:Unit 01:Verified 01/wudata_01.trr
17:02:07:Unit 01:Verified 01/wudata_01.xtc
17:02:07:Unit 01:Verified 01/wudata_01.edr
17:02:08:Unit 01:Completed 67140 out of 8750000 steps  (0%)
I thought it had locked up, so I paused and restarted it, but then I noticed that it has 8750000 steps instead of 250000. By my calculations this will take about 62 days to complete, based on the roughly 8.5 hours it took to reach step 49590. It then took about 3:20 to get from there to 67140. Is this correct, or is this a bad work unit?
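
For anyone who wants to check the arithmetic, here is a minimal sketch of the estimate. It assumes the pace stays constant and uses only the step counts and rough elapsed times quoted above; the function name `eta_days` is made up for illustration and is not part of any Folding@home tool.

```python
def eta_days(steps_done, hours_elapsed, total_steps):
    """Linearly extrapolate the total run time (in days) from progress so far.

    Assumes a constant rate of steps per hour, which is only an approximation.
    """
    hours_per_step = hours_elapsed / steps_done
    return total_steps * hours_per_step / 24.0


TOTAL_STEPS = 8_750_000  # step count reported in the log for this work unit

# First stretch: about 8.5 hours to reach step 49590.
first = eta_days(49_590, 8.5, TOTAL_STEPS)

# Second stretch: about 3 h 20 min to go from step 49590 to 67140.
second = eta_days(67_140 - 49_590, 3 + 20 / 60, TOTAL_STEPS)

print(f"Estimate from first stretch:  {first:.1f} days")
print(f"Estimate from second stretch: {second:.1f} days")
```

Both stretches come out at roughly two months or more, so the slowdown is not a one-off hiccup; the step count itself is what makes the unit so long.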

Re: Project: 6904 (Run 0, Clone 49, Gen 34)

Posted: Wed Feb 08, 2012 5:52 pm
by Macaholic
Looks like a bad unit. Similar thread here explaining it in a bit more detail.
kasson wrote: We have identified ~30 WU's that have too many steps. These appear related to an event on Jan 16 where some incoming returns were not written properly (the return was credited, but some of the data wasn't written). We are manually re-running those WU's and will re-generate the new ones as soon as we can.

Re: Project: 6904 (Run 0, Clone 49, Gen 34)

Posted: Wed Feb 08, 2012 6:00 pm
by WitchDoctorB
Thanks. I missed that thread because I searched for the string "6904 clone 49".