
Project: 2669 (Run 17, Clone 49, Gen 20) - starts at 330%

Posted: Sat Nov 08, 2008 12:23 am
by dmearns
Now this is weird:

Code:

[15:07:28]
[15:07:28] + Processing work unit
[15:07:28] Core required: FahCore_a2.exe
[15:07:28] Core found.
[15:07:28] Working on Unit 07 [November 7 15:07:28]
[15:07:28] + Working ...
[15:07:28] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -forceasm -verbose -lifeline 5521 -version 602'

[15:07:28]
[15:07:28] *------------------------------*
[15:07:28] Folding@Home Gromacs SMP Core
[15:07:28] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[15:07:28]
[15:07:28] Preparing to commence simulation
[15:07:28] - Ensuring status. Please wait.
[15:07:38] - Assembly optimizations manually forced on.
[15:07:38] - Not checking prior termination.
[15:07:41] - Expanded 4836721 -> 23985345 (decompressed 495.9 percent)
[15:07:41] Called DecompressByteArray: compressed_data_size=4836721 data_size=23985345, decompressed_data_size=23985345 diff=0
[15:07:41] - Digital signature verified
[15:07:41]
[15:07:41] Project: 2669 (Run 17, Clone 49, Gen 20)
[15:07:41]
[15:07:41] Assembly optimizations on if available.
[15:07:41] Entering M.D.
[15:07:47] Will resume from checkpoint file
[15:07:50] Resuming from checkpoint
[15:07:51] fcSaveRestoreState: I/O failed dir=0, var=0000000003870650, varsize=573564
[15:07:51] fcSaveRestoreState: I/O failed dir=0, var=0000000003A1E4A0, varsize=573564
[15:07:51] Verified work/wudata_07.log
[15:07:51] Verified work/wudata_07.trr
[15:07:51] Verified work/wudata_07.xtc
[15:07:51] Verified work/wudata_07.edr
[15:07:51] Completed 825012 out of 250000 steps  (330%)
[15:33:08] Completed 827502 out of 250000 steps  (331%)
[15:58:31] Completed 830002 out of 250000 steps  (332%)
[16:23:46] Completed 832502 out of 250000 steps  (333%)
[16:49:07] Completed 835002 out of 250000 steps  (334%)
[17:00:03] - Autosending finished units...
[17:00:03] Trying to send all finished work units
[17:00:03] + No unsent completed units remaining.
[17:00:03] - Autosend completed
[17:14:33] Completed 837502 out of 250000 steps  (335%)
[17:39:50] Completed 840002 out of 250000 steps  (336%)
[18:05:09] Completed 842502 out of 250000 steps  (337%)
Is this just a reporting problem or is this WU messed up?

- Dave
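One way to spot this symptom in a longer log is to scan the "Completed N out of M steps" lines and flag any where the step count exceeds the total. The sketch below is only an illustration: the regular expression is inferred from the lines quoted above, and the function name `find_overrun` is made up for this example, not part of the client.

```python
import re

# Pattern inferred from lines such as:
# [15:07:51] Completed 825012 out of 250000 steps  (330%)
PROGRESS_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] Completed (\d+) out of (\d+) steps\s+\((\d+)%\)"
)


def find_overrun(log_lines):
    """Return (timestamp, done, total) for progress lines where done > total."""
    overruns = []
    for line in log_lines:
        m = PROGRESS_RE.match(line.strip())
        if m is None:
            continue  # not a progress line
        stamp, done, total = m.group(1), int(m.group(2)), int(m.group(3))
        if done > total:
            overruns.append((stamp, done, total))
    return overruns


# Three lines copied from the log above
sample = [
    "[15:07:51] Verified work/wudata_07.edr",
    "[15:07:51] Completed 825012 out of 250000 steps  (330%)",
    "[15:33:08] Completed 827502 out of 250000 steps  (331%)",
]

for stamp, done, total in find_overrun(sample):
    print(f"{stamp}: {done} of {total} steps ({100 * done // total}%)")
```

A normal unit should never produce a match, so any hit is a sign that the step counter was restored from a bad state.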

Re: Project: 2669 (Run 17, Clone 49, Gen 20) - starts at 330%

Posted: Sat Nov 08, 2008 12:34 am
by parkut
What does the unitinfo.txt file say? If you have access to QD-tools, what does that reveal?

Re: Project: 2669 (Run 17, Clone 49, Gen 20) - starts at 330%

Posted: Sat Nov 08, 2008 12:40 am
by dmearns
parkut wrote: What does the unitinfo.txt file say? If you have access to QD-tools, what does that reveal?

Code:

Current Work Unit
-----------------
Name: Gromacs
Tag: P2669R17C49G20
Download time: November 7 15:07:28
Due time: November 10 15:07:28
Progress: 352%  [|||||||||||||||||||||||||||||||||||]

Code:

 Index 7: folding now 27.3 X min speed; 352% complete
  server: 171.64.65.56:8080; project: 2669
  Folding: run 17, clone 49, generation 20; benchmark 0; misc: 500, 200
  issue: Fri Nov  7 10:07:20 2008; begin: Fri Nov  7 10:07:28 2008
  expect: Fri Nov  7 12:45:57 2008; due: Mon Nov 10 10:07:28 2008 (3 days)
  core URL: http://www.stanford.edu/~pande/Linux/x86Core_a2.fah (V2.01)
  CPU: 1,0 x86; OS: 4,0 Linux
  assignment info (le): Fri Nov  7 10:07:19 2008; BBFE10E0
  CS: 171.67.108.25; P limit: 524286976
  user: river; team: 13149; ID: EA7DA9587F20D637; mach ID: 1
  work/wudata_07.dat file size: 4837233; WU type: Folding@Home

Re: Project: 2669 (Run 17, Clone 49, Gen 20) - starts at 330%

Posted: Sat Nov 08, 2008 1:01 pm
by parkut
I'm going to guess it's a cosmetic issue. Keep an eye on it: at 25 minutes, 23 seconds per frame completed, that would be done in about 39 or 40 hours...
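As a rough check on that kind of estimate, the frame time can be averaged from the log timestamps and scaled up. This is a sketch under stated assumptions: one frame is taken to be 1% of the unit (2,500 steps), a full unit is 100 frames, and all timestamps fall on the same day. The helper name `mean_frame_time` is invented for this example.

```python
from datetime import datetime, timedelta


def mean_frame_time(stamps):
    """Average gap between consecutive HH:MM:SS stamps (same day assumed)."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Timestamps of the 331% to 337% lines in the first log
stamps = [
    "15:33:08", "15:58:31", "16:23:46", "16:49:07",
    "17:14:33", "17:39:50", "18:05:09",
]

frame = mean_frame_time(stamps)
full_run = frame * 100  # assumption: 100 frames per work unit

print("mean frame time:", frame)
print("hours for 100 frames:", round(full_run.total_seconds() / 3600, 1))
```

With these timestamps the average comes to a little over 25 minutes per frame, or roughly 42 hours for 100 frames, which is in the same range as the figure quoted above and inside the three-day deadline only if the unit really does stop at 100%.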

Re: Project: 2669 (Run 17, Clone 49, Gen 20) - starts at 330%

Posted: Mon Nov 10, 2008 6:42 pm
by dmearns
Well it looks like it was a real problem after all:

Code:

[12:44:58] Completed 1277502 out of 250000 steps  (511%)
[13:04:31] Completed 1280002 out of 250000 steps  (512%)
[13:24:07] Completed 1282502 out of 250000 steps  (513%)
[13:43:38] Completed 1285002 out of 250000 steps  (514%)
[14:03:11] Completed 1287502 out of 250000 steps  (515%)
[14:22:42] Completed 1290002 out of 250000 steps  (516%)
[14:42:13] Completed 1292502 out of 250000 steps  (517%)
[15:02:03] Completed 1295002 out of 250000 steps  (518%)
[15:27:22] Completed 1297502 out of 250000 steps  (519%)
[15:27:22] Unit 7's deadline (November 10 15:07) has passed.
[15:27:22] Going to interrupt core and move on to next unit...
[15:27:23] CoreStatus = 0 (0)
[15:27:23] Client-core communications error: ERROR 0x0
[15:27:23] Deleting current work unit & continuing...
[15:27:38] - Warning: Could not delete all work unit files (7): Core file absent
[15:27:38] Trying to send all finished work units
[15:27:38] + No unsent completed units remaining.
[15:27:38] - Preparing to get new work unit...
[15:27:38] + Attempting to get work packet
And it failed to kill the core processes, so I had 2 sets going until I killed the old ones manually.

- Dave
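When two sets of core processes are running, one way to tell them apart is the `-suffix` argument visible in the mpiexec command line of the first log (`-suffix 07` for unit 7). The sketch below filters process-listing text for that argument. It assumes the core processes carry the same arguments as the launch command, the helper name `pids_for_unit` is made up, and the sample listing (PIDs and the `-suffix 08` line) is hypothetical.

```python
def pids_for_unit(ps_output, suffix, core="FahCore_a2.exe"):
    """Return PIDs whose command line names the core and '-suffix <suffix>'."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip the header row and blank lines
        args = parts[1:]
        if not any(core in arg for arg in args):
            continue  # not a core (or core launcher) process
        if "-suffix" in args:
            i = args.index("-suffix")
            if i + 1 < len(args) and args[i + 1] == suffix:
                pids.append(int(parts[0]))
    return pids


# Hypothetical `ps -eo pid,args` output, made up for illustration
sample = """\
  PID COMMAND
 5530 ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -forceasm -verbose -lifeline 5521 -version 602
 9120 ./FahCore_a2.exe -dir work/ -suffix 08 -checkpoint 15 -forceasm -verbose -lifeline 5521 -version 602
 9200 bash
"""

print(pids_for_unit(sample, "07"))
```

On a live system the text could come from `subprocess.run(["ps", "-eo", "pid,args"], capture_output=True, text=True).stdout`. Note that the mpiexec launcher line also contains the core name and suffix, so it would be listed too. Review the PIDs by hand before killing anything.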

Re: Project: 2669 (Run 17, Clone 49, Gen 20) - starts at 330%

Posted: Fri Nov 14, 2008 1:23 pm
by verdeva
I had this same thing just occur on a 2669, except it started at 198% and is now at 325%.

Based on what I read here, I'm going to delete this WU.

Project: 2669 (Run 7, Clone 165, Gen 18)

Re: Project: 2669 (Run 17, Clone 49, Gen 20) - starts at 330%

Posted: Fri Nov 14, 2008 6:35 pm
by kasson
It looks like the problem was with the checkpoint files--try clearing your checkpoint files and restarting. (It will restart the WU from the beginning.)

Re: Project: 2669 (Run 17, Clone 49, Gen 20) - starts at 330%

Posted: Fri Nov 14, 2008 7:17 pm
by dmearns
kasson wrote: It looks like the problem was with the checkpoint files--try clearing your checkpoint files and restarting. (It will restart the WU from the beginning.)
Thanks. Would the checkpoint files be state.cpt and state_prev.cpt?

- Dave