Good WU gone bad?
Posted: Sat Apr 03, 2010 10:57 pm
I've never seen this before, where the results of a good WU that hasn't been sent suddenly turn bad. There was a system failure on my laptop (the Lenovo), and I had to shut down the machine, but the WU results had already been saved. When I restarted it, I got a message, "Bad Index stored in Work Queue. Has been reset to zero.", but I assumed that referred to a bad wu [which may have been bad due the error on the machine, since it is now reprocessing (so far successfully @45%) the same wu.
Don't know why it skipped slot 00. I checked the work folder and there's no wuresults_00 file. However, I still have wuresults_07 [sent] and _08. Also the files for WU 09 are still hanging around. Should I delete them, or will they get cleaned up?
Code: Select all
# Windows SMP Console Edition #################################################
###############################################################################
Folding@Home Client Version 6.29
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: C:\SMP
Executable: C:\SMP\SMP.exe
Arguments: -smp -deino -verbosity 9
[23:04:21] - Ask before connecting: No
[23:04:21] - User name: anko1 (Team 47815)
[23:04:21] - User ID: 74362D0123xxxx
[23:04:21] - Machine ID: 1
[23:04:21]
[23:04:21] Loaded queue successfully.
[23:04:21]
[23:04:21] - Autosending finished units... [April 2 23:04:21 UTC]
[23:04:21] + Processing work unit
[23:04:21] Trying to send all finished work units
[23:04:21] Core required: FahCore_a3.exe
[23:04:21] Project: 6013 (Run 0, Clone 31, Gen 134)
[23:04:21] Core found.
[23:04:21] + Attempting to send results [April 2 23:04:21 UTC]
[23:04:21] - Reading file work/wuresults_07.dat from core
[23:04:21] Working on queue slot 08 [April 2 23:04:21 UTC] <---------Start WU 08
[23:04:25] + Working ...
[23:04:25] (Read 25887675 bytes from disk)
[23:04:25] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 08 -np 4 -checkpoint 15 -verbose -lifeline 2440 -version 629'
[23:04:25] Connecting to http://130.237.232.140:8080/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25] (130.237.232.140:8080)
[23:04:25] + Retrying using alternative port
[23:04:25] Connecting to http://130.237.232.140:80/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25] (130.237.232.140:80)
[23:04:25] - Error: Could not transmit unit 07 (completed April 2) to work server.
[23:04:25] - 4 failed uploads of this unit.
[23:04:25] + Attempting to send results [April 2 23:04:25 UTC]
[23:04:25] - Reading file work/wuresults_07.dat from core
[23:04:25] (Read 25887675 bytes from disk)
[23:04:25] Connecting to http://130.237.162.125:8080/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25] (130.237.162.125:8080)
[23:04:25] + Retrying using alternative port
[23:04:25] Connecting to http://130.237.162.125:80/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25] (130.237.162.125:80)
[23:04:25] Could not transmit unit 07 to Collection server; keeping in queue.
[23:04:25] + Sent 0 of 1 completed units to the server
[23:04:25] - Autosend completed
[23:04:27]
[23:04:28] *------------------------------*
[23:04:28] Folding@Home Gromacs SMP Core
[23:04:28] Version 2.17 (Mar 12, 2010)
[23:04:28]
[23:04:28] Preparing to commence simulation
[23:04:28] - Ensuring status. Please wait.
[23:04:36] - Looking at optimizations...
[23:04:36] - Working with standard loops on this execution.
[23:04:36] - Previous termination of core was improper.
[23:04:36] - Files status OK
[23:04:37] - Expanded 1768768 -> 1971489 (decompressed 111.4 percent)
[23:04:37] Called DecompressByteArray: compressed_data_size=1768768 data_size=1971489, decompressed_data_size=1971489 diff=0
[23:04:37] - Digital signature verified
[23:04:38]
[23:04:38] Project: 6020 (Run 0, Clone 1, Gen 112)
[23:04:38]
[23:04:38] Entering M.D.
[23:04:46] Completed 0 out of 500000 steps (0%)
[23:12:06] Completed 5000 out of 500000 steps (1%)
[snip}
[05:04:25] + Attempting to send results [April 3 05:04:25 UTC]
[05:04:25] - Reading file work/wuresults_07.dat from core <----------WU 07 sent
[05:04:25] (Read 25887675 bytes from disk)
[05:04:25] Connecting to http://130.237.232.140:8080/
[05:06:04] Completed 220000 out of 500000 steps (44%)
[05:11:04] Posted data.
[05:11:04] Initial: 0000; - Uploaded at ~63 kB/s
[05:11:04] - Averaged speed for that direction ~195 kB/s
[05:11:04] + Results successfully sent
{snip}
[13:31:53] Completed 500000 out of 500000 steps (100%)
[13:31:54] DynamicWrapper: Finished Work Unit: sleep=10000
[13:32:04]
[13:32:04] Finished Work Unit:
[13:32:04] - Reading up to 20302920 from "work/wudata_08.trr": Read 20302920
[13:32:04] trr file hash check passed.
[13:32:04] edr file hash check passed.
[13:32:04] logfile size: 58828
[13:32:04] Leaving Run
[13:32:07] - Writing 20396024 bytes of core data to disk...
[13:32:08] ... Done.
[13:32:13] - Shutting down core
[13:32:13]
[13:32:13] Folding@home Core Shutdown: FINISHED_UNIT
[13:32:15] CoreStatus = 64 (100) <------Finish WU 08 [Good]
[13:32:16] Unit 8 finished with 87 percent of time to deadline remaining.
[13:32:16] Updated performance fraction: 0.842166
[13:32:16] Sending work to server
[13:32:16] Project: 6020 (Run 0, Clone 1, Gen 112)
[13:32:16] + Attempting to send results [April 3 13:32:16 UTC]
[13:32:16] - Reading file work/wuresults_08.dat from core
[13:32:16] (Read 20396024 bytes from disk)
[13:32:16] Connecting to http://171.64.65.54:8080/
[13:32:16] - Couldn't send HTTP request to server
[13:32:16] (Got status 503)
[13:32:16] + Could not connect to Work Server (results)
[13:32:16] (171.64.65.54:8080)
[13:32:16] + Retrying using alternative port
{snip}
[15:00:39] + Processing work unit
[15:00:39] Core required: FahCore_a3.exe
[15:00:39] Core found.
[15:00:39] Working on queue slot 09 [April 3 15:00:39 UTC]
[15:00:39] + Working ...
[15:00:39] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 09 -np 4 -checkpoint 15 -verbose -lifeline 2440 -version 629'
[15:00:39]
[15:00:39] *------------------------------*
[15:00:39] Folding@Home Gromacs SMP Core
[15:00:39] Version 2.17 (Mar 12, 2010)
[15:00:39]
[15:00:39] Preparing to commence simulation
[15:00:39] - Looking at optimizations...
[15:00:39] - Created dyn
[15:00:39] - Files status OK
[15:00:40] - Expanded 1797187 -> 2078149 (decompressed 115.6 percent)
[15:00:40] Called DecompressByteArray: compressed_data_size=1797187 data_size=2078149, decompressed_data_size=2078149 diff=0
[15:00:40] - Digital signature verified
[15:00:40]
[15:00:40] Project: 6012 (Run 1, Clone 4, Gen 42)
[15:00:40]
[15:00:40] Assembly optimizations on if available.
[15:00:40] Entering M.D.
[15:00:46] Completed 0 out of 500000 steps (0%)
[15:08:49] Completed 5000 out of 500000 steps (1%)
[15:16:57] Completed 10000 out of 500000 steps (2%)
[15:25:03] Completed 15000 out of 500000 steps (3%)
[15:33:29] Completed 20000 out of 500000 steps (4%)
[15:44:30] Completed 25000 out of 500000 steps (5%)
[15:45:47] Caught exception!
[15:45:47] mdrun returned 255
[15:45:47] Going to send back what have done -- stepsTotalG=500000
[15:45:47] Work fraction=0.0512 steps=500000.
[15:45:51] logfile size=13919 infoLength=13919 edr=0 trr=25
[15:45:51] logfile size: 13919 info=13919 bed=0 hdr=25
[15:45:51] Warning: Core could not open logfile work/wudata_09.log; steps completed=500000.
[15:45:51] - Could not open results file
[15:45:51]
[15:45:51] Folding@home Core Shutdown: EARLY_UNIT_END
[15:45:51] Couldn't delete work/core78.sta.
[15:45:54] CoreStatus = 72 (114)
[15:45:54] Sending work to server
[15:45:54] Project: 6012 (Run 1, Clone 4, Gen 42)
[15:45:54] - Error: Could not get length of results file work/wuresults_09.dat
[15:45:54] - Error: Could not read unit 09 file. Removing from queue.
[15:45:54] Trying to send all finished work units
[15:45:54] Project: 6020 (Run 0, Clone 1, Gen 112)
[15:45:54] - Error: Could not get length of results file work/wuresults_08.dat <---- WU 08 gone bad?
[15:45:54] - Error: Could not read unit 08 file. Removing from queue.
[15:45:54] + Sent 0 of 1 completed units to the server
[15:45:54] - Preparing to get new work unit...
[15:45:54] Cleaning up work directory
[15:45:54] + Attempting to get work packet
{snip}
[15:57:18] - Attempt #8 to get work failed, and no other work to do.
Waiting before retry. <------ Hard stop
--- Opening Log file [April 3 17:33:22 UTC]
# Windows SMP Console Edition #################################################
###############################################################################
Folding@Home Client Version 6.29
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: C:\SMP
Executable: C:\SMP\SMP.exe
Arguments: -smp -deino -verbosity 9
[17:33:22] - Ask before connecting: No
[17:33:22] - User name: anko1 (Team 47815)
[17:33:22] - User ID: 74362D012xxxxx
[17:33:22] - Machine ID: 1
[17:33:22]
[17:33:22] Loaded queue successfully.
[17:33:22] - Bad Index stored in Work Queue. Has been reset to zero. <---------- Is this a clue?
[17:33:22] - Preparing to get new work unit...
[17:33:22] - Autosending finished units... [April 3 17:33:22 UTC]
[17:33:22] Cleaning up work directory
[17:33:22] Trying to send all finished work units
[17:33:22] + No unsent completed units remaining.
[17:33:22] - Autosend completed
[17:33:22] + Attempting to get work packet
[17:33:22] Passkey found
[17:33:22] - Will indicate memory of 2553 MB
[17:33:22] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 7, Stepping: 10
[17:33:22] - Connecting to assignment server
[17:33:22] Connecting to http://assign.stanford.edu:8080/
[17:33:22] Posted data.
[17:33:22] Initial: ED82; - Successful: assigned to (130.237.232.140).
[17:33:22] + News From Folding@Home: Welcome to Folding@Home
[17:33:23] Loaded queue successfully.
[17:33:23] Connecting to http://130.237.232.140:8080/
[17:33:23] Posted data.
[17:33:23] Initial: 0000; - Receiving payload (expected size: 1797699)
[17:33:28] - Downloaded at ~351 kB/s
[17:33:28] - Averaged speed for that direction ~351 kB/s
[17:33:28] + Received work.
[17:33:28] + Closed connections
[17:33:28]
[17:33:28] + Processing work unit
[17:33:28] Core required: FahCore_a3.exe
[17:33:28] Core found.
[17:33:28] Working on queue slot 01 [April 3 17:33:28 UTC]
[17:33:28] + Working ...
[17:33:28] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 4 -checkpoint 15 -verbose -lifeline 2448 -version 629'
[17:33:28]
[17:33:28] *------------------------------*
[17:33:28] Folding@Home Gromacs SMP Core
[17:33:28] Version 2.17 (Mar 12, 2010)
[17:33:28]
[17:33:28] Preparing to commence simulation
[17:33:28] - Ensuring status. Please wait.
[17:33:38] - Looking at optimizations...
[17:33:38] - Working with standard loops on this execution.
[17:33:38] - Created dyn
[17:33:38] - Files status OK
[17:33:38] - Expanded 1797187 -> 2078149 (decompressed 115.6 percent)
[17:33:38] Called DecompressByteArray: compressed_data_size=1797187 data_size=2078149, decompressed_data_size=2078149 diff=0
[17:33:38] - Digital signature verified
[17:33:38]
[17:33:38] Project: 6012 (Run 1, Clone 4, Gen 42)
[17:33:38]
[17:33:38] Entering M.D.
[17:33:45] Completed 0 out of 500000 steps (0%)
[17:40:30] Completed 5000 out of 500000 steps (1%)
Don't know why it skipped slot 00. I checked the work folder and there's no wuresults_00 file. However, I still have wuresults_07 [sent] and _08. Also the files for WU 09 are still hanging around. Should I delete them, or will they get cleaned up?