Page 1 of 1

Good WU gone bad?

Posted: Sat Apr 03, 2010 10:57 pm
by anko1
I've never seen this before, where the results of a good WU that hasn't been sent suddenly turn bad. There was a system failure on my laptop (the Lenovo), and I had to shut down the machine, but the WU results had already been saved. When I restarted it, I got a message, "Bad Index stored in Work Queue. Has been reset to zero.", but I assumed that referred to a bad wu [which may have been bad due the error on the machine, since it is now reprocessing (so far successfully @45%) the same wu.

Code: Select all

# Windows SMP Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.29

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\SMP
Executable: C:\SMP\SMP.exe
Arguments: -smp -deino -verbosity 9 

[23:04:21] - Ask before connecting: No
[23:04:21] - User name: anko1 (Team 47815)
[23:04:21] - User ID: 74362D0123xxxx
[23:04:21] - Machine ID: 1
[23:04:21] 
[23:04:21] Loaded queue successfully.
[23:04:21] 
[23:04:21] - Autosending finished units... [April 2 23:04:21 UTC]
[23:04:21] + Processing work unit
[23:04:21] Trying to send all finished work units
[23:04:21] Core required: FahCore_a3.exe
[23:04:21] Project: 6013 (Run 0, Clone 31, Gen 134)
[23:04:21] Core found.


[23:04:21] + Attempting to send results [April 2 23:04:21 UTC]
[23:04:21] - Reading file work/wuresults_07.dat from core
[23:04:21] Working on queue slot 08 [April 2 23:04:21 UTC]                                             <---------Start WU 08
[23:04:25] + Working ...
[23:04:25]   (Read 25887675 bytes from disk)
[23:04:25] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 08 -np 4 -checkpoint 15 -verbose -lifeline 2440 -version 629'

[23:04:25] Connecting to http://130.237.232.140:8080/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25]     (130.237.232.140:8080)
[23:04:25] + Retrying using alternative port
[23:04:25] Connecting to http://130.237.232.140:80/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25]     (130.237.232.140:80)
[23:04:25] - Error: Could not transmit unit 07 (completed April 2) to work server.
[23:04:25] - 4 failed uploads of this unit.


[23:04:25] + Attempting to send results [April 2 23:04:25 UTC]
[23:04:25] - Reading file work/wuresults_07.dat from core
[23:04:25]   (Read 25887675 bytes from disk)
[23:04:25] Connecting to http://130.237.162.125:8080/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25]     (130.237.162.125:8080)
[23:04:25] + Retrying using alternative port
[23:04:25] Connecting to http://130.237.162.125:80/
[23:04:25] - Couldn't send HTTP request to server
[23:04:25] + Could not connect to Work Server (results)
[23:04:25]     (130.237.162.125:80)
[23:04:25]   Could not transmit unit 07 to Collection server; keeping in queue.
[23:04:25] + Sent 0 of 1 completed units to the server
[23:04:25] - Autosend completed
[23:04:27] 
[23:04:28] *------------------------------*
[23:04:28] Folding@Home Gromacs SMP Core
[23:04:28] Version 2.17 (Mar 12, 2010)
[23:04:28] 
[23:04:28] Preparing to commence simulation
[23:04:28] - Ensuring status. Please wait.
[23:04:36] - Looking at optimizations...
[23:04:36] - Working with standard loops on this execution.
[23:04:36] - Previous termination of core was improper.
[23:04:36] - Files status OK
[23:04:37] - Expanded 1768768 -> 1971489 (decompressed 111.4 percent)
[23:04:37] Called DecompressByteArray: compressed_data_size=1768768 data_size=1971489, decompressed_data_size=1971489 diff=0
[23:04:37] - Digital signature verified
[23:04:38] 
[23:04:38] Project: 6020 (Run 0, Clone 1, Gen 112)
[23:04:38] 
[23:04:38] Entering M.D.
[23:04:46] Completed 0 out of 500000 steps  (0%)
[23:12:06] Completed 5000 out of 500000 steps  (1%)
                      [snip}

[05:04:25] + Attempting to send results [April 3 05:04:25 UTC]
[05:04:25] - Reading file work/wuresults_07.dat from core                                               <----------WU 07 sent
[05:04:25]   (Read 25887675 bytes from disk)
[05:04:25] Connecting to http://130.237.232.140:8080/
[05:06:04] Completed 220000 out of 500000 steps  (44%)
[05:11:04] Posted data.
[05:11:04] Initial: 0000; - Uploaded at ~63 kB/s
[05:11:04] - Averaged speed for that direction ~195 kB/s
[05:11:04] + Results successfully sent
                     {snip}
[13:31:53] Completed 500000 out of 500000 steps  (100%)
[13:31:54] DynamicWrapper: Finished Work Unit: sleep=10000
[13:32:04] 
[13:32:04] Finished Work Unit:
[13:32:04] - Reading up to 20302920 from "work/wudata_08.trr": Read 20302920
[13:32:04] trr file hash check passed.
[13:32:04] edr file hash check passed.
[13:32:04] logfile size: 58828
[13:32:04] Leaving Run
[13:32:07] - Writing 20396024 bytes of core data to disk...
[13:32:08]   ... Done.
[13:32:13] - Shutting down core
[13:32:13] 
[13:32:13] Folding@home Core Shutdown: FINISHED_UNIT
[13:32:15] CoreStatus = 64 (100)                                                                                     <------Finish WU 08 [Good]
[13:32:16] Unit 8 finished with 87 percent of time to deadline remaining.
[13:32:16] Updated performance fraction: 0.842166
[13:32:16] Sending work to server
[13:32:16] Project: 6020 (Run 0, Clone 1, Gen 112)


[13:32:16] + Attempting to send results [April 3 13:32:16 UTC]
[13:32:16] - Reading file work/wuresults_08.dat from core
[13:32:16]   (Read 20396024 bytes from disk)
[13:32:16] Connecting to http://171.64.65.54:8080/
[13:32:16] - Couldn't send HTTP request to server
[13:32:16]   (Got status 503)
[13:32:16] + Could not connect to Work Server (results)
[13:32:16]     (171.64.65.54:8080)
[13:32:16] + Retrying using alternative port

{snip}

[15:00:39] + Processing work unit
[15:00:39] Core required: FahCore_a3.exe
[15:00:39] Core found.
[15:00:39] Working on queue slot 09 [April 3 15:00:39 UTC]
[15:00:39] + Working ...
[15:00:39] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 09 -np 4 -checkpoint 15 -verbose -lifeline 2440 -version 629'

[15:00:39] 
[15:00:39] *------------------------------*
[15:00:39] Folding@Home Gromacs SMP Core
[15:00:39] Version 2.17 (Mar 12, 2010)
[15:00:39] 
[15:00:39] Preparing to commence simulation
[15:00:39] - Looking at optimizations...
[15:00:39] - Created dyn
[15:00:39] - Files status OK
[15:00:40] - Expanded 1797187 -> 2078149 (decompressed 115.6 percent)
[15:00:40] Called DecompressByteArray: compressed_data_size=1797187 data_size=2078149, decompressed_data_size=2078149 diff=0
[15:00:40] - Digital signature verified
[15:00:40] 
[15:00:40] Project: 6012 (Run 1, Clone 4, Gen 42)
[15:00:40] 
[15:00:40] Assembly optimizations on if available.
[15:00:40] Entering M.D.
[15:00:46] Completed 0 out of 500000 steps  (0%)
[15:08:49] Completed 5000 out of 500000 steps  (1%)
[15:16:57] Completed 10000 out of 500000 steps  (2%)
[15:25:03] Completed 15000 out of 500000 steps  (3%)
[15:33:29] Completed 20000 out of 500000 steps  (4%)
[15:44:30] Completed 25000 out of 500000 steps  (5%)
[15:45:47] Caught exception!
[15:45:47] mdrun returned 255
[15:45:47] Going to send back what have done -- stepsTotalG=500000
[15:45:47] Work fraction=0.0512 steps=500000.
[15:45:51] logfile size=13919 infoLength=13919 edr=0 trr=25
[15:45:51] logfile size: 13919 info=13919 bed=0 hdr=25
[15:45:51] Warning: Core could not open logfile work/wudata_09.log; steps completed=500000.
[15:45:51] - Could not open results file
[15:45:51] 
[15:45:51] Folding@home Core Shutdown: EARLY_UNIT_END
[15:45:51] Couldn't delete work/core78.sta.
[15:45:54] CoreStatus = 72 (114)
[15:45:54] Sending work to server
[15:45:54] Project: 6012 (Run 1, Clone 4, Gen 42)
[15:45:54] - Error: Could not get length of results file work/wuresults_09.dat
[15:45:54] - Error: Could not read unit 09 file. Removing from queue.
[15:45:54] Trying to send all finished work units
[15:45:54] Project: 6020 (Run 0, Clone 1, Gen 112)
[15:45:54] - Error: Could not get length of results file work/wuresults_08.dat                     <---- WU 08 gone bad?
[15:45:54] - Error: Could not read unit 08 file. Removing from queue.
[15:45:54] + Sent 0 of 1 completed units to the server
[15:45:54] - Preparing to get new work unit...
[15:45:54] Cleaning up work directory
[15:45:54] + Attempting to get work packet
                 {snip}
[15:57:18] - Attempt #8  to get work failed, and no other work to do.
Waiting before retry.                                                                                                         <------ Hard stop


--- Opening Log file [April 3 17:33:22 UTC] 


# Windows SMP Console Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.29

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\SMP
Executable: C:\SMP\SMP.exe
Arguments: -smp -deino -verbosity 9 

[17:33:22] - Ask before connecting: No
[17:33:22] - User name: anko1 (Team 47815)
[17:33:22] - User ID: 74362D012xxxxx
[17:33:22] - Machine ID: 1
[17:33:22] 
[17:33:22] Loaded queue successfully.
[17:33:22] - Bad Index stored in Work Queue. Has been reset to zero.                                   <---------- Is this a clue?
[17:33:22] - Preparing to get new work unit...
[17:33:22] - Autosending finished units... [April 3 17:33:22 UTC]
[17:33:22] Cleaning up work directory
[17:33:22] Trying to send all finished work units
[17:33:22] + No unsent completed units remaining.
[17:33:22] - Autosend completed
[17:33:22] + Attempting to get work packet
[17:33:22] Passkey found
[17:33:22] - Will indicate memory of 2553 MB
[17:33:22] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 7, Stepping: 10
[17:33:22] - Connecting to assignment server
[17:33:22] Connecting to http://assign.stanford.edu:8080/
[17:33:22] Posted data.
[17:33:22] Initial: ED82; - Successful: assigned to (130.237.232.140).
[17:33:22] + News From Folding@Home: Welcome to Folding@Home
[17:33:23] Loaded queue successfully.
[17:33:23] Connecting to http://130.237.232.140:8080/
[17:33:23] Posted data.
[17:33:23] Initial: 0000; - Receiving payload (expected size: 1797699)
[17:33:28] - Downloaded at ~351 kB/s
[17:33:28] - Averaged speed for that direction ~351 kB/s
[17:33:28] + Received work.
[17:33:28] + Closed connections
[17:33:28] 
[17:33:28] + Processing work unit
[17:33:28] Core required: FahCore_a3.exe
[17:33:28] Core found.
[17:33:28] Working on queue slot 01 [April 3 17:33:28 UTC]
[17:33:28] + Working ...
[17:33:28] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 4 -checkpoint 15 -verbose -lifeline 2448 -version 629'

[17:33:28] 
[17:33:28] *------------------------------*
[17:33:28] Folding@Home Gromacs SMP Core
[17:33:28] Version 2.17 (Mar 12, 2010)
[17:33:28] 
[17:33:28] Preparing to commence simulation
[17:33:28] - Ensuring status. Please wait.
[17:33:38] - Looking at optimizations...
[17:33:38] - Working with standard loops on this execution.
[17:33:38] - Created dyn
[17:33:38] - Files status OK
[17:33:38] - Expanded 1797187 -> 2078149 (decompressed 115.6 percent)
[17:33:38] Called DecompressByteArray: compressed_data_size=1797187 data_size=2078149, decompressed_data_size=2078149 diff=0
[17:33:38] - Digital signature verified
[17:33:38] 
[17:33:38] Project: 6012 (Run 1, Clone 4, Gen 42)
[17:33:38] 
[17:33:38] Entering M.D.
[17:33:45] Completed 0 out of 500000 steps  (0%)
[17:40:30] Completed 5000 out of 500000 steps  (1%)

Don't know why it skipped slot 00. I checked the work folder and there's no wuresults_00 file. However, I still have wuresults_07 [sent] and _08. Also the files for WU 09 are still hanging around. Should I delete them, or will they get cleaned up?

Re: Good WU gone bad?

Posted: Sat Apr 03, 2010 11:13 pm
by Wrish
Work unit 08 ran into an error, the results could not be uploaded, unit was deleted on your end, and therefore the server gave you a fresh copy of the same work unit. Look for system instability.

Try not to delete work files. The client should eventually get around to erasing them.

Re: Good WU gone bad?

Posted: Sat Apr 03, 2010 11:21 pm
by anko1
Actually, I think WU 08 finished OK, but WU 09 went bad and then 08 turned bad, or something:

Code: Select all

[13:32:13] Folding@home Core Shutdown: FINISHED_UNIT
[13:32:15] CoreStatus = 64 (100)
[13:32:16] Unit 8 finished with 87 percent of time to deadline remaining.
[13:32:16] Updated performance fraction: 0.842166
[13:32:16] Sending work to server
[13:32:16] Project: 6020 (Run 0, Clone 1, Gen 112)

Code: Select all

[15:45:47] Caught exception!
[15:45:47] mdrun returned 255
[15:45:47] Going to send back what have done -- stepsTotalG=500000
[15:45:47] Work fraction=0.0512 steps=500000.
[15:45:51] logfile size=13919 infoLength=13919 edr=0 trr=25
[15:45:51] logfile size: 13919 info=13919 bed=0 hdr=25
[15:45:51] Warning: Core could not open logfile work/wudata_09.log; steps completed=500000.
[15:45:51] - Could not open results file
[15:45:51] 
[15:45:51] Folding@home Core Shutdown: EARLY_UNIT_END
[15:45:51] Couldn't delete work/core78.sta.
[15:45:54] CoreStatus = 72 (114)
[15:45:54] Sending work to server
[15:45:54] Project: 6012 (Run 1, Clone 4, Gen 42)
[15:45:54] - Error: Could not get length of results file work/wuresults_09.dat
[15:45:54] - Error: Could not read unit 09 file. Removing from queue.

Code: Select all

[15:45:54] Trying to send all finished work units
[15:45:54] Project: 6020 (Run 0, Clone 1, Gen 112)
[15:45:54] - Error: Could not get length of results file work/wuresults_08.dat
[15:45:54] - Error: Could not read unit 08 file. Removing from queue.
[15:45:54] + Sent 0 of 1 completed units to the server
[15:45:54] - Preparing to get new work unit...

Re: Good WU gone bad?

Posted: Sat Apr 03, 2010 11:48 pm
by Wrish
Oh, I'm sorry, the snips confused me. Those of us who interpret logs can see uploads, early ends, and client crashes. :)

Work unit 09 was the one that crashed out and was handed back to you as work unit 01. There is a snip between the first and second attempts to upload work unit 08 such that I can't tell what the client recorded. There are also errors deleting files that have me wondering whether an antivirus or permissions problem may be affecting the client.

When you get a Bad Index error, though, you should -oneunit whatever work unit you're working on (completing and uploading it), and then delete queue.dat and, optionally, the entire work folder.

If you don't have a network connection at the time of completing a work unit, you can later run the client with the flag "-send all" and it will upload whatever it has and then exit.

Re: Good WU gone bad?

Posted: Sun Apr 04, 2010 12:19 am
by P5-133XL
If the WU hasn't actually been successfully uploaded to the server, then yes it is possible for an external event to kill it and thereby fail to get credit. That is a very rare issue however.

Re: Good WU gone bad?

Posted: Sun Apr 04, 2010 12:26 am
by anko1
@Wrish and P5-133XL - thanks for the information.

Re: Good WU gone bad?

Posted: Sun Apr 04, 2010 12:36 am
by P5-133XL
If the data files are still there, you might try using the qfix utility to see if it can be recovered.

Re: Good WU gone bad?

Posted: Sun Apr 04, 2010 12:47 am
by anko1
Ahh, I'd forgotten about qfix. Thanks for the reminder.

Re: Good WU gone bad?

Posted: Sun Apr 04, 2010 12:58 am
by P5-133XL
It's alright, I had forgotten about it too, otherwise I would have suggested it earlier when I posted. But then I remembered...

Re: Good WU gone bad?

Posted: Sun Apr 04, 2010 6:33 pm
by anko1
I'm using Windows XP, and tried running both Windows/x86 qfix-native and qfix.exe from the link in toTOW's post. I keep getting an "Unknown version of <queue.dat> file" message. Any idea what I'm doing wrong? I noticed toTOW's message about an update for v.6, but don't see anything more recent there. Thanks for your help.

Re: Good WU gone bad?

Posted: Sun Apr 04, 2010 8:59 pm
by P5-133XL
Dick Howell, the original developer died and is no longer maintaining it. Another person took over development and this is his web site: LinuxMinded. It might have a more recent version than the one here and failing all else, there is a link to E-mail him from there.