Project: 5736 (Run 3, Clone 515, Gen 119)

Posted: Sat Sep 11, 2010 10:30 pm
by More_Fiber
I'm not sure if this is a WU issue or a GPU issue.

This WU seemed to hang at 64% for several hours.
Normally the GPU will process 1% of a WU in about 5 minutes and is supposed to checkpoint every 15 minutes.
After 90 minutes, I didn't see any evidence of progress or checkpoints.
I then shut down and restarted the client multiple times (each time waiting 1.5-2 hours) and saw the same lack of progress.

Am I jumping to the conclusion that it's hung prematurely? Or, if it's truly hung, how can I tell whether it is a WU or a GPU issue?
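One rough way to check the "is it hung?" part from the log alone is to measure how long ago the last "Completed N%" line was written. The following is only a sketch in Python, not part of the client. The function names are made up for illustration, and the threshold (six times the normal time per percent, i.e. 30 minutes at 5 minutes per 1%, or two missed checkpoints) is an assumption, not an official rule.

```python
import re

# Matches client log lines such as "[13:30:13] Completed 64%".
PROGRESS_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]\s+Completed\s+(\d+)%")


def last_progress(log_lines):
    """Return (seconds_since_midnight, percent) for the last progress line, or None."""
    last = None
    for line in log_lines:
        m = PROGRESS_RE.search(line)
        if m:
            hours, minutes, seconds, percent = (int(g) for g in m.groups())
            last = (hours * 3600 + minutes * 60 + seconds, percent)
    return last


def minutes_stalled(log_lines, now_seconds):
    """Minutes between the last progress line and now_seconds (seconds since midnight, UTC)."""
    last = last_progress(log_lines)
    if last is None:
        return None
    # The log only prints time of day, so tolerate one wrap past midnight.
    elapsed = (now_seconds - last[0]) % 86400
    return elapsed / 60.0


def looks_hung(log_lines, now_seconds, normal_minutes_per_percent=5.0, factor=6.0):
    """Assumed rule of thumb: no progress for factor x the normal time per percent."""
    stalled = minutes_stalled(log_lines, now_seconds)
    return stalled is not None and stalled > normal_minutes_per_percent * factor
```

With the last progress line at 13:30:13 and a check at 15:18:00, this reports a stall of roughly 108 minutes, well past the assumed 30-minute threshold. It says nothing about whether the WU or the GPU is at fault; it only confirms that the lack of progress is not normal for this card.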

GPU: ATI Radeon 4870
Catalyst Version: 09.11
Windows XP SP3

  • --- Opening Log file [September 11 08:47:58 UTC]
    [08:47:58]
    [08:47:58] Loaded queue successfully.
    [08:47:58] Initialization complete
    [08:47:58]
    [08:47:58] + Processing work unit
    [08:47:58] Core required: FahCore_11.exe
    [08:47:58] Core found.
    [08:47:58] Working on queue slot 04 [September 11 08:47:58 UTC]
    [08:47:58] + Working ...
    [08:47:58]
    [08:47:58] *------------------------------*
    [08:47:58] Folding@Home GPU Core - Beta
    [08:47:58] Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
    [08:47:58]
    [08:47:58] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
    [08:47:58] Build host: amoeba
    [08:47:58] Board Type: AMD
    [08:47:58] Core :
    [08:47:58] Preparing to commence simulation
    [08:47:58] - Ensuring status. Please wait.
    [08:48:07] - Looking at optimizations...
    [08:48:07] - Working with standard loops on this execution.
    [08:48:07] - Previous termination of core was improper.
    [08:48:07] - Files status OK
    [08:48:07] - Expanded 96714 -> 489152 (decompressed 505.7 percent)
    [08:48:07] Called DecompressByteArray: compressed_data_size=96714 data_size=489152, decompressed_data_size=489152 diff=0
    [08:48:07] - Digital signature verified
    [08:48:07]
    [08:48:07] Project: 5736 (Run 3, Clone 515, Gen 119)
    [08:48:07]
    [08:48:07] Entering M.D.
    [08:48:13] Will resume from checkpoint file
    [08:48:13] Tpr hash work/wudata_04.tpr: 1445190852 3527609112 2623324236 1655012693 199698481
    [08:48:14] Working on Protein
    [08:48:14] Client config found, loading data.
    [08:48:14] Starting GUI Server
    [08:48:19] Resuming from checkpoint
    [08:48:19] fcCheckPointResume: retreived and current tpr file hash:
    [08:48:19] 0 1445190852 1445190852
    [08:48:19] 1 3527609112 3527609112
    [08:48:19] 2 2623324236 2623324236
    [08:48:19] 3 1655012693 1655012693
    [08:48:19] 4 199698481 199698481
    [08:48:19] Verified work/wudata_04.log
    [08:48:19] Verified work/wudata_04.edr
    [08:48:19] Verified work/wudata_04.xtc
    [08:48:19] Completed 15%

    ---snip---

    [13:25:35] Completed 63%
    [13:30:13] Completed 64%
    [14:47:58] + Working...

    !!! 1 hours 45 minutes no progress

    Folding@Home Client Shutdown.


    --- Opening Log file [September 11 15:18:53 UTC]


    [15:18:53]
    [15:18:53] Loaded queue successfully.
    [15:18:53] Initialization complete
    [15:18:53]
    [15:18:53] + Processing work unit
    [15:18:53] Core required: FahCore_11.exe
    [15:18:53] Core found.
    [15:18:53] Working on queue slot 04 [September 11 15:18:53 UTC]
    [15:18:53] + Working ...
    [15:18:54]
    [15:18:54] *------------------------------*
    [15:18:54] Folding@Home GPU Core - Beta
    [15:18:54] Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
    [15:18:54]
    [15:18:54] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
    [15:18:54] Build host: amoeba
    [15:18:54] Board Type: AMD
    [15:18:54] Core :
    [15:18:54] Preparing to commence simulation
    [15:18:54] - Looking at optimizations...
    [15:18:54] - Files status OK
    [15:18:54] - Expanded 96714 -> 489152 (decompressed 505.7 percent)
    [15:18:54] Called DecompressByteArray: compressed_data_size=96714 data_size=489152, decompressed_data_size=489152 diff=0
    [15:18:54] - Digital signature verified
    [15:18:54]
    [15:18:54] Project: 5736 (Run 3, Clone 515, Gen 119)
    [15:18:54]
    [15:19:01] Assembly optimizations on if available.
    [15:19:01] Entering M.D.
    [15:19:15] Will resume from checkpoint file
    [15:19:15] Tpr hash work/wudata_04.tpr: 1445190852 3527609112 2623324236 1655012693 199698481
    [15:19:28] Working on Protein
    [15:20:01] Client config found, loading data.
    [15:20:05] Starting GUI Server
    [15:25:59] Resuming from checkpoint
    [15:25:59] fcCheckPointResume: retreived and current tpr file hash:
    [15:25:59] 0 1445190852 1445190852
    [15:25:59] 1 3527609112 3527609112
    [15:25:59] 2 2623324236 2623324236
    [15:25:59] 3 1655012693 1655012693
    [15:25:59] 4 199698481 199698481
    [15:25:59] Verified work/wudata_04.log
    [15:25:59] Verified work/wudata_04.edr
    [15:26:01] Verified work/wudata_04.xtc
    [15:26:07] Completed 64%

    !!! 1 hour 20 minutes no progress

    Folding@Home Client Shutdown.


    --- Opening Log file [September 11 16:49:09 UTC]


    [16:49:09]
    [16:49:09] Loaded queue successfully.
    [16:49:09] Initialization complete
    [16:49:09]
    [16:49:09] + Processing work unit
    [16:49:09] Core required: FahCore_11.exe
    [16:49:09] Core found.
    [16:49:09] Working on queue slot 04 [September 11 16:49:09 UTC]
    [16:49:09] + Working ...
    [16:49:09]
    [16:49:09] *------------------------------*
    [16:49:09] Folding@Home GPU Core - Beta
    [16:49:09] Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
    [16:49:09]
    [16:49:09] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
    [16:49:09] Build host: amoeba
    [16:49:09] Board Type: AMD
    [16:49:09] Core :
    [16:49:09] Preparing to commence simulation
    [16:49:09] - Looking at optimizations...
    [16:49:09] - Files status OK
    [16:49:09] - Expanded 96714 -> 489152 (decompressed 505.7 percent)
    [16:49:09] Called DecompressByteArray: compressed_data_size=96714 data_size=489152, decompressed_data_size=489152 diff=0
    [16:49:09] - Digital signature verified
    [16:49:09]
    [16:49:09] Project: 5736 (Run 3, Clone 515, Gen 119)
    [16:49:09]
    [16:49:30] Assembly optimizations on if available.
    [16:49:30] Entering M.D.
    [16:49:36] Will resume from checkpoint file
    [16:49:41] Tpr hash work/wudata_04.tpr: 1445190852 3527609112 2623324236 1655012693 199698481
    [16:50:22] Working on Protein
    [16:50:48] Client config found, loading data.
    [16:50:49] Starting GUI Server
    [16:57:34] Resuming from checkpoint
    [16:57:34] fcCheckPointResume: retreived and current tpr file hash:
    [16:57:34] 0 1445190852 1445190852
    [16:57:34] 1 3527609112 3527609112
    [16:57:34] 2 2623324236 2623324236
    [16:57:34] 3 1655012693 1655012693
    [16:57:34] 4 199698481 199698481
    [16:57:39] Verified work/wudata_04.log
    [16:57:39] Verified work/wudata_04.edr
    [16:57:39] Verified work/wudata_04.xtc
    [16:57:40] Completed 64%

    !!! 2 1/2 hours, no progress

    --- Opening Log file [September 11 19:37:08 UTC]


    [19:37:08]
    [19:37:09] Loaded queue successfully.
    [19:37:09] Initialization complete
    [19:37:09]
    [19:37:09] + Processing work unit
    [19:37:09] Core required: FahCore_11.exe
    [19:37:09] Core found.
    [19:37:09] Working on queue slot 04 [September 11 19:37:09 UTC]
    [19:37:09] + Working ...
    [19:37:10]
    [19:37:10] *------------------------------*
    [19:37:10] Folding@Home GPU Core - Beta
    [19:37:10] Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
    [19:37:10]
    [19:37:10] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
    [19:37:10] Build host: amoeba
    [19:37:10] Board Type: AMD
    [19:37:10] Core :
    [19:37:10] Preparing to commence simulation
    [19:37:10] - Ensuring status. Please wait.
    [19:37:20] - Looking at optimizations...
    [19:37:20] - Working with standard loops on this execution.
    [19:37:20] - Previous termination of core was improper.
    [19:37:20] - Files status OK
    [19:37:20] - Expanded 96714 -> 489152 (decompressed 505.7 percent)
    [19:37:20] Called DecompressByteArray: compressed_data_size=96714 data_size=489152, decompressed_data_size=489152 diff=0
    [19:37:20] - Digital signature verified
    [19:37:20]
    [19:37:20] Project: 5736 (Run 3, Clone 515, Gen 119)
    [19:37:20]
    [19:37:33] Entering M.D.
    [19:37:39] Will resume from checkpoint file
    [19:37:39] Tpr hash work/wudata_04.tpr: 1445190852 3527609112 2623324236 1655012693 199698481
    [19:38:22] Working on Protein
    [19:38:41] Client config found, loading data.
    [19:38:46] Starting GUI Server
    [19:45:22] Resuming from checkpoint
    [19:45:22] fcCheckPointResume: retreived and current tpr file hash:
    [19:45:22] 0 1445190852 1445190852
    [19:45:22] 1 3527609112 3527609112
    [19:45:22] 2 2623324236 2623324236
    [19:45:22] 3 1655012693 1655012693
    [19:45:22] 4 199698481 199698481
    [19:45:28] Verified work/wudata_04.log
    [19:45:28] Verified work/wudata_04.edr
    [19:45:31] Verified work/wudata_04.xtc
    [19:45:32] Completed 64%
    [19:45:32] mdrun_gpu returned
    [19:45:32] Calculated & specified T inconsisitent
    [19:45:32]
    [19:45:32] Folding@home Core Shutdown: UNSTABLE_MACHINE
    [19:45:46] CoreStatus = 7A (122)
    [19:45:46] Sending work to server
    [19:45:46] Project: 5736 (Run 3, Clone 515, Gen 119)
    [19:45:46] - Read packet limit of 540015616... Set to 524286976.


    [19:45:46] + Attempting to send results [September 11 19:45:46 UTC]
    [19:45:46] - Error: Could not read results file work/wuresults_04.dat from disk
    [19:45:46] - Error: Could not read unit 04 file. Removing from queue.
    [19:45:46] - Preparing to get new work unit...
    [19:45:46] + Attempting to get work packet
    [19:45:46] - Connecting to assignment server
    [19:45:47] - Successful: assigned to (171.64.65.102).
    [19:45:47] + News From Folding@Home: Welcome to Folding@Home
    [19:45:47] Loaded queue successfully.

    ---snip---

    [19:45:53] - Digital signature verified
    [19:45:54]
    [19:45:54] Project: 5736 (Run 3, Clone 515, Gen 119)

    !!! Same WU

    [19:45:54]
    [19:46:02] Assembly optimizations on if available.
    [19:46:02] Entering M.D.
    [19:46:08] Tpr hash work/wudata_05.tpr: 1445190852 3527609112 2623324236 1655012693 199698481
    [19:46:55] Working on Protein
    [19:47:15] Client config found, loading data.
    [19:47:21] Starting GUI Server

    !!! Over 2 hours - no progress

    Folding@Home Client Shutdown.

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Posted: Sun Sep 12, 2010 3:52 am
by PantherX
It processed the WU from 1% -> 64%, then hung. A couple of attempts later, it gave you an error:
[19:45:32] Completed 64%
[19:45:32] mdrun_gpu returned
[19:45:32] Calculated & specified T inconsisitent
[19:45:32]
[19:45:32] Folding@home Core Shutdown: UNSTABLE_MACHINE
[19:45:46] CoreStatus = 7A (122)

Then it could not read the wuresults file (not sure why), and then you were assigned the same WU, but this time it hung at 0%. I am guessing that something has changed in your setup, because a bad WU would normally error at the same place (64%), although you may have a unique WU that gives different errors. Nuke the queue.dat file and the work folder, then see if you are assigned a different WU and whether you can process it.
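To compare where each attempt failed, the log can be scanned for the last "Completed N%" line before every core shutdown. This is a minimal sketch in Python under the assumption that the log keeps the line formats shown in this thread; the function name is hypothetical and nothing here comes from the client itself.

```python
import re

PROGRESS_RE = re.compile(r"Completed\s+(\d+)%")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown:\s*(\w+)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def failure_points(log_lines):
    """Return a list of (last_percent, shutdown_reason, status_code) per core exit."""
    failures = []
    percent = None
    reason = None
    for line in log_lines:
        m = PROGRESS_RE.search(line)
        if m:
            percent = int(m.group(1))
            continue
        m = SHUTDOWN_RE.search(line)
        if m:
            reason = m.group(1)
            continue
        m = STATUS_RE.search(line)
        if m:
            hex_code, dec_code = m.group(1), int(m.group(2))
            # The log prints the code in hex and then in decimal; skip lines that disagree.
            if int(hex_code, 16) != dec_code:
                continue
            failures.append((percent, reason, dec_code))
            # Reset so a later failure with no progress lines shows up as None.
            percent = None
            reason = None
    return failures
```

For the excerpt above it yields one entry: 64%, UNSTABLE_MACHINE, status 122 (hex 7A). A hang that never reaches a core shutdown leaves no entry at all, so this only covers attempts that actually errored out.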

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Posted: Sun Sep 12, 2010 8:36 am
by More_Fiber
I deleted the queue.dat file and the work folder, restarted the GPU client and got the same #$%^ WU again, and it hung again.
THIS IS CRAP.

Now I understand when I look at donor statistics, why I see so many that have dropped out.

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Posted: Sun Sep 12, 2010 9:10 am
by PantherX
Changing the Machine ID might give you another WU.

BTW, I have folded ~3750 WUs and only had <10 WUs which were bad! The probability of getting them is very low (at least for me). However, if you do get another WU and the same thing happens, then I suggest that you check your setup. That said, I once got 2 bad WUs in a row on the Classic Client, so checking that everything is configured correctly might help eliminate possibilities and lead to the real problem, which can hopefully be solved.

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Posted: Sun Sep 12, 2010 9:30 am
by MtM
More_Fiber,

Forum readability goes up if you put the log entries in [code] [/code] tags ( prevents the heavy scrolling needed now ) :)

As PantherX said, it's not likely that the same WU errors at different points if your configuration is 100% stable. It looks like you might have an interfering process or a card which changes clocks while folding (this can also cause a 'hung'-like status; at least it did in the past for me on NVIDIA cards which switched between 2D, 3D low-power and 3D clocks while processing a work unit).

It might help to put your card in fixed mode (3D, of course).

Also, for the first 10 or so WUs, please run your card at stock clocks (you don't mention anything about it, but I'm guessing you might have it at an overclock you think is stable since 3D rendering seems stable; however, 3D rendering != folding).

Also, to help with debugging, add -verbosity 9 to the extra parameters and do not remove snippets from the log which you think are not relevant. Leave that to the people here; a lot of the time people omit things which are a telltale sign of what went wrong :)

So quick rehash:
  • Set stock clocks
  • Set fixed clocks
  • Delete everything except the client, rerun the config process (don't forget -verbosity 9!! and optionally a new machine ID)
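For the narrower reset mentioned earlier in the thread (queue.dat plus the work folder), a small helper can make the delete step repeatable. This is a hedged sketch in Python: the function name and the idea of passing the client directory as an argument are assumptions for illustration, the client should be stopped first, and it defaults to a dry run so nothing is deleted until you ask for it.

```python
import shutil
from pathlib import Path


def reset_client_state(client_dir, dry_run=True):
    """Remove queue.dat and the work folder under client_dir.

    Returns the names that were removed (or would be removed in a dry run).
    Everything else in client_dir is left untouched.
    """
    base = Path(client_dir)
    targets = [base / "queue.dat", base / "work"]
    removed = []
    for target in targets:
        if not target.exists():
            continue
        removed.append(target.name)
        if dry_run:
            continue
        if target.is_dir():
            shutil.rmtree(target)  # the work folder and all WU files inside it
        else:
            target.unlink()  # queue.dat
    return removed
```

Calling it once with the default dry run shows what would go; calling it again with dry_run=False does the actual delete. It does not rerun the config process or change the machine ID, which still have to be done by hand.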

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Posted: Sun Sep 12, 2010 9:54 am
by toTOW
GPU WUs will be assigned 6 times before the server understands that you can't fold it and move to another one ... so you might have to repeat the delete procedure 6 times.

There is only one report for this WU in the DB, and it looks like an immediate EUE ... I think it's safe to consider it as a bad WU.
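To keep track of how many of those 6 assignments have already been used up, the log can be scanned for the first Project line after each "Attempting to get work packet" line. The sketch below is in Python and rests on assumptions: that the log keeps the format shown in this thread, that the first Project line after a fetch identifies the newly assigned WU, and that the log has not been deleted between attempts. The function names are hypothetical.

```python
import re
from collections import Counter

PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def assigned_units(log_lines):
    """Return the (project, run, clone, gen) tuple seen after each work fetch."""
    units = []
    waiting_for_project = False
    for line in log_lines:
        if "Attempting to get work packet" in line:
            waiting_for_project = True
            continue
        m = PRCG_RE.search(line)
        if m and waiting_for_project:
            units.append(tuple(int(g) for g in m.groups()))
            waiting_for_project = False
    return units


def attempts_left(log_lines, unit, limit=6):
    """Assignments of `unit` still expected before the server moves on (limit as quoted above)."""
    seen = Counter(assigned_units(log_lines))[unit]
    return max(0, limit - seen)
```

Project lines printed on a plain client restart are ignored because no fetch precedes them, so only real re-assignments are counted. An assignment made before the log begins is not visible to this count.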

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Posted: Sun Sep 12, 2010 9:59 am
by MtM
toTOW wrote:GPU WUs will be assigned 6 times before the server understands that you can't fold it and move to another one ... so you might have to repeat the delete procedure 6 times.

There is only one report for this WU in the DB, and it looks like an immediate EUE ... I think it's safe to consider it as a bad WU.
But he got to 64%, right, not 0%... at least on one try?

Edit: not saying you're wrong, but since it's a new user (2 posts; I can't see if he returned WUs before, but you should be able to) and given the two different points at which he errored out, I thought starting with some basics would be 'best' :)

And if it were his entry in the DB, I think it's safe to ignore it for now, as it could just as well be a problem with his setup/configuration?

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Posted: Sun Sep 12, 2010 10:20 am
by toTOW
I didn't say those were his results in the DB ... in fact, with the error I see in the log, he never returned any results because of this:
[19:45:32] Completed 64%
[19:45:32] mdrun_gpu returned
[19:45:32] Calculated & specified T inconsisitent
[19:45:32]
[19:45:32] Folding@home Core Shutdown: UNSTABLE_MACHINE
[19:45:46] CoreStatus = 7A (122)
[19:45:46] Sending work to server
[19:45:46] Project: 5736 (Run 3, Clone 515, Gen 119)
[19:45:46] - Read packet limit of 540015616... Set to 524286976.


[19:45:46] + Attempting to send results [September 11 19:45:46 UTC]
[19:45:46] - Error: Could not read results file work/wuresults_04.dat from disk
[19:45:46] - Error: Could not read unit 04 file. Removing from queue.

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Posted: Sun Sep 12, 2010 10:22 am
by MtM
Yeah sorry you're right :)

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Posted: Sun Sep 12, 2010 5:06 pm
by More_Fiber
MtM wrote:Also, for the first 10 or so wu's please run your card at stock clocks ( you don't mention anything about it, but I'm guessing you might have it at an overclock
Not overclocked. 105 GPU WUs submitted with no issues. First post here since this is the first issue I've had that I couldn't resolve through FAQs, etc.
There should not have been any apps running that would have changed settings on the GPU. I always shut down the client when running games.
toTOW wrote:GPU WUs will be assigned 6 times before the server understands that you can't fold it and move to another one.
It's good to know that there is a limit to the number of times that a WU will be resent, although 6 seems high and certainly gave the impression of no limit.
3 seems like a more reasonable number of attempts to the same user. I was getting really frustrated when the assignment server kept sending the same WU.