Project: 5736 (Run 3, Clone 515, Gen 119)

More_Fiber
Posts: 9
Joined: Sat Sep 11, 2010 9:48 pm

Project: 5736 (Run 3, Clone 515, Gen 119)

Post by More_Fiber »

I'm not sure if this is a WU issue or a GPU issue.

This WU seemed to hang at 64% for several hours.
Normally the GPU will process 1% of a WU in about 5 minutes and is supposed to checkpoint every 15 minutes.
After 90 minutes, I didn't see any evidence of progress or checkpoints.
I then shut down and restarted the client multiple times (each time waiting 1.5-2 hours) and saw the same lack of progress.
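One rough way to answer "is it really hung?" is to measure the time since the last `Completed N%` line against the normal pace of about 5 minutes per 1% mentioned above. This is a minimal sketch that assumes the `[HH:MM:SS] Completed N%` line format shown in the log. The function names and the 3x threshold are hypothetical choices, not client settings.

```python
import re
from datetime import datetime, timedelta

# Progress lines look like "[13:30:13] Completed 64%" in the log (format assumed).
PROGRESS_RE = re.compile(r"\[(\d{2}:\d{2}:\d{2})\] Completed (\d+)%")


def last_progress(log_lines):
    """Return (time_of_day, percent) for the most recent progress line, or None."""
    last = None
    for line in log_lines:
        m = PROGRESS_RE.search(line)
        if m:
            last = (datetime.strptime(m.group(1), "%H:%M:%S"), int(m.group(2)))
    return last


def looks_hung(log_lines, now_hms, minutes_per_percent=5.0, factor=3.0):
    """Hypothetical heuristic: flag a stall when no new percentage has appeared
    for `factor` times the normal per-percent time."""
    last = last_progress(log_lines)
    if last is None:
        return False  # no progress line yet, so nothing to compare against
    gap = datetime.strptime(now_hms, "%H:%M:%S") - last[0]
    if gap < timedelta(0):
        gap += timedelta(days=1)  # crude handling of a rollover past midnight
    return gap > timedelta(minutes=minutes_per_percent * factor)


log = [
    "[13:25:35] Completed 63%",
    "[13:30:13] Completed 64%",
    "[14:47:58] + Working...",
]
print(looks_hung(log, "15:15:00"))  # True: about 105 minutes with no new percentage
```

After a client restart, the core re-prints the checkpointed percentage (for example `[15:26:07] Completed 64%`). Because of that, the gap after a restart is measured from that line rather than from the original 13:30 entry.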

Am I jumping to conclusions by calling it hung? If it's truly hung, how can I tell whether it's a WU issue or a GPU issue?

GPU: ATI Radeon 4870
Catalyst Version: 09.11
Windows XP SP3

    --- Opening Log file [September 11 08:47:58 UTC]
    [08:47:58]
    [08:47:58] Loaded queue successfully.
    [08:47:58] Initialization complete
    [08:47:58]
    [08:47:58] + Processing work unit
    [08:47:58] Core required: FahCore_11.exe
    [08:47:58] Core found.
    [08:47:58] Working on queue slot 04 [September 11 08:47:58 UTC]
    [08:47:58] + Working ...
    [08:47:58]
    [08:47:58] *------------------------------*
    [08:47:58] Folding@Home GPU Core - Beta
    [08:47:58] Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
    [08:47:58]
    [08:47:58] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
    [08:47:58] Build host: amoeba
    [08:47:58] Board Type: AMD
    [08:47:58] Core :
    [08:47:58] Preparing to commence simulation
    [08:47:58] - Ensuring status. Please wait.
    [08:48:07] - Looking at optimizations...
    [08:48:07] - Working with standard loops on this execution.
    [08:48:07] - Previous termination of core was improper.
    [08:48:07] - Files status OK
    [08:48:07] - Expanded 96714 -> 489152 (decompressed 505.7 percent)
    [08:48:07] Called DecompressByteArray: compressed_data_size=96714 data_size=489152, decompressed_data_size=489152 diff=0
    [08:48:07] - Digital signature verified
    [08:48:07]
    [08:48:07] Project: 5736 (Run 3, Clone 515, Gen 119)
    [08:48:07]
    [08:48:07] Entering M.D.
    [08:48:13] Will resume from checkpoint file
    [08:48:13] Tpr hash work/wudata_04.tpr: 1445190852 3527609112 2623324236 1655012693 199698481
    [08:48:14] Working on Protein
    [08:48:14] Client config found, loading data.
    [08:48:14] Starting GUI Server
    [08:48:19] Resuming from checkpoint
    [08:48:19] fcCheckPointResume: retreived and current tpr file hash:
    [08:48:19] 0 1445190852 1445190852
    [08:48:19] 1 3527609112 3527609112
    [08:48:19] 2 2623324236 2623324236
    [08:48:19] 3 1655012693 1655012693
    [08:48:19] 4 199698481 199698481
    [08:48:19] Verified work/wudata_04.log
    [08:48:19] Verified work/wudata_04.edr
    [08:48:19] Verified work/wudata_04.xtc
    [08:48:19] Completed 15%

    ---snip---

    [13:25:35] Completed 63%
    [13:30:13] Completed 64%
    [14:47:58] + Working...

    !!! 1 hour 45 minutes, no progress

    Folding@Home Client Shutdown.


    --- Opening Log file [September 11 15:18:53 UTC]


    [15:18:53]
    [15:18:53] Loaded queue successfully.
    [15:18:53] Initialization complete
    [15:18:53]
    [15:18:53] + Processing work unit
    [15:18:53] Core required: FahCore_11.exe
    [15:18:53] Core found.
    [15:18:53] Working on queue slot 04 [September 11 15:18:53 UTC]
    [15:18:53] + Working ...
    [15:18:54]
    [15:18:54] *------------------------------*
    [15:18:54] Folding@Home GPU Core - Beta
    [15:18:54] Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
    [15:18:54]
    [15:18:54] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
    [15:18:54] Build host: amoeba
    [15:18:54] Board Type: AMD
    [15:18:54] Core :
    [15:18:54] Preparing to commence simulation
    [15:18:54] - Looking at optimizations...
    [15:18:54] - Files status OK
    [15:18:54] - Expanded 96714 -> 489152 (decompressed 505.7 percent)
    [15:18:54] Called DecompressByteArray: compressed_data_size=96714 data_size=489152, decompressed_data_size=489152 diff=0
    [15:18:54] - Digital signature verified
    [15:18:54]
    [15:18:54] Project: 5736 (Run 3, Clone 515, Gen 119)
    [15:18:54]
    [15:19:01] Assembly optimizations on if available.
    [15:19:01] Entering M.D.
    [15:19:15] Will resume from checkpoint file
    [15:19:15] Tpr hash work/wudata_04.tpr: 1445190852 3527609112 2623324236 1655012693 199698481
    [15:19:28] Working on Protein
    [15:20:01] Client config found, loading data.
    [15:20:05] Starting GUI Server
    [15:25:59] Resuming from checkpoint
    [15:25:59] fcCheckPointResume: retreived and current tpr file hash:
    [15:25:59] 0 1445190852 1445190852
    [15:25:59] 1 3527609112 3527609112
    [15:25:59] 2 2623324236 2623324236
    [15:25:59] 3 1655012693 1655012693
    [15:25:59] 4 199698481 199698481
    [15:25:59] Verified work/wudata_04.log
    [15:25:59] Verified work/wudata_04.edr
    [15:26:01] Verified work/wudata_04.xtc
    [15:26:07] Completed 64%

    !!! 1 hour 20 minutes no progress

    Folding@Home Client Shutdown.


    --- Opening Log file [September 11 16:49:09 UTC]


    [16:49:09]
    [16:49:09] Loaded queue successfully.
    [16:49:09] Initialization complete
    [16:49:09]
    [16:49:09] + Processing work unit
    [16:49:09] Core required: FahCore_11.exe
    [16:49:09] Core found.
    [16:49:09] Working on queue slot 04 [September 11 16:49:09 UTC]
    [16:49:09] + Working ...
    [16:49:09]
    [16:49:09] *------------------------------*
    [16:49:09] Folding@Home GPU Core - Beta
    [16:49:09] Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
    [16:49:09]
    [16:49:09] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
    [16:49:09] Build host: amoeba
    [16:49:09] Board Type: AMD
    [16:49:09] Core :
    [16:49:09] Preparing to commence simulation
    [16:49:09] - Looking at optimizations...
    [16:49:09] - Files status OK
    [16:49:09] - Expanded 96714 -> 489152 (decompressed 505.7 percent)
    [16:49:09] Called DecompressByteArray: compressed_data_size=96714 data_size=489152, decompressed_data_size=489152 diff=0
    [16:49:09] - Digital signature verified
    [16:49:09]
    [16:49:09] Project: 5736 (Run 3, Clone 515, Gen 119)
    [16:49:09]
    [16:49:30] Assembly optimizations on if available.
    [16:49:30] Entering M.D.
    [16:49:36] Will resume from checkpoint file
    [16:49:41] Tpr hash work/wudata_04.tpr: 1445190852 3527609112 2623324236 1655012693 199698481
    [16:50:22] Working on Protein
    [16:50:48] Client config found, loading data.
    [16:50:49] Starting GUI Server
    [16:57:34] Resuming from checkpoint
    [16:57:34] fcCheckPointResume: retreived and current tpr file hash:
    [16:57:34] 0 1445190852 1445190852
    [16:57:34] 1 3527609112 3527609112
    [16:57:34] 2 2623324236 2623324236
    [16:57:34] 3 1655012693 1655012693
    [16:57:34] 4 199698481 199698481
    [16:57:39] Verified work/wudata_04.log
    [16:57:39] Verified work/wudata_04.edr
    [16:57:39] Verified work/wudata_04.xtc
    [16:57:40] Completed 64%

    !!! 2 1/2 hours, no progress

    --- Opening Log file [September 11 19:37:08 UTC]


    [19:37:08]
    [19:37:09] Loaded queue successfully.
    [19:37:09] Initialization complete
    [19:37:09]
    [19:37:09] + Processing work unit
    [19:37:09] Core required: FahCore_11.exe
    [19:37:09] Core found.
    [19:37:09] Working on queue slot 04 [September 11 19:37:09 UTC]
    [19:37:09] + Working ...
    [19:37:10]
    [19:37:10] *------------------------------*
    [19:37:10] Folding@Home GPU Core - Beta
    [19:37:10] Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
    [19:37:10]
    [19:37:10] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
    [19:37:10] Build host: amoeba
    [19:37:10] Board Type: AMD
    [19:37:10] Core :
    [19:37:10] Preparing to commence simulation
    [19:37:10] - Ensuring status. Please wait.
    [19:37:20] - Looking at optimizations...
    [19:37:20] - Working with standard loops on this execution.
    [19:37:20] - Previous termination of core was improper.
    [19:37:20] - Files status OK
    [19:37:20] - Expanded 96714 -> 489152 (decompressed 505.7 percent)
    [19:37:20] Called DecompressByteArray: compressed_data_size=96714 data_size=489152, decompressed_data_size=489152 diff=0
    [19:37:20] - Digital signature verified
    [19:37:20]
    [19:37:20] Project: 5736 (Run 3, Clone 515, Gen 119)
    [19:37:20]
    [19:37:33] Entering M.D.
    [19:37:39] Will resume from checkpoint file
    [19:37:39] Tpr hash work/wudata_04.tpr: 1445190852 3527609112 2623324236 1655012693 199698481
    [19:38:22] Working on Protein
    [19:38:41] Client config found, loading data.
    [19:38:46] Starting GUI Server
    [19:45:22] Resuming from checkpoint
    [19:45:22] fcCheckPointResume: retreived and current tpr file hash:
    [19:45:22] 0 1445190852 1445190852
    [19:45:22] 1 3527609112 3527609112
    [19:45:22] 2 2623324236 2623324236
    [19:45:22] 3 1655012693 1655012693
    [19:45:22] 4 199698481 199698481
    [19:45:28] Verified work/wudata_04.log
    [19:45:28] Verified work/wudata_04.edr
    [19:45:31] Verified work/wudata_04.xtc
    [19:45:32] Completed 64%
    [19:45:32] mdrun_gpu returned
    [19:45:32] Calculated & specified T inconsisitent
    [19:45:32]
    [19:45:32] Folding@home Core Shutdown: UNSTABLE_MACHINE
    [19:45:46] CoreStatus = 7A (122)
    [19:45:46] Sending work to server
    [19:45:46] Project: 5736 (Run 3, Clone 515, Gen 119)
    [19:45:46] - Read packet limit of 540015616... Set to 524286976.


    [19:45:46] + Attempting to send results [September 11 19:45:46 UTC]
    [19:45:46] - Error: Could not read results file work/wuresults_04.dat from disk
    [19:45:46] - Error: Could not read unit 04 file. Removing from queue.
    [19:45:46] - Preparing to get new work unit...
    [19:45:46] + Attempting to get work packet
    [19:45:46] - Connecting to assignment server
    [19:45:47] - Successful: assigned to (171.64.65.102).
    [19:45:47] + News From Folding@Home: Welcome to Folding@Home
    [19:45:47] Loaded queue successfully.

    ---snip---

    [19:45:53] - Digital signature verified
    [19:45:54]
    [19:45:54] Project: 5736 (Run 3, Clone 515, Gen 119)

    !!! Same WU

    [19:45:54]
    [19:46:02] Assembly optimizations on if available.
    [19:46:02] Entering M.D.
    [19:46:08] Tpr hash work/wudata_05.tpr: 1445190852 3527609112 2623324236 1655012693 199698481
    [19:46:55] Working on Protein
    [19:47:15] Client config found, loading data.
    [19:47:21] Starting GUI Server

    !!! Over 2 hours - no progress

    Folding@Home Client Shutdown.
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Post by PantherX »

It processed the WU from 1% to 64% and then hung. A couple of attempts later, it gave you an error:
[19:45:32] Completed 64%
[19:45:32] mdrun_gpu returned
[19:45:32] Calculated & specified T inconsisitent
[19:45:32]
[19:45:32] Folding@home Core Shutdown: UNSTABLE_MACHINE
[19:45:46] CoreStatus = 7A (122)

Then it couldn't read the wuresults file (not sure why), and you were assigned the same WU again, but this time it hung at 0%. I am guessing that something has changed in your setup, because if it were a bad WU, it would error at the same place (64%). Then again, you may have a unique WU that gives different errors. Nuke the queue.dat file and the work folder. Then see whether you are assigned a different WU and whether you can process it.
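PantherX's reasoning is this: a bad WU should fail at the same point each time, while a setup problem tends to fail at different points. You can check that mechanically. Split the log into sessions at each `--- Opening Log file` header, then record the WU and the last progress percentage reached in each session. This is a minimal sketch that assumes the log format shown in this thread. `stall_points` is a hypothetical helper, not part of the client.

```python
import re

# Line formats assumed from the log in this thread.
SESSION_RE = re.compile(r"--- Opening Log file \[(.+?)\]")
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
PROGRESS_RE = re.compile(r"Completed (\d+)%")


def stall_points(log_lines):
    """Split a log into client sessions and record the WU and last % reached in each."""
    sessions = []
    for line in log_lines:
        header = SESSION_RE.search(line)
        if header:
            sessions.append({"opened": header.group(1), "wu": None, "last_pct": None})
            continue
        if not sessions:
            continue  # ignore anything logged before the first session header
        project = PROJECT_RE.search(line)
        progress = PROGRESS_RE.search(line)
        if project:
            sessions[-1]["wu"] = tuple(int(g) for g in project.groups())
        elif progress:
            sessions[-1]["last_pct"] = int(progress.group(1))
    return sessions


log = [
    "--- Opening Log file [September 11 15:18:53 UTC]",
    "[15:18:54] Project: 5736 (Run 3, Clone 515, Gen 119)",
    "[15:26:07] Completed 64%",
    "--- Opening Log file [September 11 19:37:08 UTC]",
    "[19:37:20] Project: 5736 (Run 3, Clone 515, Gen 119)",
    "[19:45:32] Completed 64%",
]
for s in stall_points(log):
    print(s["opened"], s["wu"], s["last_pct"])
```

Suppose every session on the same Project/Run/Clone/Gen stops at the same percentage. That points toward the WU. Stalls at varying percentages point toward the setup. A session that never prints a `Completed` line, like the reassigned WU that sat at 0%, shows `None`.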
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
More_Fiber

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Post by More_Fiber »

I deleted the queue.dat file and the work folder, restarted the GPU client and got the same #$%^ WU again, and it hung again.
THIS IS CRAP.

Now I understand why, when I look at donor statistics, I see so many that have dropped out.
PantherX

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Post by PantherX »

Changing the Machine ID might give you another WU.

BTW, I have folded ~3750 WUs and had fewer than 10 bad ones! The probability of getting one is very low (at least for me). However, if you do get another WU and the same thing happens, then I suggest that you check your setup. That said, I once got 2 bad WUs in a row on the Classic Client, so it does happen. Still, checking that everything is configured correctly helps eliminate possible causes and narrows things down to the real problem, which can hopefully be solved.
MtM
Posts: 1579
Joined: Fri Jun 27, 2008 2:20 pm
Hardware configuration: Q6600 - 8gb - p5q deluxe - gtx275 - hd4350 ( not folding ) win7 x64 - smp:4 - gpu slot
E6600 - 4gb - p5wdh deluxe - 9600gt - 9600gso - win7 x64 - smp:2 - 2 gpu slots
E2160 - 2gb - ?? - onboard gpu - win7 x32 - 2 uniprocessor slots
T5450 - 4gb - ?? - 8600M GT 512 ( DDR2 ) - win7 x64 - smp:2 - gpu slot
Location: The Netherlands

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Post by MtM »

More_Fiber,

Forum readability goes up if you put the log entries in [code] [/code] tags (it prevents the heavy scrolling needed now) :)

As PantherX said, if your configuration is 100% stable, it's unlikely that the same WU would error out at different points. It looks like you might have an interfering process, or a card that changes clocks while folding. This can also cause a 'hung'-like status; at least it did for me in the past on Nvidia cards that switched between 2D, 3D low-power and 3D clocks while processing a work unit.

It might help to put your card in a fixed clock mode (3D, of course).

Also, for the first 10 or so WUs, please run your card at stock clocks. You don't mention anything about it, but I'm guessing you might be running an overclock that you think is stable because 3D rendering seems stable... however, 3D rendering != folding.

Also, to help with debugging, add -verbosity 9 to the extra parameters, and do not remove snippets from the log that you think are not relevant. Leave that to the people here; a lot of times people omit things that are a telltale sign of what went wrong :)

So, a quick rehash:
  • Set stock clocks
  • Set fixed clocks
  • Delete everything except the client and rerun the config process (don't forget -verbosity 9!!), optionally with a new machine ID
toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Post by toTOW »

GPU WUs will be assigned 6 times before the server understands that you can't fold one and moves on to another ... so you might have to repeat the delete procedure 6 times.

There is only one report for this WU in the DB, and it looks like an immediate EUE ... I think it's safe to consider it a bad WU.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
MtM

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Post by MtM »

toTOW wrote:GPU WUs will be assigned 6 times before the server understands that you can't fold it and move to another one ... so you might have to repeat the delete procedure 6 times.

There is only one report for this WU in the DB, and it looks like an immediate EUE ... I think it's safe to consider it as a bad WU.
But he got to 64%, right, not 0%... at least on one try?

Edit: I'm not saying you're wrong. But it's a new user (2 posts; I can't see whether he has returned WUs before, but you should be able to), and he errored out at two different points, so I thought starting with some basics would be 'best' :)

And if that DB entry is his, I think it's safe to ignore it for now, as it could just as well be a problem with his setup/configuration?
toTOW

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Post by toTOW »

I didn't say those were his results in the DB ... in fact, given the error I see in the log, he never returned any results, because of this:
[19:45:32] Completed 64%
[19:45:32] mdrun_gpu returned
[19:45:32] Calculated & specified T inconsisitent
[19:45:32]
[19:45:32] Folding@home Core Shutdown: UNSTABLE_MACHINE
[19:45:46] CoreStatus = 7A (122)
[19:45:46] Sending work to server
[19:45:46] Project: 5736 (Run 3, Clone 515, Gen 119)
[19:45:46] - Read packet limit of 540015616... Set to 524286976.


[19:45:46] + Attempting to send results [September 11 19:45:46 UTC]
[19:45:46] - Error: Could not read results file work/wuresults_04.dat from disk
[19:45:46] - Error: Could not read unit 04 file. Removing from queue.
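The quoted lines contain the whole story. There is a core shutdown reason, a status code printed in both hex and decimal (7A = 122), and a results file that could not be read, so nothing was uploaded. When you are scanning a long log for these markers, a small sketch like the following can help. It assumes the exact line formats quoted above. `summarize_failure` is a hypothetical helper, not part of the client.

```python
import re

# Line formats assumed from the quoted log; not an official log specification.
SHUTDOWN_RE = re.compile(r"Core Shutdown: (\w+)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")
NO_RESULTS_RE = re.compile(r"Could not read results file (\S+)")


def summarize_failure(log_lines):
    """Collect the shutdown reason, the core status code and any unreadable results file."""
    summary = {"shutdown": None, "status": None, "missing_results": None}
    for line in log_lines:
        shutdown = SHUTDOWN_RE.search(line)
        status = STATUS_RE.search(line)
        missing = NO_RESULTS_RE.search(line)
        if shutdown:
            summary["shutdown"] = shutdown.group(1)
        elif status:
            hex_code, dec_code = int(status.group(1), 16), int(status.group(2))
            # The client prints the code in hex and decimal; make sure both agree.
            if hex_code != dec_code:
                raise ValueError(f"inconsistent status code: {status.group(0)}")
            summary["status"] = dec_code
        elif missing:
            summary["missing_results"] = missing.group(1)
    return summary


log = [
    "[19:45:32] Folding@home Core Shutdown: UNSTABLE_MACHINE",
    "[19:45:46] CoreStatus = 7A (122)",
    "[19:45:46] - Error: Could not read results file work/wuresults_04.dat from disk",
]
print(summarize_failure(log))
# {'shutdown': 'UNSTABLE_MACHINE', 'status': 122, 'missing_results': 'work/wuresults_04.dat'}
```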
MtM

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Post by MtM »

Yeah, sorry, you're right :)
More_Fiber

Re: Project: 5736 (Run 3, Clone 515, Gen 119)

Post by More_Fiber »

MtM wrote:Also, for the first 10 or so wu's please run your card at stock clocks ( you don't mention anything about it, but I'm guessing you might have it at an overclock
Not overclocked. I've submitted 105 GPU WUs with no issues. This is my first post here, since this is the first issue I've had that I couldn't resolve through the FAQs, etc.
There shouldn't have been any apps running that would have changed settings on the GPU. I always shut down the client when running games.
toTOW wrote:GPU WUs will be assigned 6 times before the server understands that you can't fold it and move to another one.
It's good to know that there is a limit to the number of times that a WU will be resent, although 6 seems high and certainly gave the impression of no limit.
3 seems like a more reasonable number of attempts to the same user. I was getting really frustrated when the assignment server kept sending the same WU.