Unstable Machine/Hanging P10501
Posted: Thu Jun 21, 2012 12:53 pm
by GenDrexler
Of late, I've been getting this WU (P10501 R315, C0, G352) repeatedly, which almost always results in an unstable machine or a hung client; or, if it does actually complete the unit, the core terminates improperly, leaving multiple instances running that conflict with each other and prevent any of them from running properly.
I've observed this behaviour for every P10501 WU over a couple of months (deleting the work folder, queue etc. usually did the trick for getting a different WU), on different GPUs in different systems using various driver versions, and it occurs whether running overclocked or at stock. It's the only WU that fails or otherwise gives issues. Recently, even after doing the troubleshooting steps, the client keeps receiving the same WU.
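One way to confirm the "multiple instances left running" symptom is to count live core processes. The following is a minimal sketch, assuming Windows and the standard `tasklist /FO CSV /NH` output; the helper names and the sample rows (PIDs, memory figures) are made up for illustration and are not taken from any real machine.

```python
import csv
import io
import subprocess


def count_fahcore(tasklist_csv: str, prefix: str = "fahcore_") -> int:
    """Count rows of `tasklist /FO CSV /NH` output whose image name
    starts with `prefix` (case-insensitive)."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return sum(1 for row in rows if row and row[0].lower().startswith(prefix))


def running_fahcores() -> int:
    """Ask Windows for the live process list and count the cores.
    Only works on Windows, where `tasklist` is available."""
    out = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_fahcore(out)


# Illustrative sample only: two leftover cores plus an unrelated process.
sample = (
    '"FahCore_11.exe","4656","Console","1","85,120 K"\n'
    '"FahCore_11.exe","5044","Console","1","84,992 K"\n'
    '"explorer.exe","1234","Console","1","50,000 K"\n'
)
print(count_fahcore(sample))  # prints 2
```

On a system with a single GPU slot, a count above one after a unit finishes would be consistent with a core that did not exit cleanly, though that interpretation is an assumption rather than documented client behaviour.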
Log file:
https://dl.dropbox.com/u/5364921/FAHlog.txt - from system two
Client v6.41 Core 11 (non-Fermi)
Drivers tried: 286, 266, 290.53 (most stable), 295 - newer drivers have proved to be unstable/unreliable
OS: Win 7 Ult x64
System One: Toshiba x505 - i7-720qm, GTS 250m, 8gb ram
System Two: MSI G31TM-P21, Q9400, 4Gb ram, 9800GTX
System Three: ASUS P8Z68 Deluxe, i7-2600k, 16GB ram, GTX 570, 9500GT
Re: Unstable Machine/Hanging P10501
Posted: Thu Jun 21, 2012 7:30 pm
by bruce
Welcome to foldingforum.org, GenDrexler.
Mixing a Fermi GPU and a non-Fermi GPU (system 3) is rather difficult to do in V6 and greatly improved but still not perfect in V7. Have you considered upgrading to the V7 client? That would simplify the debugging process and is very likely to work immediately on installation. (You'll only need one installed client per system.)
(If you have a strong reason NOT to upgrade, we can still try to support your systems.)
Re: Unstable Machine/Hanging P10501
Posted: Thu Jun 21, 2012 7:57 pm
by 7im
V7 and the 301.xx drivers, as reported by other people here in the forum, seem to work very well for most people on most hardware, and actually fix some power saving mode/crashing issues.
Re: Unstable Machine/Hanging P10501
Posted: Fri Jun 22, 2012 6:15 am
by GenDrexler
Thanks.
Fermi and non-Fermi have been working pretty well for me on v6. It didn't work at all with the 9800GTX and 570, but the 9500GT and 570 work like a charm; it's just this one particular unit that gives trouble. In v7, however, it didn't work at all when I last tried about a month or so ago.
I'm currently only gpu folding on system two with the 9800GTX, but I've experienced this issue pretty much every time I've gotten this particular project on all three in the past. Usually I'd just delete the work data and it would be on its merry way, and I'd maybe see this project once a month (folding 24/7), but for the past week it's repeatedly been receiving these 10501 projects.
Which makes me think it's the WU itself, since they fail to even do a single frame most of the time, but I'm not seeing anyone else having any issues with it.
I think I did try 295/296 WHQL a while back but had stability issues, so I switched back to 290.53. I did update to the latest (301.42) and am getting the same thing again, but with a 10503 (this is for the 9800GTX@stock):
Code:
[05:01:49] Folding@Home GPU Core
[05:01:49] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[05:01:49]
[05:01:49] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[05:01:49] Build host: amoeba
[05:01:49] Board Type: Nvidia
[05:01:49] Core :
[05:01:49] Preparing to commence simulation
[05:01:49] - Looking at optimizations...
[05:01:49] DeleteFrameFiles: successfully deleted file=work/wudata_01.ckp
[05:01:49] - Created dyn
[05:01:49] - Files status OK
[05:01:49] - Expanded 62896 -> 336799 (decompressed 535.4 percent)
[05:01:49] Called DecompressByteArray: compressed_data_size=62896 data_size=336799, decompressed_data_size=336799 diff=0
[05:01:49] - Digital signature verified
[05:01:49]
[05:01:49] Project: 10503 (Run 61, Clone 0, Gen 194)
[05:01:49]
[05:01:49] Assembly optimizations on if available.
[05:01:49] Entering M.D.
[05:01:55] Tpr hash work/wudata_01.tpr: 3529562172 3972086480 3049920976 2911844281 64716392
[05:01:55]
[05:01:55] Calling fah_main args: 14 usage=100
[05:01:55]
[05:02:02] Working on Protein
[05:02:03] mdrun_gpu returned
[05:02:03] Going to send back what have done -- stepsTotalG=0
[05:02:03] Work fraction=0.0000 steps=0.
[05:02:07] logfile size=1245 infoLength=1245 edr=0 trr=25
[05:02:07] + Opened results file
[05:02:07] - Writing 1783 bytes of core data to disk...
[05:02:07] Done: 1271 -> 714 (compressed to 56.1 percent)
[05:02:07] ... Done.
[05:02:07] DeleteFrameFiles: successfully deleted file=work/wudata_01.ckp
[05:02:10]
[05:02:10] Folding@home Core Shutdown: UNSTABLE_MACHINE
Again, I only have this issue with this particular project and all other WUs/projects run perfectly fine.
MemTestG80 (200 iterations w/200Mb) reports no errors at stock or max OC.
I could switch to v7, but won't know if it works until I get one of these units again, currently folding a P5766 in v7.
Re: Unstable Machine/Hanging P10501
Posted: Fri Jun 22, 2012 11:16 am
by PantherX
The WU isn't a bad one, as it was successfully completed by another donor as shown below:
Your WU (P10503 R61 C0 G194) was added to the stats database on 2012-06-21 14:00:54 for 587 points of credit.
Hopefully, V7 might solve this issue. Also, how exactly are you installing/uninstalling the Nvidia drivers?
Re: Unstable Machine/Hanging P10501
Posted: Fri Jun 22, 2012 4:17 pm
by GenDrexler
Add/remove - uninstall, restart, driver sweep, restart, install new drivers.
Ok so overnight it did pick up a P10504 in v7 (had stopped it at 91%), also completed a P5766 and a P7xxx WU.
Code:
16:09:19:WU00:FS00:0x11:
16:09:19:WU00:FS00:0x11:Project: 10504 (Run 418, Clone 0, Gen 594)
16:09:19:WU00:FS00:0x11:
16:09:19:WU00:FS00:0x11:Assembly optimizations on if available.
16:09:19:WU00:FS00:0x11:Entering M.D.
16:09:22:FS00:Finishing
16:09:24:WU00:FS00:0x11:Will resume from checkpoint file
16:09:24:WU00:FS00:0x11:Tpr hash 00/wudata_01.tpr: 340528376 1656272085 3166580509 3551419480 1039727158
16:09:24:WU00:FS00:0x11:
16:09:24:WU00:FS00:0x11:Calling fah_main args: 14 usage=100
16:09:24:WU00:FS00:0x11:
16:09:25:WU00:FS00:0x11:Working on Protein
16:09:26:WU00:FS00:0x11:Client config unavailable.
16:09:26:WU00:FS00:0x11:Resuming from checkpoint
16:09:26:WU00:FS00:0x11:Starting GUI Server
16:09:26:WU00:FS00:0x11:fcCheckPointResume: retreived and current tpr file hash:
16:09:26:WU00:FS00:0x11: 0 340528376 340528376
16:09:26:WU00:FS00:0x11: 1 1656272085 1656272085
16:09:26:WU00:FS00:0x11: 2 3166580509 3166580509
16:09:26:WU00:FS00:0x11: 3 3551419480 3551419480
16:09:26:WU00:FS00:0x11: 4 1039727158 1039727158
16:09:26:WU00:FS00:0x11:fcCheckPointResume: file hashes same.
16:09:26:WU00:FS00:0x11:fcCheckPointResume: state restored.
16:09:26:WU00:FS00:0x11:Verified 00/wudata_01.log
16:09:26:WU00:FS00:0x11:Verified 00/wudata_01.edr
16:09:27:WU00:FS00:0x11:Verified 00/wudata_01.xtc
16:09:27:WU00:FS00:0x11:Completed 91%
16:10:49:WU00:FS00:0x11:Completed 92%
16:12:13:WU00:FS00:0x11:Completed 93%
Also today in v6, "Project: 10503 (Run 61, Clone 0, Gen 194)" is running, but the client has hung. That has happened on a number of occasions: the client hangs and shows no frame steps, yet it completes and uploads the WU successfully and starts a new unit; it just doesn't show any of that in the terminal.
Code:
[15:43:27] Preparing to commence simulation
[15:43:27] - Looking at optimizations...
[15:43:27] - Files status OK
[15:43:27] - Expanded 62896 -> 336799 (decompressed 535.4 percent)
[15:43:27] Called DecompressByteArray: compressed_data_size=62896 data_size=336799, decompressed_data_size=336799 diff=0
[15:43:27] - Digital signature verified
[15:43:27]
[15:43:27] Project: 10503 (Run 61, Clone 0, Gen 194)
[15:43:27]
[15:43:27] Assembly optimizations on if available.
[15:43:27] Entering M.D.
It's been stuck there for about an hour, and the TPF for these units is usually under a minute according to the ones that have finished.
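To put a number on "stuck", the time per frame (TPF) can be read off the "Completed N%" timestamps and compared with how long the log has been silent. This is a rough sketch, assuming the v7 log line format and consecutive percentage lines; the function names and the factor of 10 are arbitrary choices for illustration, not anything the client defines. The sample lines are the P10504 progress lines from the v7 excerpt above.

```python
import re
from datetime import datetime

# Matches v7-style lines such as "16:09:27:WU00:FS00:0x11:Completed 91%".
COMPLETED_RE = re.compile(r"^(\d\d:\d\d:\d\d):.*Completed (\d+)%")


def frame_times(lines):
    """Return seconds elapsed between consecutive 'Completed N%' lines."""
    stamps = []
    for line in lines:
        m = COMPLETED_RE.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%H:%M:%S"))
    deltas = []
    for earlier, later in zip(stamps, stamps[1:]):
        seconds = (later - earlier).total_seconds()
        if seconds < 0:  # log crossed midnight
            seconds += 24 * 60 * 60
        deltas.append(seconds)
    return deltas


def looks_stalled(silence_seconds, tpf_seconds, factor=10):
    """Heuristic: silence much longer than one frame suggests a hang."""
    return silence_seconds > factor * tpf_seconds


log = [
    "16:09:27:WU00:FS00:0x11:Completed 91%",
    "16:10:49:WU00:FS00:0x11:Completed 92%",
    "16:12:13:WU00:FS00:0x11:Completed 93%",
]
deltas = frame_times(log)
tpf = sum(deltas) / len(deltas)
print(deltas)                    # [82.0, 84.0]
print(looks_stalled(3600, tpf))  # True: an hour of silence vs ~83 s per frame
```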
Maybe the P105xx just have issues with v6?
Re: Unstable Machine/Hanging P10501
Posted: Fri Jun 22, 2012 5:07 pm
by 7im
When you post a log, please include which specific GPU the log is from. You have so many, I can't keep track. Thanks.
Re: Unstable Machine/Hanging P10501
Posted: Fri Jun 22, 2012 9:07 pm
by GenDrexler
From here on all logs are for the 9800GTX - since that's the only system I'm currently gpu folding on.
Think I did mention that in a previous post, tho.
Sorry for the confusion.
Re: Unstable Machine/Hanging P10501
Posted: Sat Jun 23, 2012 4:38 am
by PantherX
GenDrexler wrote: From here on all logs are for the 9800GTX - since that's the only system I'm currently gpu folding on...
Are you attempting to run v6 and V7 on the same machine simultaneously or was the log from v6 posted before you upgraded to V7 and then posted the log (hence running only V7 on the machine)?
Re: Unstable Machine/Hanging P10501
Posted: Sat Jun 23, 2012 4:54 pm
by GenDrexler
One client at a time; I switched to v7 to see if it would make a difference.
Ok so in v7 today I got this:
Code:
16:09:18:FS00:Unpaused
16:09:18:WU00:FS00:Starting
16:09:18:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Foldermon/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/G80/Core_11.fah/FahCore_11.exe -dir 00 -suffix 01 -version 701 -lifeline 4940 -checkpoint 15 -gpu 0
16:09:18:WU00:FS00:Started FahCore on PID 4656
16:09:18:WU00:FS00:Core PID:5044
16:09:18:WU00:FS00:FahCore 0x11 started
16:09:19:WU00:FS00:0x11:
16:09:19:WU00:FS00:0x11:*------------------------------*
16:09:19:WU00:FS00:0x11:Folding@Home GPU Core
16:09:19:WU00:FS00:0x11:Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
16:09:19:WU00:FS00:0x11:
16:09:19:WU00:FS00:0x11:Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
16:09:19:WU00:FS00:0x11:Build host: amoeba
16:09:19:WU00:FS00:0x11:Board Type: Nvidia
16:09:19:WU00:FS00:0x11:Core :
16:09:19:WU00:FS00:0x11:Preparing to commence simulation
16:09:19:WU00:FS00:0x11:- Looking at optimizations...
16:09:19:WU00:FS00:0x11:- Files status OK
16:09:19:WU00:FS00:0x11:- Expanded 62814 -> 336799 (decompressed 536.1 percent)
16:09:19:WU00:FS00:0x11:Called DecompressByteArray: compressed_data_size=62814 data_size=336799, decompressed_data_size=336799 diff=0
16:09:19:WU00:FS00:0x11:- Digital signature verified
16:09:19:WU00:FS00:0x11:
16:09:19:WU00:FS00:0x11:Project: 10504 (Run 418, Clone 0, Gen 594)
16:09:19:WU00:FS00:0x11:
16:09:19:WU00:FS00:0x11:Assembly optimizations on if available.
16:09:19:WU00:FS00:0x11:Entering M.D.
16:09:22:FS00:Finishing
16:09:24:WU00:FS00:0x11:Will resume from checkpoint file
16:09:24:WU00:FS00:0x11:Tpr hash 00/wudata_01.tpr: 340528376 1656272085 3166580509 3551419480 1039727158
16:09:24:WU00:FS00:0x11:
16:09:24:WU00:FS00:0x11:Calling fah_main args: 14 usage=100
16:09:24:WU00:FS00:0x11:
16:09:25:WU00:FS00:0x11:Working on Protein
16:09:26:WU00:FS00:0x11:Client config unavailable.
16:09:26:WU00:FS00:0x11:Resuming from checkpoint
16:09:26:WU00:FS00:0x11:Starting GUI Server
16:09:26:WU00:FS00:0x11:fcCheckPointResume: retreived and current tpr file hash:
16:09:26:WU00:FS00:0x11: 0 340528376 340528376
16:09:26:WU00:FS00:0x11: 1 1656272085 1656272085
16:09:26:WU00:FS00:0x11: 2 3166580509 3166580509
16:09:26:WU00:FS00:0x11: 3 3551419480 3551419480
16:09:26:WU00:FS00:0x11: 4 1039727158 1039727158
16:09:26:WU00:FS00:0x11:fcCheckPointResume: file hashes same.
16:09:26:WU00:FS00:0x11:fcCheckPointResume: state restored.
16:09:26:WU00:FS00:0x11:Verified 00/wudata_01.log
16:09:26:WU00:FS00:0x11:Verified 00/wudata_01.edr
16:09:27:WU00:FS00:0x11:Verified 00/wudata_01.xtc
16:09:27:WU00:FS00:0x11:Completed 91%
16:10:49:WU00:FS00:0x11:Completed 92%
16:12:13:WU00:FS00:0x11:Completed 93%
16:13:37:WU00:FS00:0x11:Completed 94%
16:15:01:WU00:FS00:0x11:Completed 95%
16:16:25:WU00:FS00:0x11:Completed 96%
16:17:49:WU00:FS00:0x11:Completed 97%
16:19:13:WU00:FS00:0x11:Completed 98%
16:20:38:WU00:FS00:0x11:Completed 99%
16:22:03:WU00:FS00:0x11:Completed 100%
16:22:05:WU00:FS00:0x11:Successful run
16:22:05:WU00:FS00:0x11:DynamicWrapper: Finished Work Unit: sleep=10000
16:22:15:WU00:FS00:0x11:Reserved 109504 bytes for xtc file; Cosm status=0
16:22:15:WU00:FS00:0x11:Allocated 109504 bytes for xtc file
16:22:15:WU00:FS00:0x11:- Reading up to 109504 from "00/wudata_01.xtc": Read 109504
16:22:15:WU00:FS00:0x11:Read 109504 bytes from xtc file; available packet space=786320960
16:22:15:WU00:FS00:0x11:xtc file hash check passed.
16:22:15:WU00:FS00:0x11:Reserved 21912 21912 786320960 bytes for arc file=<00/wudata_01.trr> Cosm status=0
16:22:15:WU00:FS00:0x11:Allocated 21912 bytes for arc file
16:22:15:WU00:FS00:0x11:- Reading up to 21912 from "00/wudata_01.trr": Read 21912
16:22:15:WU00:FS00:0x11:Read 21912 bytes from arc file; available packet space=786299048
16:22:15:WU00:FS00:0x11:trr file hash check passed.
16:22:15:WU00:FS00:0x11:Allocated 560 bytes for edr file
16:22:15:WU00:FS00:0x11:Read bedfile
16:22:15:WU00:FS00:0x11:edr file hash check passed.
16:22:15:WU00:FS00:0x11:Logfile not read.
16:22:15:WU00:FS00:0x11:GuardedRun: success in DynamicWrapper
16:22:15:WU00:FS00:0x11:GuardedRun: done
16:22:15:WU00:FS00:0x11:Run: GuardedRun completed.
16:22:18:WU00:FS00:0x11:+ Opened results file
16:22:18:WU00:FS00:0x11:- Writing 132488 bytes of core data to disk...
16:22:19:WU00:FS00:0x11:Done: 131976 -> 131015 (compressed to 99.2 percent)
16:22:19:WU00:FS00:0x11: ... Done.
16:22:19:WU00:FS00:0x11:DeleteFrameFiles: successfully deleted file=00/wudata_01.ckp
16:22:19:WU00:FS00:0x11:Shutting down core
16:22:19:WU00:FS00:0x11:
16:22:19:WU00:FS00:0x11:Folding@home Core Shutdown: FINISHED_UNIT
16:22:19:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
16:22:19:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:10504 run:418 clone:0 gen:594 core:0x11 unit:0x000005c66652eda54b76d40f0000109e
16:22:19:WU00:FS00:Uploading 128.44KiB to 171.67.108.21
16:22:19:WU00:FS00:Connecting to 171.67.108.21:8080
16:22:24:WU00:FS00:Upload complete
16:22:24:WU00:FS00:Server responded WORK_ACK (400)
16:22:24:WU00:FS00:Cleaning up
******************************** Date: 23/06/12 ********************************
05:31:34:WU00:FS00:Connecting to assign-GPU.stanford.edu:80
05:31:35:WU00:FS00:News: Welcome to Folding@Home
05:31:35:WU00:FS00:Assigned to work server 171.67.108.21
05:31:36:WU00:FS00:Requesting new work unit for slot 00: READY gpu:0:"G92 [GeForce 9800 GTX]" from 171.67.108.21
05:31:36:WU00:FS00:Connecting to 171.67.108.21:8080
05:31:36:WU00:FS00:Downloading 61.99KiB
05:31:38:WU00:FS00:Download complete
05:31:38:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:OK project:10503 run:52 clone:1 gen:274 core:0x11 unit:0x000002876652eda54b7169d200001451
05:31:39:WU00:FS00:Starting
05:31:39:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Foldermon/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/G80/Core_11.fah/FahCore_11.exe -dir 00 -suffix 01 -version 701 -lifeline 4940 -checkpoint 15 -gpu 0
05:31:39:WU00:FS00:Started FahCore on PID 5904
05:31:40:WU00:FS00:Core PID:3896
05:31:40:WU00:FS00:FahCore 0x11 started
05:31:42:WU00:FS00:0x11:
05:31:42:WU00:FS00:0x11:*------------------------------*
05:31:42:WU00:FS00:0x11:Folding@Home GPU Core
05:31:42:WU00:FS00:0x11:Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
05:31:42:WU00:FS00:0x11:
05:31:42:WU00:FS00:0x11:Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
05:31:42:WU00:FS00:0x11:Build host: amoeba
05:31:42:WU00:FS00:0x11:Board Type: Nvidia
05:31:42:WU00:FS00:0x11:Core :
05:31:42:WU00:FS00:0x11:Preparing to commence simulation
05:31:42:WU00:FS00:0x11:- Looking at optimizations...
05:31:42:WU00:FS00:0x11:DeleteFrameFiles: successfully deleted file=00/wudata_01.ckp
05:31:42:WU00:FS00:0x11:- Created dyn
05:31:42:WU00:FS00:0x11:- Files status OK
05:31:42:WU00:FS00:0x11:- Expanded 62969 -> 336799 (decompressed 534.8 percent)
05:31:42:WU00:FS00:0x11:Called DecompressByteArray: compressed_data_size=62969 data_size=336799, decompressed_data_size=336799 diff=0
05:31:42:WU00:FS00:0x11:- Digital signature verified
05:31:42:WU00:FS00:0x11:
05:31:42:WU00:FS00:0x11:Project: 10503 (Run 52, Clone 1, Gen 274)
05:31:42:WU00:FS00:0x11:
05:31:42:WU00:FS00:0x11:Assembly optimizations on if available.
05:31:42:WU00:FS00:0x11:Entering M.D.
05:31:48:WU00:FS00:0x11:Tpr hash 00/wudata_01.tpr: 2796742070 1452992016 1609237867 668796229 1314925904
05:31:48:WU00:FS00:0x11:
05:31:48:WU00:FS00:0x11:Calling fah_main args: 14 usage=100
05:31:48:WU00:FS00:0x11:
05:31:54:WU00:FS00:0x11:Working on Protein
05:31:54:WU00:FS00:0x11:mdrun_gpu returned
05:31:54:WU00:FS00:0x11:Going to send back what have done -- stepsTotalG=0
05:31:54:WU00:FS00:0x11:Work fraction=0.0000 steps=0.
05:31:59:WU00:FS00:0x11:logfile size=1245 infoLength=1245 edr=0 trr=25
05:31:59:WU00:FS00:0x11:+ Opened results file
05:31:59:WU00:FS00:0x11:- Writing 1783 bytes of core data to disk...
05:31:59:WU00:FS00:0x11:Done: 1271 -> 719 (compressed to 56.5 percent)
05:31:59:WU00:FS00:0x11: ... Done.
05:31:59:WU00:FS00:0x11:DeleteFrameFiles: successfully deleted file=00/wudata_01.ckp
05:31:59:WU00:FS00:0x11:
05:31:59:WU00:FS00:0x11:Folding@home Core Shutdown: UNSTABLE_MACHINE
You can see where it continued the P10504, which completed successfully, and then it downloaded a P10503, which resulted in an "unstable machine".
So v6 and v7 are getting the same thing with P10503.
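The per-project pattern can be checked mechanically by pairing each "Project:" line with the next "Core Shutdown:" line. Below is a small sketch, assuming only the two line shapes that appear in the logs in this thread (v6 and v7 prefixes both work because the patterns are searched anywhere in the line); `tally_outcomes` is a hypothetical helper, not part of any client.

```python
import re
from collections import Counter

PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")


def tally_outcomes(lines):
    """Count (project, shutdown status) pairs found in a client log."""
    outcomes = Counter()
    current = None  # project number of the unit currently being run
    for line in lines:
        m = PROJECT_RE.search(line)
        if m:
            current = int(m.group(1))
            continue
        m = SHUTDOWN_RE.search(line)
        if m and current is not None:
            outcomes[(current, m.group(1))] += 1
            current = None
    return outcomes


# Lines copied from the v6 and v7 logs posted in this thread.
log = """\
[05:01:49] Project: 10503 (Run 61, Clone 0, Gen 194)
[05:02:10] Folding@home Core Shutdown: UNSTABLE_MACHINE
16:09:19:WU00:FS00:0x11:Project: 10504 (Run 418, Clone 0, Gen 594)
16:22:19:WU00:FS00:0x11:Folding@home Core Shutdown: FINISHED_UNIT
05:31:42:WU00:FS00:0x11:Project: 10503 (Run 52, Clone 1, Gen 274)
05:31:59:WU00:FS00:0x11:Folding@home Core Shutdown: UNSTABLE_MACHINE
""".splitlines()

for (project, status), count in sorted(tally_outcomes(log).items()):
    print(project, status, count)
# 10503 UNSTABLE_MACHINE 2
# 10504 FINISHED_UNIT 1
```

Run over a full FAHlog.txt, the same tally would show whether failures are confined to the P105xx projects or spread across others as well.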