Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 11:06 am
by v00d00
These are the ones I've found so far, but I suspect it's dumped many more. Reading through the frankly schizophrenic log format is hard on the eyes, so I gave up after two files' worth.
Project: 8900 (Run 153, Clone 0, Gen 70)
Project: 8900 (Run 647, Clone 1, Gen 64)
Hardware is a stock GTX 460 running on Fedora 18. The fan works fine, temperatures are within limits, and it's processing (and completing) other projects fine, just not this one.
Some log excerpts.
Code:
18:50:46:WU02:FS00:0x17:Project: 8900 (Run 647, Clone 1, Gen 64)
18:50:46:WU02:FS00:0x17:Unit: 0x0000005d028c126651a6b68dc17573b0
18:50:46:WU02:FS00:0x17:CPU: 0x00000000000000000000000000000000
18:50:46:WU02:FS00:0x17:Machine: 0
18:50:46:WU02:FS00:0x17:Reading tar file state.xml
18:50:46:WU02:FS00:0x17:Reading tar file system.xml
18:50:47:WU02:FS00:0x17:Reading tar file integrator.xml
18:50:47:WU02:FS00:0x17:Reading tar file core.xml
18:50:47:WU02:FS00:0x17:Digital signatures verified
18:50:51:WU01:FS00:Upload 25.08%
18:50:57:WU01:FS00:Upload 38.54%
18:51:03:WU01:FS00:Upload 52.61%
18:51:09:WU01:FS00:Upload 66.07%
18:51:15:WU01:FS00:Upload 79.53%
18:51:21:WU01:FS00:Upload 92.99%
18:51:24:WU01:FS00:Upload complete
18:51:24:WU01:FS00:Server responded WORK_ACK (400)
18:51:24:WU01:FS00:Cleaning up
18:53:24:WU02:FS00:0x17:Completed 0 out of 2500000 steps (0%)
19:06:19:WU02:FS00:0x17:Completed 25000 out of 2500000 steps (1%)
19:18:58:WU02:FS00:0x17:Completed 50000 out of 2500000 steps (2%)
19:31:52:WU02:FS00:0x17:Completed 75000 out of 2500000 steps (3%)
19:44:32:WU02:FS00:0x17:Completed 100000 out of 2500000 steps (4%)
20:07:00:WU02:FS00:0x17:Completed 125000 out of 2500000 steps (5%)
20:32:01:WU02:FS00:0x17:Completed 150000 out of 2500000 steps (6%)
20:32:23:WU02:FS00:0x17:Bad State detected... attempting to resume from last good checkpoint
20:46:41:WU02:FS00:0x17:Completed 125000 out of 2500000 steps (5%)
21:11:43:WU02:FS00:0x17:Completed 150000 out of 2500000 steps (6%)
21:12:01:WU02:FS00:0x17:Bad State detected... attempting to resume from last good checkpoint
21:24:41:WU02:FS00:0x17:Completed 125000 out of 2500000 steps (5%)
18:50:33:WU01:FS00:0x17:Saving result file badStateForceGroup2_25409491Ref.xml
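The loop in the excerpt above (6% reached, bad state, back to 5%) is easy to miss by eye. A minimal Python sketch that flags it, assuming core progress lines always have the `HH:MM:SS:WUnn:FSnn:0xNN:Completed N out of M steps` layout shown above; `find_rollbacks` is a made-up helper name, not part of any FAH tool:

```python
import re

# Matches core progress lines such as
# "20:32:01:WU02:FS00:0x17:Completed 150000 out of 2500000 steps (6%)"
PROGRESS = re.compile(
    r"^(\d\d:\d\d:\d\d):(WU\d+):(FS\d+):0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps"
)


def find_rollbacks(lines):
    """Return (time, wu, previous_step, new_step) for each backwards jump."""
    last_step = {}  # (wu, slot) -> last step count seen
    rollbacks = []
    for line in lines:
        m = PROGRESS.match(line)
        if not m:
            continue  # not a progress line
        time, wu, slot, step, _total = m.groups()
        step = int(step)
        key = (wu, slot)
        if key in last_step and step < last_step[key]:
            rollbacks.append((time, wu, last_step[key], step))
        last_step[key] = step
    return rollbacks


# Lines copied from the excerpt above.
sample = [
    "20:32:01:WU02:FS00:0x17:Completed 150000 out of 2500000 steps (6%)",
    "20:32:23:WU02:FS00:0x17:Bad State detected... attempting to resume from last good checkpoint",
    "20:46:41:WU02:FS00:0x17:Completed 125000 out of 2500000 steps (5%)",
    "21:11:43:WU02:FS00:0x17:Completed 150000 out of 2500000 steps (6%)",
    "21:12:01:WU02:FS00:0x17:Bad State detected... attempting to resume from last good checkpoint",
    "21:24:41:WU02:FS00:0x17:Completed 125000 out of 2500000 steps (5%)",
]

for time, wu, before, after in find_rollbacks(sample):
    print(f"{time} {wu}: fell back from step {before} to {after}")
```

One caveat: the client reuses WU numbers, so a fresh work unit that starts at step 0 under an old WU number would also show up as a "rollback" in a long log.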
Each time it dumps one it generates this text, but with different end numbers.
Code:
18:48:39:WU01:FS00:0x17:Saving result file badStateCheckpoint_1216173900
18:48:41:WU01:FS00:0x17:Saving result file badStateCheckpoint_1536125667
18:48:45:WU01:FS00:0x17:Saving result file badStateCheckpoint_25409491
18:48:48:WU01:FS00:0x17:Saving result file badStateForceGroup0_1216173900Core.xml
18:48:54:WU01:FS00:0x17:Saving result file badStateForceGroup0_1216173900Ref.xml
18:49:00:WU01:FS00:0x17:Saving result file badStateForceGroup0_1536125667Core.xml
18:49:05:WU01:FS00:0x17:Saving result file badStateForceGroup0_1536125667Ref.xml
18:49:10:WU01:FS00:0x17:Saving result file badStateForceGroup0_25409491Core.xml
18:49:16:WU01:FS00:0x17:Saving result file badStateForceGroup0_25409491Ref.xml
18:49:21:WU01:FS00:0x17:Saving result file badStateForceGroup1_1216173900Core.xml
18:49:28:WU01:FS00:0x17:Saving result file badStateForceGroup1_1216173900Ref.xml
18:49:34:WU01:FS00:0x17:Saving result file badStateForceGroup1_1536125667Core.xml
18:49:42:WU01:FS00:0x17:Saving result file badStateForceGroup1_1536125667Ref.xml
18:49:47:WU01:FS00:0x17:Saving result file badStateForceGroup1_25409491Core.xml
18:49:56:WU01:FS00:0x17:Saving result file badStateForceGroup1_25409491Ref.xml
18:50:02:WU01:FS00:0x17:Saving result file badStateForceGroup2_1216173900Core.xml
18:50:08:WU01:FS00:0x17:Saving result file badStateForceGroup2_1216173900Ref.xml
18:50:15:WU01:FS00:0x17:Saving result file badStateForceGroup2_1536125667Core.xml
18:50:21:WU01:FS00:0x17:Saving result file badStateForceGroup2_1536125667Ref.xml
18:50:27:WU01:FS00:0x17:Saving result file badStateForceGroup2_25409491Core.xml
It would be a lot easier if the web client didn't dump its debug info into the main log file. I suppose I can get rid of it by dropping verbosity, but then I might lose other useful info. It would be nicer if the web client had its own webclient.log, and if we could go back to having a separate log file for each slot, because at present the log is a bit of a nightmare to read.
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 12:45 pm
by 7im
Please note the options to filter the log by slot # or WU # in FAHControl.
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 12:47 pm
by PantherX
Project: 8900 (Run 153, Clone 0, Gen 70) -> 2 failures and 1 success.
Project: 8900 (Run 647, Clone 1, Gen 64) -> 1 failure and 1 success.
What do you mean by "other projects fine, just not this one"? Do you mean only these WUs, or all WUs from Project 8900? Are you successfully folding Project 7810 and 7811 on your GPU?
Since the current log is a combination of all slots, have you tried the log filter in Advanced Control (AKA FAHControl)? It does make the log easier to read, but it is limited to the current log and does not cover previous logs.
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 12:51 pm
by Joe_H
v00d00 wrote: It would be a lot easier if the web client didn't dump its debug info into the main log file. I suppose I can get rid of it by dropping verbosity, but then I might lose other useful info.
If you have changed the verbosity level to higher than the default of 3, it is strongly recommended that it be changed back. A higher verbosity level gives little useful information and gets in the way of debugging problems.
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 1:06 pm
by rickoic
You can see what each of your folding units is writing to the log file by ticking the box next to Unit: and then picking the unit from the little down arrow just to the right of it.
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 5:06 pm
by bruce
v00d00 wrote: It would be a lot easier if the web client didn't dump its debug info into the main log file. I suppose I can get rid of it by dropping verbosity, but then I might lose other useful info. It would be nicer if the web client had its own webclient.log, and if we could go back to having a separate log file for each slot, because at present the log is a bit of a nightmare to read.
I understand your desire (and personally agree with it), but I suspect it's not going to work. The FAHClient and each FahCore are developed independently. The V7 verbosity setting applies to FAHClient. I know of no way to reduce the verbosity of the FahCore -- but if you find one, be sure to let us know.
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 5:52 pm
by v00d00
I know, Bruce. It won't happen. I guess I could write a parser to separate out the useful stuff.
@Joe_H, I used to run verbosity 9 on every client since v3, so I just set this one to max like I normally did. It also used to be a requirement for those participating in beta, in case things went wrong and the extra info was needed. If you are saying it generates sufficient info at verbosity 3 to debug work units and problems in the client, then I will reduce it.
@Panther, yes, I am completing and have completed several 7810/7811 WUs without any problems. I have found 4 completed in the logs I've already looked at, and I've seen others in the last couple of days. But I am dumping every P8900 with those bad state errors.
@7im/rickoic, I don't actually have FAHControl. The only things installed are FAHClient and FAHCoreWrapper. I've used it on Windows a couple of times, but it wasn't all that useful, except I guess for looking at log files. I'm more of a console guy. But we've had that conversation before. For now I just wrote a bash script to give me the info I need. I'll hack together something better when I have some free time.
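For anyone else running without FAHControl, here is a minimal Python sketch of that kind of per-WU / per-slot filter. It is not the bash script mentioned above, just an illustration. It assumes every line of interest starts with `HH:MM:SS:`, optionally followed by a level tag, then `WUnn:FSnn:`. The `WARNING:` tag appears in client logs; treating `ERROR:` the same way is an assumption. `filter_log` is a made-up name.

```python
import re

# "18:50:51:WU01:FS00:Upload 25.08%"           -> WU01, FS00
# "01:22:43:WARNING:WU01:FS00:FahCore returned" -> WU01, FS00
# The ERROR tag is assumed to sit in the same position as WARNING.
LINE = re.compile(r"^\d\d:\d\d:\d\d:(?:(?:WARNING|ERROR):)?(WU\d+):(FS\d+):")


def filter_log(lines, wu=None, slot=None):
    """Yield only the lines for the given work unit and/or folding slot."""
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # client-wide line with no WU/FS tag
        if wu is not None and m.group(1) != wu:
            continue
        if slot is not None and m.group(2) != slot:
            continue
        yield line


# Interleaved lines of the kind that make the combined log hard to read.
sample = [
    "18:50:47:WU02:FS00:0x17:Digital signatures verified",
    "18:50:51:WU01:FS00:Upload 25.08%",
    "18:51:24:WU01:FS00:Upload complete",
    "18:53:24:WU02:FS00:0x17:Completed 0 out of 2500000 steps (0%)",
]

for line in filter_log(sample, wu="WU02"):
    print(line)
```

To run it over an old log, pass an open file instead of the sample list, e.g. `filter_log(open(path), wu="WU02")` with `line.rstrip()` when printing; the path to the log depends on your install.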
I have generated a dump of all the errored P8900s, just as an FYI. All were sent back with the error: FAULTY
Code:
P8900 (Run518, Clone 1, Gen 78)
P8900 (Run732, Clone 1, Gen 87)
P8900 (Run303, Clone 1, Gen 79)
P8900 (Run399, Clone 0, Gen 79)
P8900 (Run465, Clone 0, Gen 84)
P8900 (Run153, Clone 0, Gen 70)
P8900 (Run647, Clone 1, Gen 64)
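A list like this can be pulled out of the log automatically. A minimal Python sketch, assuming the client reports each returned unit on a single `Sending unit results: ... error:FAULTY project:N run:N clone:N gen:N` line (the layout V7 client logs use); `faulty_units` is a made-up helper name:

```python
import re

# Pulls the error code and PRCG out of a "Sending unit results" line.
SEND = re.compile(
    r"Sending unit results: .*?error:(\w+) project:(\d+) run:(\d+) "
    r"clone:(\d+) gen:(\d+)"
)


def faulty_units(lines):
    """Return 'P.. (Run .., Clone .., Gen ..)' for each unit sent back FAULTY."""
    found = []
    for line in lines:
        m = SEND.search(line)
        if m and m.group(1) == "FAULTY":
            project, run, clone, gen = m.group(2, 3, 4, 5)
            found.append(f"P{project} (Run {run}, Clone {clone}, Gen {gen})")
    return found


# Two real V7 client lines; only the second is a "Sending unit results" line.
sample = [
    "01:22:43:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
    "01:22:44:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY "
    "project:5769 run:6 clone:384 gen:2830 core:0x11 "
    "unit:0x7f129ca350d116d40b0e018000061689",
]

for unit in faulty_units(sample):
    print(unit)
```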
[edit]
I've decided to drop the GPU off beta/advanced for now, and I'll retry it in a month or two.
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 6:11 pm
by P5-133XL
P8900 (Run518, Clone 1, Gen 78) was successfully completed by someone else.
P8900 (Run732, Clone 1, Gen 87) was successfully completed by someone else.
P8900 (Run303, Clone 1, Gen 79) was successfully completed by someone else.
P8900 (Run399, Clone 0, Gen 79) was successfully completed by someone else.
P8900 (Run465, Clone 0, Gen 84) was successfully completed by someone else.
P8900 (Run153, Clone 0, Gen 70) was successfully completed by someone else.
P8900 (Run647, Clone 1, Gen 64) was successfully completed by someone else.
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 6:21 pm
by v00d00
That's cool.
My hardware seems to just hate those work units. There's always some project it hates. I can't win.
C'est la vie.
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 6:39 pm
by bruce
v00d00 wrote: I know, Bruce. It won't happen. I guess I could write a parser to separate out the useful stuff.
@Joe_H, I used to run verbosity 9 on every client since v3, so I just set this one to max like I normally did. It also used to be a requirement for those participating in beta, in case things went wrong and the extra info was needed. If you are saying it generates sufficient info at verbosity 3 to debug work units and problems in the client, then I will reduce it.
The verbosity 9 setting in V6 passed a parameter to the FahCores which increased the verbosity of some of the FahCores. That was sometimes useful for the Cores that were involved in beta at the time and that could be helpful in debugging the core and/or WU. You're asking to reduce the verbosity of a FahCore.
Also (as I said) the V7 verbosity=N setting only applies to messages from FAHClient, not from the FahCore. Getting more messages from the client has not been found to be useful.
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 01, 2013 8:11 pm
by PantherX
Is your GTX 460 running at default frequencies, or does it have a factory overclock? If it has a factory overclock, you may have to flash your GPU to the default frequencies and see if it happens again.
Is there sufficient VRAM on your system? I ask because Project 8900 is currently the largest one for FahCore_17, so that might be causing the issue. Is your PSU powerful enough to handle the GPU load?
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Fri Aug 02, 2013 7:08 pm
by v00d00
@Panther, default stock clocks, 1 GB RAM (maybe this is the problem). I never considered the possibility that it may need more than 1 GB, so I will add buying a new GPU to my list of things. I suppose it was always going to happen that GPUs would need more RAM to process work units. What should I aim for, 2 GB or 3 GB? I'm in budget for maybe a GTX 580 with 3 GB, and I've found one for less than the price of a GTX 660, brand new I might add. Also, the PSU is an Antec HCG-750 and it's nowhere close to being taxed (at around 480 W currently).
@Bruce, thanks, that clears it up. I will get rid of the extra verbosity. What setting would you recommend? I have it on 3 currently; is it worth dropping to 2 or 1?
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Fri Aug 02, 2013 8:19 pm
by 7im
3 is the recommended verbosity setting (the default).
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 15, 2013 12:02 am
by PantherX
v00d00 wrote: @Panther, default stock clocks, 1 GB RAM (maybe this is the problem). I never considered the possibility that it may need more than 1 GB, so I will add buying a new GPU to my list of things. I suppose it was always going to happen that GPUs would need more RAM to process work units. What should I aim for, 2 GB or 3 GB? I'm in budget for maybe a GTX 580 with 3 GB, and I've found one for less than the price of a GTX 660, brand new I might add...
The VRAM requirement depends on how you use your GPU. If it is dedicated to folding and nothing else runs on the system, 1 GB would be fine for now, but I would settle for 2 GB just to be sure. However, if you are running other applications, that might cause an issue: an increasing number of applications now use the GPU, so you may be hitting a VRAM limit without even knowing it unless you monitor VRAM usage in real time. One problem is that each application thinks that it "owns" the GPU and that all the VRAM is available to it, which causes issues when other applications are using it too. Since there isn't a concept of swap space for GPUs yet, unexpected behavior may result in this situation.
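On a console-only Linux box, one way to watch VRAM is to poll `nvidia-smi`. A minimal Python sketch, assuming the installed driver ships `nvidia-smi` with the `--query-gpu` option and actually reports memory figures for the card (on some GeForce card and driver combinations it may print "[Not Supported]" instead); `parse_vram` and `read_vram` are made-up helper names:

```python
import subprocess


def parse_vram(csv_text):
    """Parse 'used, total' pairs in MiB, one line per GPU."""
    readings = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue  # e.g. a card or driver that does not report memory
        readings.append((int(parts[0]), int(parts[1])))
    return readings


def read_vram():
    """Ask nvidia-smi for current VRAM usage; needs the NVIDIA driver tools."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram(out)


# Parsing a made-up reading for a single 1 GB card (illustrative values only).
for used, total in parse_vram("512, 1024\n"):
    print(f"VRAM: {used} / {total} MiB ({100 * used // total}% used)")
```

Calling `read_vram()` in a loop with a short `time.sleep` between calls while a work unit runs would show whether usage creeps toward the card's limit before a failure.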
I did encounter the VRAM issue with my GTX 260 SOC 896 MB, and FahCore_11 returned an error that didn't even relate to the lack of VRAM. I noticed it because I was monitoring my VRAM in real time, so I was able to reproduce it and confirm that insufficient VRAM causes issues. Here's the error in case you were wondering:
Code:
01:22:33:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/PantherX/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/G80/Core_11.fah/FahCore_11.exe -dir 01 -suffix 01 -version 702 -lifeline 3564 -checkpoint 15 -gpu 0
01:22:33:WU01:FS00:Started FahCore on PID 5140
01:22:33:WU01:FS00:Core PID:4644
01:22:33:WU01:FS00:FahCore 0x11 started
01:22:33:WU01:FS00:0x11:
01:22:33:WU01:FS00:0x11:*------------------------------*
01:22:33:WU01:FS00:0x11:Folding@Home GPU Core
01:22:33:WU01:FS00:0x11:Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
01:22:33:WU01:FS00:0x11:
01:22:33:WU01:FS00:0x11:Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
01:22:33:WU01:FS00:0x11:Build host: amoeba
01:22:33:WU01:FS00:0x11:Board Type: Nvidia
01:22:33:WU01:FS00:0x11:Core :
01:22:33:WU01:FS00:0x11:Preparing to commence simulation
01:22:33:WU01:FS00:0x11:- Looking at optimizations...
01:22:33:WU01:FS00:0x11:DeleteFrameFiles: successfully deleted file=01/wudata_01.ckp
01:22:33:WU01:FS00:0x11:- Created dyn
01:22:33:WU01:FS00:0x11:- Files status OK
01:22:33:WU01:FS00:0x11:- Expanded 45476 -> 251112 (decompressed 552.1 percent)
01:22:33:WU01:FS00:0x11:Called DecompressByteArray: compressed_data_size=45476 data_size=251112, decompressed_data_size=251112 diff=0
01:22:33:WU01:FS00:0x11:- Digital signature verified
01:22:33:WU01:FS00:0x11:
01:22:33:WU01:FS00:0x11:Project: 5769 (Run 6, Clone 384, Gen 2830)
01:22:33:WU01:FS00:0x11:
01:22:33:WU01:FS00:0x11:Assembly optimizations on if available.
01:22:33:WU01:FS00:0x11:Entering M.D.
01:22:38:WU02:FS00:Upload complete
01:22:38:WU02:FS00:Server responded WORK_ACK (400)
01:22:38:WU02:FS00:Cleaning up
01:22:39:WU01:FS00:0x11:Tpr hash 01/wudata_01.tpr: 1586263460 1068737893 995456632 1673198755 1994931254
01:22:39:WU01:FS00:0x11:
01:22:39:WU01:FS00:0x11:Calling fah_main args: 14 usage=100
01:22:39:WU01:FS00:0x11:
01:22:39:WU01:FS00:0x11:mdrun_gpu returned
01:22:39:WU01:FS00:0x11:Going to send back what have done -- stepsTotalG=0
01:22:39:WU01:FS00:0x11:Work fraction=0.0000 steps=0.
01:22:43:WU01:FS00:0x11:logfile size=4947 infoLength=4947 edr=0 trr=25
01:22:43:WU01:FS00:0x11:+ Opened results file
01:22:43:WU01:FS00:0x11:- Writing 5485 bytes of core data to disk...
01:22:43:WU01:FS00:0x11:Done: 4973 -> 1848 (compressed to 37.1 percent)
01:22:43:WU01:FS00:0x11: ... Done.
01:22:43:WU01:FS00:0x11:DeleteFrameFiles: successfully deleted file=01/wudata_01.ckp
01:22:43:WU01:FS00:0x11:
01:22:43:WU01:FS00:0x11:Folding@home Core Shutdown: UNSTABLE_MACHINE
01:22:43:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
01:22:44:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:5769 run:6 clone:384 gen:2830 core:0x11 unit:0x7f129ca350d116d40b0e018000061689
Re: Project: 8900 (Run 153, Clone 0, Gen 70)
Posted: Thu Aug 15, 2013 12:42 am
by bruce
Few people remember the hassles of RAM management before WinXP. Win3.x did have a paging file, so it was better than older OSs, but it still had fragmentation issues and you got strange errors if you ever filled up virtual memory. Managing GPU VRAM is a lot more like MS-DOS than even Windows 3.x. Stream computing is still in its infancy. FAH's GPU code is written to use very little VRAM, so as PX suggests, you probably won't be wishing you had more VRAM if you're just running FAH plus a monitor that's not outfitted with a massive number of pixels. That changes if you have multiple monitors and/or very high resolution screen settings.