Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 8:18 pm
by FaaR
It doesn't progress at all on one of my PCs, not a single step completed despite running on all CPU cores for an extended period of time (20-30 mins, maybe; the CPU is fully loaded to 100% on all 4 cores.) I refresh the log - no change.
Input 'preciated.

Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 8:28 pm
by 7im
As always, we'd like to see the log file.
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 8:37 pm
by Joe_H
The WU's from Project 8583 are somewhat larger than many other WU's. They are not a project that gets assigned to my OS X systems, but having processed similar sized WU's I would give it more time. I would expect something like this to take at least 30-45 minutes to process a frame (1%) on my iMac with 2.8 GHz i7.
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 8:52 pm
by FaaR
I would have posted the log, but since that client doesn't run on my work box (and thus has no easy way to post to message boards from) and the log didn't say anything special at the time of my original post (just the usual startup stuff, and then stopping on the line "Completed 0 out of xxx steps etc etc"), I perhaps unwisely thought there wasn't much point in bothering.
Anyway, I restarted my PC, and now after resuming, it has actually completed some steps I see.
Isn't it supposed to write to the log whenever it checkpoint-saves? It doesn't seem to be doing that. And why does it keep deleting the log and starting over all the time...? There used to be a lot more stuff from earlier today.
Code:
*********************** Log Started 2013-06-21T20:32:58Z ***********************
20:32:58:WU00:FS00:Starting
20:32:58:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Lenny/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a3.fah/FahCore_a3.exe -dir 00 -suffix 01 -version 703 -lifeline 4344 -checkpoint 5 -np 7
20:32:58:WU00:FS00:Started FahCore on PID 2580
20:32:58:WU00:FS00:Core PID:4972
20:32:58:WU00:FS00:FahCore 0xa3 started
20:32:58:WU00:FS00:0xa3:
20:32:58:WU00:FS00:0xa3:*------------------------------*
20:32:58:WU00:FS00:0xa3:Folding@Home Gromacs SMP Core
20:32:58:WU00:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
20:32:58:WU00:FS00:0xa3:
20:32:58:WU00:FS00:0xa3:Preparing to commence simulation
20:32:58:WU00:FS00:0xa3:- Looking at optimizations...
20:32:58:WU00:FS00:0xa3:- Files status OK
20:32:58:WU00:FS00:0xa3:- Expanded 3849128 -> 4382484 (decompressed 113.8 percent)
20:32:58:WU00:FS00:0xa3:Called DecompressByteArray: compressed_data_size=3849128 data_size=4382484, decompressed_data_size=4382484 diff=0
20:32:58:WU00:FS00:0xa3:- Digital signature verified
20:32:58:WU00:FS00:0xa3:
20:32:58:WU00:FS00:0xa3:Project: 8583 (Run 0, Clone 3, Gen 77)
20:32:58:WU00:FS00:0xa3:
20:32:58:WU00:FS00:0xa3:Assembly optimizations on if available.
20:32:58:WU00:FS00:0xa3:Entering M.D.
20:33:05:WU00:FS00:0xa3:Using Gromacs checkpoints
20:33:05:WU00:FS00:0xa3:Mapping NT from 7 to 7
20:33:21:WU00:FS00:0xa3:Resuming from checkpoint
20:33:21:WU00:FS00:0xa3:Verified 00/wudata_01.log
20:33:21:WU00:FS00:0xa3:Verified 00/wudata_01.trr
20:33:21:WU00:FS00:0xa3:Verified 00/wudata_01.edr
20:33:22:WU00:FS00:0xa3:Completed 110 out of 500000 steps (0%)
20:45:01:FS00:Paused
20:45:01:FS00:Shutting core down
20:45:06:WU00:FS00:0xa3:Client no longer detected. Shutting down core
20:45:06:WU00:FS00:0xa3:
20:45:06:WU00:FS00:0xa3:Folding@home Core Shutdown: CLIENT_DIED
20:45:07:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
20:45:31:FS00:Unpaused
20:45:31:WU00:FS00:Starting
20:45:31:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Lenny/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a3.fah/FahCore_a3.exe -dir 00 -suffix 01 -version 703 -lifeline 4344 -checkpoint 5 -np 7
20:45:31:WU00:FS00:Started FahCore on PID 4132
20:45:31:WU00:FS00:Core PID:1172
20:45:31:WU00:FS00:FahCore 0xa3 started
20:45:31:WU00:FS00:0xa3:
20:45:31:WU00:FS00:0xa3:*------------------------------*
20:45:31:WU00:FS00:0xa3:Folding@Home Gromacs SMP Core
20:45:31:WU00:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
20:45:31:WU00:FS00:0xa3:
20:45:31:WU00:FS00:0xa3:Preparing to commence simulation
20:45:31:WU00:FS00:0xa3:- Looking at optimizations...
20:45:31:WU00:FS00:0xa3:- Files status OK
20:45:31:WU00:FS00:0xa3:- Expanded 3849128 -> 4382484 (decompressed 113.8 percent)
20:45:31:WU00:FS00:0xa3:Called DecompressByteArray: compressed_data_size=3849128 data_size=4382484, decompressed_data_size=4382484 diff=0
20:45:31:WU00:FS00:0xa3:- Digital signature verified
20:45:31:WU00:FS00:0xa3:
20:45:31:WU00:FS00:0xa3:Project: 8583 (Run 0, Clone 3, Gen 77)
20:45:31:WU00:FS00:0xa3:
20:45:31:WU00:FS00:0xa3:Assembly optimizations on if available.
20:45:31:WU00:FS00:0xa3:Entering M.D.
20:45:37:WU00:FS00:0xa3:Using Gromacs checkpoints
20:45:37:WU00:FS00:0xa3:Mapping NT from 7 to 7
20:45:44:WU00:FS00:0xa3:Resuming from checkpoint
20:45:44:WU00:FS00:0xa3:Verified 00/wudata_01.log
20:45:44:WU00:FS00:0xa3:Verified 00/wudata_01.trr
20:45:44:WU00:FS00:0xa3:Verified 00/wudata_01.edr
20:45:56:WU00:FS00:0xa3:Completed 240 out of 500000 steps (0%)
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 9:22 pm
by Joe_H
A number of actions, such as changing the configuration, reinitialize the log window in FAHControl. Using the Refresh button near the log window will attempt to reload the log from the beginning. The log file is also present on your drive; you can open it to read it and copy its entries directly as well.
As for what is written in the log, usually you will see an entry for every percent completed. In the case of this WU, that is every 5000 steps completed. When checkpoints are written during processing is not recorded here, but the default is every 15 minutes. This varies, as some work cores write a checkpoint based on a criterion other than elapsed time. In your case I see that the checkpoint is to be written every 5 minutes.
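For anyone who wants to check progress from the log file on disk rather than the FAHControl window, here is a minimal Python sketch. It assumes the progress lines look exactly like the "Completed N out of M steps" lines in the log posted above; the regular expression and the function name `parse_progress` are illustrative only and are not part of the client.

```python
import re

# Matches core progress lines as they appear in the log posted above, e.g.
# 20:33:22:WU00:FS00:0xa3:Completed 110 out of 500000 steps (0%)
PROGRESS = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WU\d+:FS\d+:0x[0-9a-f]+:"
    r"Completed (\d+) out of (\d+) steps"
)

def parse_progress(lines):
    """Return (timestamp, steps_done, total_steps) for each progress line."""
    found = []
    for line in lines:
        match = PROGRESS.match(line.strip())
        if match:
            found.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return found

sample = [
    "20:33:22:WU00:FS00:0xa3:Completed 110 out of 500000 steps (0%)",
    "20:45:01:FS00:Paused",
    "20:45:56:WU00:FS00:0xa3:Completed 240 out of 500000 steps (0%)",
]

progress = parse_progress(sample)
# One log entry per percent means one entry per total_steps / 100 steps.
steps_per_percent = progress[0][2] // 100

print(progress)
print(steps_per_percent)
```

With 500000 total steps, one percent works out to the 5000 steps mentioned above, so a WU that has only reached a few hundred steps will not yet show a new percent line.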
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 10:13 pm
by FaaR
Ok, thank you so much for the help and information!

I'll just let it keep running then and hope for the best, although it still hasn't completed one single percent; ugh, what a monster WU this must be.
*Edit: still not convinced this WU isn't borked; I notice it's been folding for at least two hours now and not even ONE percent completed? That means 200+ hours (well over 8 days) until completion at this rate (and WUs seem pretty consistent from start until end from what I've been able to glean), that's just impossible I would think. I'm running this on a Core i7 (nehalem) at 3.4GHz, and while it's folding on twin GPUs also at the same time which leeches about 30-35% total CPU, nothing I've been given up until now has required nearly as much time to complete even with GPUs running simultaneously.
Can this really be right? 8 days!

(And it's only worth 1689 points...)
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 10:47 pm
by P5-133XL
Check what processes are using your CPU. Very long TPFs can be caused by an outside application using a full CPU core. Because of the highly synchronized nature of SMP folding, it runs at the speed of the slowest CPU core. When some outside application uses a CPU core, folding suspends the thread that uses that CPU core, and the other folding threads sit in a wait loop until the suspended thread comes back. Specifically, make sure that FAHControl hasn't started using a full CPU core all on its own (it occasionally does that and no one yet knows the cause). If it has, shut down FAHControl and restart it.
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 10:57 pm
by FaaR
According to task manager, fahcore_a3.exe uses about 74-75% CPU. 2x fahcore_16.exe uses about 9-16% each, it varies a bit up and down. Nothing else uses any appreciable amount of CPU; System hovers at >1% intermittently, but that seems rather insignificant and can't explain this monstrous slowness. It still hasn't completed even 1% by the way, I'm starting to lose hope a little bit...

Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 11:17 pm
by Napoleon
I'm fairly certain you'd be better off folding with just cpu:6. Pause the CPU slot, change slot type to cpu:6, add "extra-core-args" "-np 6" (without quotes) to the CPU slot in FahControl -> Configure -> Slots -> your CPU slot -> Edit -> Add and Finish the WU. Once finished, you can remove the extra-core-args thing.
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 11:21 pm
by P5-133XL
Could I see the config portion of the log? What CPU is it, and how many threads is the CPU slot allocated? I suspect that you may wish to drop the number of threads it uses by at least one, and more likely two. Those fahcore_16 processes are causing your fahcore_a3 process to choke.
By default, the SMP process allocates 100% of the CPU. However, each AMD video card wants a CPU core all to itself, causing the CPU folding to suspend two of its threads, and once that happens all the rest of the CPU threads just sit waiting for the return of the suspended threads, i.e. very long TPFs for SMP folding.
So I suggest that you manually modify the folding configuration for the CPU slot to use 6 threads (I think you are using a 4 core CPU + 4 hyper-threaded cores). You can do that from the advanced control->configure->slots tab->cpu slot and then explicitly tell it how many threads to use (total cores less 2 for the AMD slots). Save and restart.
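The rule of thumb above (total cores less one per AMD GPU slot) can be written out as a small Python sketch. This is only the arithmetic suggested in this thread, not the client's own detection logic; the function name `cpu_slot_threads` and the optional `reserved_for_other_work` parameter are made up for illustration.

```python
def cpu_slot_threads(logical_cpus, amd_gpu_slots, reserved_for_other_work=0):
    """Suggested thread count for the CPU slot.

    Leaves one logical CPU free per AMD GPU slot (each Core 16 process can
    need up to a whole core), plus any cores reserved for other work.
    Never returns fewer than 1.
    """
    threads = logical_cpus - amd_gpu_slots - reserved_for_other_work
    return max(threads, 1)

# 4 cores + hyper-threading = 8 logical CPUs, two AMD GPU slots
print(cpu_slot_threads(8, 2))  # 6
```

The log above shows the core started with -np 7, which matches 8 logical CPUs less one, so in this case only one further thread needs to be given up.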
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 11:26 pm
by Joe_H
Okay, the GPU folding did not show up in your log file extract. With 2 processes using Core 16, that would be for two AMD-ATI cards. Core 16 is known to require up to an entire CPU core to handle data going to and from the GPU for processing. So at a maximum you would want the SMP setting to be 6; if you need to keep some CPU available for other processes, a lower number could be used. The setting for SMP can be changed on the configuration page for the CPU slot in FAHControl. The default is -1, which allows the client to determine how many cores to use. That number is reduced by one when the client detects use of an ATI GPU. Ideally it should detect when you have two, but that appears not to have happened for you. If you only have one GPU, then 2 Core 16 processes indicates there is a problem.
P.S. Could you post the beginning of your log that shows the system information and also show the current configuration.
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 11:41 pm
by FaaR
Ok, thanks for the additional advice people. I have folded for a couple days previously with the default setup of cores (Edit: both CPU SMP & 2x GPU) without issue, but there does indeed seem to be some kind of perpetual race condition going on here, seeing as FAH had only completed 0.276% (1380/500000 steps) in over three hours' time when I tried Napoleon's trick a few minutes ago.
I'm unsure how to supply the config portion of the log, P5-133XL... I haven't folded for like two years or more and this client is entirely new to me. Anyway, it seems to be progressing nicely now though, it completed its first percent just a little while ago, so it might not be necessary for you guys to help me any more.

Btw, I leave it as an exercise to others who are more mathematically inclined to figure out how many years it would have taken to finish the WU at the original rate...
Thanks again for your help, it's been very useful!
Edit:
Yeah. Now we're rolling... 1.48% done, whoo!
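For the exercise mentioned above, a rough Python sketch of the projection. It assumes a constant rate and takes "over three hours" as exactly 3.0 hours, so the result is a lower bound on the projected time; it also ignores the restarts, which resumed from a checkpoint. The function name `projected_hours` is just for illustration.

```python
def projected_hours(steps_done, hours_elapsed, total_steps):
    """Total hours for the whole WU, assuming the observed rate stays constant."""
    steps_per_hour = steps_done / hours_elapsed
    return total_steps / steps_per_hour

# Figures from this thread: 1380 of 500000 steps after roughly 3 hours.
hours = projected_hours(1380, 3.0, 500_000)
fraction_done = 1380 / 500_000

print(f"{fraction_done:.3%} done")                    # 0.276% done
print(f"{hours:.0f} hours, {hours / 24:.1f} days")    # 1087 hours, 45.3 days
```

So it comes out at around a month and a half rather than years, which is still far beyond anything reasonable for a WU of this size.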
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Fri Jun 21, 2013 11:49 pm
by 7im
From the New Donors Start here section...
Log information for Version 7
Re: Is this a bad WU: project 8583 run 0 clone 3 gen 77
Posted: Sat Jun 22, 2013 7:50 pm
by Napoleon
FaaR wrote:Yeah. Now we're rolling... 1.48% done, whoo!
Apparently FahCore_17 / P8900 has been released to advanced without any fanfare, viewtopic.php?f=66&t=24337&start=105#p244401. So, if you set "client-type" to "advanced", you're likely to get core_17 work. That's a game changer for both NV and AMD GPUs. With it, NV GPUs take up one full CPU (NV OpenCL driver feature), whereas AMD GPUs get by with less. Also implements GPU QRB, so you'll probably get better PPD with P8900, too. You still have to stay prepared for AMD GPU core_16 work, though, for the foreseeable future.
Joys of OpenCL - one core to support them all (Fermi / HD 5xxx and newer GPUs)...
EDIT: P8900 / core_17 adv now officially announced - viewtopic.php?f=24&t=14714&start=180#p244414