Adding CPU cores to a CPU slot - restart of WU?
Posted: Thu Jun 13, 2013 8:51 pm
by [WHGT]Cyberman
I added a CPU slot today, thinking I'll let FAH have one or two cores, keeping the rest for the GPUs and whatever.
However, the ETA is now about 5 days - with a timeout deadline of about 6 days.
If I edited the CPU slot to add another core or two, would that speed things up and/or would it download a new WU, aborting this one?
Here's what the log says about my CPU and the WU:
FAH Log wrote:
15:30:25:******************************* System ********************************
15:30:25: CPU: AMD FX(tm)-6100 Six-Core Processor
15:30:25: CPU ID: AuthenticAMD Family 21 Model 1 Stepping 2
15:30:25: CPUs: 6
15:30:25: Memory: 15.95GiB
15:30:25: Free Memory: 12.54GiB
15:30:25: Threads: WINDOWS_THREADS
15:36:34:WU01:FS02:0xa4:Project: 7085 (Run 1, Clone 696, Gen 0)
Power Slider is set to full of course, and I didn't add any special settings. Time per frame is about 1:15 hours.
I don't mind leaving the PC running for 5 days, but only one day of margin is a bit close, IMO. Might be better to quit the WU early if it's not certain it'll finish in time. (I've set the core to "finish" so it shouldn't load a new WU in any case.)
OTOH, I could add more cores if it would speed things up?
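For reference, the ETA quoted above can be roughly reproduced from the time per frame. This is only a back-of-the-envelope sketch: it assumes the WU is split into 100 frames of 1% each and that the frame time stays constant, and the function name `eta_from_tpf` is made up for the illustration.

```python
from datetime import timedelta


def eta_from_tpf(tpf: timedelta, percent_done: float = 0.0, frames: int = 100) -> timedelta:
    """Estimate the remaining run time from the time per frame (TPF).

    Assumes `frames` equal-sized frames per WU (100 is an assumption here)
    and a constant TPF for the rest of the run.
    """
    remaining_frames = frames * (1.0 - percent_done / 100.0)
    return tpf * remaining_frames


# Figures from the post: TPF of about 1:15 hours, deadline of about 6 days.
tpf = timedelta(hours=1, minutes=15)
deadline = timedelta(days=6)

eta = eta_from_tpf(tpf)
margin = deadline - eta

print(f"ETA:    {eta.total_seconds() / 86400:.2f} days")
print(f"Margin: {margin.total_seconds() / 86400:.2f} days")
```

With these inputs the estimate comes out a little over 5 days, which leaves less than a day of margin before a 6-day deadline.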
Re: Adding CPU cores to a CPU slot - restart of WU?
Posted: Thu Jun 13, 2013 10:41 pm
by Zagen30
At worst it would restart the WU you're working on from the beginning; I really doubt it would dump it completely. I would pause the slot before making the change. Adding cores always speeds things up unless you have a GPU folding that uses significant CPU time AND you're not leaving one core free for each of those GPUs. So, in your case, if you have an AMD GPU that folds (or an Nvidia card and are running the new beta core), you should only increase it to 5 threads instead of 6, leaving one free for the GPU to use.
Re: Adding CPU cores to a CPU slot - restart of WU?
Posted: Fri Jun 14, 2013 1:26 am
by bruce
There's a published warning about changing the number of CPU cores. SOMETIMES when you change the number of cores, the active WU will restart; most of the time, however, it will resume work from the most recent checkpoint. (There's no guarantee either way.) I do change the number of cores sometimes, but I always consider both the possible risk and the possible benefit. If the WU is almost finished, don't risk it. If the potential loss from a restart is small, do it.
The Windows Task Manager counts 100% CPU as using all 6 cores, so allocating a single core should make it about 16-17% busy, plus whatever is being used by other things. If total usage varies between 16% and 33%, it would be logical to allocate 5 cores to FAH so that it varies between 84% and 100%. If it stays very close to 16% (or, with 5 cores allocated, to 84%), then it's probably worth adding one more core. The basic idea is that if the total allocated RARELY exceeds 100%, it's probably worth using it. If it's consistently over-allocated, SMP will slow down.
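The arithmetic behind that rule of thumb can be sketched as follows. This is only an illustration: the function names and the 0.1-core tolerance are assumptions made for the sketch, not anything the client itself exposes.

```python
import math


def cpu_share(cores_allocated: int, total_cores: int = 6) -> float:
    """Percentage Task Manager would show for fully busy allocated cores."""
    return 100.0 * cores_allocated / total_cores


def suggest_cores(observed_peak_percent: float,
                  currently_allocated: int = 1,
                  total_cores: int = 6,
                  tolerance: float = 0.1) -> int:
    """Suggest how many cores to give FAH so the total rarely exceeds 100%.

    observed_peak_percent: highest total CPU usage seen in Task Manager
    while FAH runs on `currently_allocated` cores.
    tolerance: other load below this many cores is ignored (assumed value).
    """
    # Convert the observed peak into "cores busy", then subtract FAH's own cores.
    other_cores = observed_peak_percent / 100.0 * total_cores - currently_allocated
    if other_cores < tolerance:
        return total_cores  # nothing else is competing noticeably
    return max(1, total_cores - math.ceil(other_cores))


print(round(cpu_share(1), 1))   # one core out of six
print(round(cpu_share(5), 1))   # five cores out of six
print(suggest_cores(33))        # peak usage just under 33% with one core folding
print(suggest_cores(17))        # usage stays very close to one core's share
```

A peak of about 33% with one core allocated means roughly one core's worth of other load, so 5 cores is the suggestion; a peak that stays near 17% suggests all 6 could be used.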
Re: Adding CPU cores to a CPU slot - restart of WU?
Posted: Fri Jun 14, 2013 10:55 am
by [WHGT]Cyberman
Well, in the end I decided to dump that WU and remove the CPU slot. Sorry.
Adding more cores was no problem, the WU continued from the last checkpoint.
However, there was no change in time or CPU usage. When I tried to restart the client, a GPU WU got corrupted and dumped.
I'll continue to fold only on the GPUs for now, maybe I'll try again some time later (although then I'll do the smart thing and finish the GPU WUs first...)
Thanks for the answers.
Re: Adding CPU cores to a CPU slot - restart of WU?
Posted: Fri Jun 14, 2013 5:54 pm
by bruce
When changing settings that apply to a single slot (CPUs, in this case) the V7 client does not require disturbing other slots (such as a GPU slot), just a Pause/Fold sequence for the CPU slot.
1) Please post the log showing what happened to the GPU slot, both before and after the restart. We may be able to isolate a bug in a GPU core restarting. See the "logs" directory for the previous log.
2) Which CPU WU were you running? Some of the older CPU assignments would only use one core, so increasing the number of cores wouldn't change anything until that WU was finished. How long did you let it run before you accepted the client's new estimates of PPD? Those estimates do not change quickly.
3) Whenever possible, avoid dumping WUs that can still be completed before the deadline. It's detrimental to the FAH projects and generally will count against your QRB qualification.
Re: Adding CPU cores to a CPU slot - restart of WU?
Posted: Fri Jun 14, 2013 6:57 pm
by [WHGT]Cyberman
I don't think there was any fault with the FAH cores, only my lack of patience...
bruce wrote: 2) Which CPU WU were you running? Some of the older CPU assignments would only use one core, so increasing the number of cores wouldn't change anything until that WU was finished.
From the log: 15:36:34:WU01:FS02:0xa4:Project: 7085 (Run 1, Clone 696, Gen 0)
How long did you let it run before you accepted the client's new estimates of PPD? Those estimates do not change quickly.
After looking at the log, I am embarrassed to say it was not even two full minutes. I would have sworn it was longer.
The CPU usage didn't change either, although I now think it may be for the same reason the ETA didn't change (i.e. no checkpoint or such).
bruce wrote: 1) Please post the log showing what happened to the GPU slot, both before and after the restart. We may be able to isolate a bug in a GPU core restarting. See the "logs" directory for the previous log.
I think these are the relevant sections; I can upload the whole log if you want:
Code:
04:40:45:FS00:Paused
04:40:45:FS00:Shutting core down
04:40:46:FS01:Paused
04:40:46:FS01:Shutting core down
04:40:49:FS02:Paused
04:40:49:FS02:Shutting core down
04:40:50:WU03:FS00:0x16:Client no longer detected. Shutting down core
04:40:50:WU03:FS00:0x16:
04:40:50:WU03:FS00:0x16:Folding@home Core Shutdown: CLIENT_DIED
04:40:51:WU03:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
04:40:51:WU01:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
04:40:53:WU02:FS01:0x16:Client no longer detected. Shutting down core
04:40:53:WU02:FS01:0x16:
04:40:53:WU02:FS01:0x16:Folding@home Core Shutdown: CLIENT_DIED
04:40:53:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
04:41:00:Clean exit
Code:
04:43:57:WU02:FS01:0x16:*------------------------------*
04:43:57:WU02:FS01:0x16:Folding@Home GPU Core
04:43:57:WU02:FS01:0x16:Version 2.11 (Thu Dec 9 15:00:14 PST 2010)
04:43:57:WU02:FS01:0x16:
04:43:57:WU02:FS01:0x16:Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
04:43:57:WU02:FS01:0x16:Build host: user-f6d030f24f
04:43:57:WU02:FS01:0x16:Board Type: AMD/OpenCL
04:43:57:WU02:FS01:0x16:Core : x=16
04:43:57:WU02:FS01:0x16: Window's signal control handler registered.
04:43:57:WU02:FS01:0x16:Preparing to commence simulation
04:43:57:WU02:FS01:0x16:- Looking at optimizations...
04:43:57:WU02:FS01:0x16:- Files status OK
04:43:57:WU02:FS01:0x16:sizeof(CORE_PACKET_HDR) = 512 file=<>
04:43:57:WU02:FS01:0x16:- Expanded 44987 -> 171163 (decompressed 380.4 percent)
04:43:57:WU02:FS01:0x16:Called DecompressByteArray: compressed_data_size=44987 data_size=171163, decompressed_data_size=171163 diff=0
04:43:57:WU02:FS01:0x16:- Digital signature verified
04:43:57:WU02:FS01:0x16:
04:43:57:WU02:FS01:0x16:Project: 11292 (Run 5, Clone 78, Gen 22)
04:43:57:WU02:FS01:0x16:
04:43:57:WU02:FS01:0x16:Assembly optimizations on if available.
04:43:57:WU02:FS01:0x16:Entering M.D.
04:43:59:WU03:FS00:0x16:Will resume from checkpoint file 03/wudata_01.ckp
04:43:59:WU03:FS00:0x16:Tpr hash 03/wudata_01.tpr: 1491281227 3189185972 451057843 870249368 1235014899
04:43:59:WU03:FS00:0x16:Working on ALZHEIMER DISEASE AMYLOID
04:43:59:WU03:FS00:0x16:Client config unavailable.
04:43:59:WU03:FS00:0x16:Starting GUI Server
04:43:59:WU02:FS01:0x16:Will resume from checkpoint file 02/wudata_01.ckp
04:43:59:WU02:FS01:0x16:Tpr hash 02/wudata_01.tpr: 936789823 3194741847 1091015416 3716709719 569462064
04:43:59:WU02:FS01:0x16:Working on ALZHEIMER DISEASE AMYLOID
04:43:59:WU02:FS01:0x16:Client config unavailable.
04:43:59:WU02:FS01:0x16:Starting GUI Server
04:44:02:WU02:FS01:0x16:Run: exception thrown in GuardedRun -- cannot continue further.
04:44:02:WU02:FS01:0x16:Going to send back what have done -- stepsTotalG=0
04:44:02:WU02:FS01:0x16:Work fraction=0.0000 steps=0.
04:44:02:WU03:FS00:0x16:Resuming from checkpoint
04:44:02:WU03:FS00:0x16:fcCheckPointResume: retreived and current tpr file hash:
04:44:02:WU03:FS00:0x16: 0 1491281227 1491281227
04:44:02:WU03:FS00:0x16: 1 3189185972 3189185972
04:44:02:WU03:FS00:0x16: 2 451057843 451057843
04:44:02:WU03:FS00:0x16: 3 870249368 870249368
04:44:02:WU03:FS00:0x16: 4 1235014899 1235014899
04:44:02:WU03:FS00:0x16:fcCheckPointResume: file hashes same.
04:44:02:WU03:FS00:0x16:fcCheckPointResume: state restored.
04:44:02:WU03:FS00:0x16:fcCheckPointResume: name 03/wudata_01.log Verified 03/wudata_01.log
04:44:02:WU03:FS00:0x16:fcCheckPointResume: name 03/wudata_01.trr Verified 03/wudata_01.trr
04:44:02:WU03:FS00:0x16:fcCheckPointResume: name 03/wudata_01.xtc Verified 03/wudata_01.xtc
04:44:02:WU03:FS00:0x16:fcCheckPointResume: name 03/wudata_01.edr Verified 03/wudata_01.edr
04:44:02:WU03:FS00:0x16:fcCheckPointResume: state restored 2
04:44:02:WU03:FS00:0x16:Resumed from checkpoint
04:44:02:WU03:FS00:0x16:Setting checkpoint frequency: 499998
04:44:02:WU03:FS00:0x16:Completed 24499903 out of 49999872 steps (48%).
04:44:02:WU03:FS00:0x16:Completed 24499938 out of 49999872 steps (49%).
04:44:03:WU01:FS02:0xa4:Using Gromacs checkpoints
04:44:03:WU01:FS02:0xa4:Mapping NT from 1 to 1
04:44:03:WU01:FS02:0xa4:Resuming from checkpoint
04:44:03:WU01:FS02:0xa4:Verified 01/wudata_01.log
04:44:03:WU01:FS02:0xa4:Verified 01/wudata_01.trr
04:44:03:WU01:FS02:0xa4:Verified 01/wudata_01.xtc
04:44:03:WU01:FS02:0xa4:Verified 01/wudata_01.edr
04:44:03:WU01:FS02:0xa4:Completed 1080971 out of 10000000 steps (10%)
04:44:06:WU02:FS01:0x16:logfile size=132729 infoLength=132729 edr=0 trr=23
04:44:06:WU02:FS01:0x16:+ Opened results file
04:44:06:WU02:FS01:0x16:- Writing 133265 bytes of core data to disk...
04:44:06:WU02:FS01:0x16:Done: 132753 -> 10349 (compressed to 7.7 percent)
04:44:06:WU02:FS01:0x16: ... Done.
04:44:06:WU02:FS01:0x16:DeleteFrameFiles: successfully deleted file=02/wudata_01.ckp
04:44:06:WU02:FS01:0x16:
04:44:06:WU02:FS01:0x16:Folding@home Core Shutdown: UNSTABLE_MACHINE
04:44:07:WARNING:WU02:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
04:44:07:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:11292 run:5 clone:78 gen:22 core:0x16 unit:0x000006c96652edbc4d096dac506c3456
04:44:07:WU02:FS01:Uploading 10.61KiB to 171.67.108.44
04:44:07:WU02:FS01:Connecting to 171.67.108.44:8080
04:44:07:WU00:FS01:Connecting to assign-GPU.stanford.edu:80
04:44:07:WU02:FS01:Upload complete
04:44:08:WU02:FS01:Server responded WORK_ACK (400)
04:44:08:WU02:FS01:Cleaning up
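For anyone skimming a long log like the one above, the "FahCore returned" lines are the quickest way to see which slot failed. Below is a minimal sketch that pulls them out with a regular expression. The pattern is inferred only from the lines in this excerpt, and treating INTERRUPTED as the expected result of a client shutdown is an assumption; the helper name `core_returns` is made up for the example.

```python
import re

# Pattern inferred from the log excerpt above; other client versions may differ.
RETURN_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?:WARNING:)?"
    r"(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<status>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)"
)


def core_returns(lines):
    """Yield (time, wu, slot, status, code) for every 'FahCore returned' line."""
    for line in lines:
        m = RETURN_RE.match(line.strip())
        if m:
            yield (m["time"], m["wu"], m["slot"], m["status"], int(m["code"]))


# Sample lines copied from the excerpt above.
sample = [
    "04:40:51:WU03:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "04:40:51:WU01:FS02:FahCore returned: INTERRUPTED (102 = 0x66)",
    "04:40:53:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)",
    "04:44:02:WU03:FS00:0x16:Resumed from checkpoint",
    "04:44:07:WARNING:WU02:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
]

# Assumption: INTERRUPTED is normal when the client is shut down.
failures = [r for r in core_returns(sample) if r[3] != "INTERRUPTED"]
for time, wu, slot, status, code in failures:
    print(f"{time} {slot} {wu}: {status} ({code})")
```

To run it against a real log, pass the open file object instead of `sample`, since the function only iterates over lines.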