GPU utilization drops regularily
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 128
- Joined: Wed Feb 01, 2017 7:07 pm
GPU utilization drops regularily
I'm using 7.4.4 and a GTX 1070 on Win10.
There are utilization drops which occur very often, like every few minutes. Is there a reason behind this?
I worry a bit about steep temp gradients which appear every few minutes on the GPU.
Is there a way I can stop this or at least make it happen less frequently? Like a config setting?
If those drops are necessary (maybe because only the CPU is used for some check) and GPU calculation cannot continue during those checks, I would prefer that you calculate some garbage on the GPU in the meantime to keep GPU utilization up and therefore temps at the same level.
Re: GPU utilization drops regularily
There are two possible things that happen periodically:
* A FAHCore suspends processing long enough to write checkpoint data to disk (via whatever cache you have)
* A GPU FAHCore does a sanity check periodically to make sure the simulation hasn't encountered certain types of errors. It uses the CPU to check the results produced up to that point by the GPU.
For GPU cores (at least), the frequency of these events is set by the scientist in the configuration of the WU.
In my experience, they often seem to happen at the same time, but I'm not sure that's true in all cases.
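To illustrate why the two events can appear to happen at the same time, here is a minimal sketch. It assumes each event fires on its own fixed step interval; the interval values and the function name `event_schedule` are made up for illustration and are not taken from any real WU configuration or FAHCore code.

```python
def event_schedule(total_steps, checkpoint_every, sanity_every):
    """Return {step: [events]} for every step at which the GPU would pause."""
    schedule = {}
    for step in range(1, total_steps + 1):
        events = []
        if step % checkpoint_every == 0:
            events.append("checkpoint")
        if step % sanity_every == 0:
            events.append("sanity_check")
        if events:
            schedule[step] = events
    return schedule


# Hypothetical intervals, chosen only for illustration.
schedule = event_schedule(total_steps=100, checkpoint_every=20, sanity_every=10)

# Steps at which both events fall together.
coinciding = [step for step, events in schedule.items() if len(events) == 2]
print(coinciding)
```

If one interval is a multiple of the other, every checkpoint lands on a sanity check, which would match the observation that they often coincide.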
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 128
- Joined: Wed Feb 01, 2017 7:07 pm
Re: GPU utilization drops regularily
Thanks again for your fast answer.
I doubt it is the checkpoint data which causes the GPU to stop for a 'long' time like this. I use an SSD and a few MB should be written very fast.
Why can't sanity checks be done in parallel to GPU folding? Why does GPU folding have to be stopped?
I would be happy if sanity checks could be done in parallel to GPU folding, then I wouldn't need to worry about my GPU and processing of WUs would be even faster.
I haven't measured but I guess that the processing time of a medium sized WU could be reduced by a few minutes which would result in higher overall GPU yields.
-
- Posts: 10179
- Joined: Thu Nov 29, 2007 4:30 pm
- Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
- Location: Arizona
Re: GPU utilization drops regularily
The GPU usage charts I have seen have a square, saw-toothed shape. Nature of the beast. How long is "long", and how often? Except for the checkpoints every 5 frames, the temp shouldn't fluctuate much. How large are the temp changes you are seeing?
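One way to answer "how long" and "how often" is to log utilization once per second and measure the dips. The sketch below assumes you already have samples as (seconds, utilization percent) pairs, for example collected with `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1`. The sample data, the 50% threshold, and the function names are invented for illustration.

```python
def find_dips(samples, threshold=50):
    """samples: list of (seconds, utilization_percent), sorted by time.

    Returns a list of (start, end) pairs, one per run of samples below
    the threshold; 'end' is the time of the first sample back above it.
    """
    dips = []
    start = None
    for t, util in samples:
        if util < threshold and start is None:
            start = t                      # dip begins
        elif util >= threshold and start is not None:
            dips.append((start, t))        # dip ends
            start = None
    if start is not None:                  # log ended in the middle of a dip
        dips.append((start, samples[-1][0]))
    return dips


def summarize(dips):
    """Return (durations, intervals between dip starts), both in seconds."""
    durations = [end - start for start, end in dips]
    intervals = [b[0] - a[0] for a, b in zip(dips, dips[1:])]
    return durations, intervals


# Synthetic data: 95% load with a 5-second dip to 10% every 15 seconds.
samples = [(t, 10 if t % 15 >= 10 else 95) for t in range(35)]
dips = find_dips(samples)
durations, intervals = summarize(dips)
print(dips, durations, intervals)
```

Posting the resulting durations and intervals along with the temperature range makes it much easier for others to judge whether the pattern is unusual.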
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
-
- Posts: 2040
- Joined: Sat Dec 01, 2012 3:43 pm
- Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441
Re: GPU utilization drops regularily
foldinghomealone wrote: I would rather prefer that you calculate some garbage with the GPU meanwhile to keep GPU utilization up and therefore temps at the same level.
I don't get it, what is the problem when GPU temps do not stay at the full load level during checkpoints?
-
- Posts: 230
- Joined: Mon Dec 12, 2016 4:06 am
Re: GPU utilization drops regularily
If you mean something like this:
That's perfectly normal.
-
- Posts: 128
- Joined: Wed Feb 01, 2017 7:07 pm
Re: GPU utilization drops regularily
ComputerGenie wrote: That's perfectly normal.
That's a perfect waste of time
Currently I'm folding Project 9151 (7, 21, 607). Utilization drops every 4 frames for about 5 sec. The temperature drops by about 14-18 K each time.
https://ibb.co/d4Eu6F
On other WUs I can hear every time when a checkpoint is reached because temps drop much further and therefore the fans almost stop.
I would prefer constant temps for the durability of the GPU.
And I would prefer that whatever causes the utilization drops be done in parallel with the GPU computing.
5 sec every 4 frames means that the WU takes about 2 min longer, or around 2%, than (in my opinion) necessary.
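The arithmetic behind that estimate can be spelled out as a rough sketch. It assumes a WU of 100 frames and a total run time of about 100 minutes (the run time is only implied by the 2% figure, it was not measured); the function name is made up for illustration.

```python
def checkpoint_overhead(frames, frames_per_drop, seconds_per_drop, wu_minutes):
    """Return (number of drops, seconds lost, fraction of WU run time lost)."""
    drops = frames // frames_per_drop
    lost_seconds = drops * seconds_per_drop
    fraction = lost_seconds / (wu_minutes * 60)
    return drops, lost_seconds, fraction


# Assumed values: 100 frames per WU, ~100 minutes per WU (not measured).
drops, lost_seconds, fraction = checkpoint_overhead(
    frames=100, frames_per_drop=4, seconds_per_drop=5, wu_minutes=100
)
print(f"{drops} drops, {lost_seconds} s lost, {fraction:.1%} of the WU")
```

With these inputs that is 25 drops and 125 seconds, a little over 2 minutes. A longer WU with the same drop pattern would lose the same number of seconds but a smaller percentage.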
What is the reason to stop GPU computing to write checkpoints or make sanity checks?
Last edited by foldinghomealone on Mon Feb 20, 2017 9:51 pm, edited 1 time in total.
-
- Posts: 230
- Joined: Mon Dec 12, 2016 4:06 am
Re: GPU utilization drops regularily
ComputerGenie wrote: That's perfectly normal.
foldinghomealone wrote: That's a perfect waste of time
...
As you can see, every project is going to be different. Just relax and let it do what it does.
P.S. - if you prefer the software to act differently than it's designed to act, then, perhaps, you should get on the team and get involved in a rewrite.
-
- Posts: 230
- Joined: Mon Dec 12, 2016 4:06 am
Re: GPU utilization drops regularily
foldinghomealone wrote: ...I would prefer constant temps for durability of GPU...
That statement doesn't match reality. Permanently sustained high temps lower durability and longevity.
-
- Posts: 128
- Joined: Wed Feb 01, 2017 7:07 pm
Re: GPU utilization drops regularily
foldinghomealone wrote: ...I would prefer constant temps for durability of GPU...
ComputerGenie wrote: That statement doesn't match reality. Permanently sustained high temps lower durability and longevity.
-
- Posts: 128
- Joined: Wed Feb 01, 2017 7:07 pm
Re: GPU utilization drops regularily
ComputerGenie wrote: P.S. - if you prefer the software to act differently than it's designed to act, then, perhaps, you should get on the team and get involved in a rewrite.
I don't demand things but I see room for optimization
Re: GPU utilization drops regularily
There's absolutely nothing unusual about the first and third GPU. It's perfectly reasonable to assume that the dips you see are repeated many times at equal intervals but outside of the field of view.
In the middle image, a periodic pattern is harder to discern. There is no reason to be bothered by a variation in the height/width of individual pulses. If the cache is mostly empty, it will look different than if the cache is mostly full of data that needs to sync to disk (e.g. the third pulse from the left compared to the first two).
What's important here is that the total time each GPU is waiting on the hard disk adds up to almost nothing. (That's why running FAH on an SSD is only a little bit faster than running it on an HD.)
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 128
- Joined: Wed Feb 01, 2017 7:07 pm
Re: GPU utilization drops regularily
bruce wrote: What's important here is that the total time each GPU is waiting on the HardDisk adds up to almost nothing. (That's why running FAH on a SSD is only a little bit faster than running it on a HD.)
Still, each "almost nothing" adds up, in total, to a >2% longer processing time for each WU.
For sure, a 2% performance increase is nothing compared to waiting for the next GPU generation.
-
- Site Admin
- Posts: 7990
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
- Location: W. MA
Re: GPU utilization drops regularily
foldinghomealone wrote: What is the reason to stop GPU computing to write checkpoints or make sanity checks?
To write a checkpoint, the data structures describing the current state of the WU being processed need to be in a consistent and static state while they are written out. Continuing to compute would not allow that. My assumption is that the calculations for the sanity checks need that same static, consistent set of data structures.
As for the necessity of doing sanity checks, that was found to be needed with the computational results on consumer level GPU cards. They are not optimized for stable numerical calculations like the "Pro" series of cards sold explicitly for that purpose.
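As a toy illustration of the "consistent and static state" requirement, the sketch below takes one snapshot in the middle of a simulation step and one after the step has finished. It is not how any FAHCore is implemented; the `Simulation` class, its fields, and the consistency rule are all invented for the example.

```python
class Simulation:
    """Toy model: particles moving at constant velocity, one unit of time per step."""

    def __init__(self):
        self.step = 0
        self.positions = [0.0, 0.0]
        self.velocities = [1.0, 2.0]

    def step_in_phases(self):
        """Advance one step in two phases so a snapshot can land between them."""
        self.positions = [p + v for p, v in zip(self.positions, self.velocities)]
        yield "positions updated"
        self.step += 1
        yield "step counter updated"

    def snapshot(self):
        """Copy the state that a checkpoint would write out."""
        return {"step": self.step, "positions": list(self.positions)}


def is_consistent(snap, velocities):
    # With constant velocities, positions must equal step * velocity.
    return snap["positions"] == [snap["step"] * v for v in velocities]


sim = Simulation()
phases = sim.step_in_phases()

next(phases)                   # positions moved, step counter not yet updated
mid_step = sim.snapshot()      # snapshot taken while the state is still changing

next(phases)                   # step finished
after_step = sim.snapshot()    # snapshot taken while the state is static

print(is_consistent(mid_step, sim.velocities))    # False
print(is_consistent(after_step, sim.velocities))  # True
```

The mid-step snapshot pairs new positions with the old step counter, so restarting from it would apply the same step twice. Pausing the computation is the simplest way to rule that out; copying the state to a second buffer and writing it out in the background is another option in principle, but whether the cores could do that is not something this thread establishes.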