2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 3:02 am
by schapman1978
I'm noticing some unusual activity on my second GPU slot tonight. It was working on a WU from project 14564 when I noticed this. Not sure if it's WU dependent or what. I have a pair of 2080 Tis folding, and the first one's copy activity and CUDA usage are pretty flat and level, as you can see in the images. The second one loads up for a few seconds, then drops off, then repeats over and over. I can hear the light coil whine, then it stops, then restarts. It's not unusual to hear this, but when I started looking at the card activity more closely, I saw these patterns. There is no NVLink bridge between them and they are not set up in SLI. They are both holding steady clocks, and the second card does this whether it's stock, underclocked, or overclocked. It will grind on like this indefinitely, but I'm trying to understand why it's loading up and dropping off, then repeating. It causes a spike of about 125-150W each time. It's strange to me. Any thoughts? I've reinstalled the latest drivers, restarted the client, and left the OpenCL options at default.

Sorry for the zoomed in size - Imgur did something weird or I got the code wrong to link it right.

Image
Image

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 4:00 am
by PantherX
It seems that your GPU is being starved of CPU cycles. I would suggest that you pause whatever is causing your CPU to hit 100% usage and see whether the issue goes away.

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 4:24 am
by MeeLee
Interesting.
There's always some up and down motion, as the GPU writes to VRAM, and waits for PCIE data.
Your first GPU shows this.
Are you running PCIE Gen 2.0 on your motherboard (i.e. a board running DDR3 RAM)? GPU-Z can tell you.
If so, you'll need to configure your system to run at x8 speeds on both GPUs. Running below this might starve the GPU of PCIE bandwidth.
If you're running PCIE 3.0, you'll preferably have x8, but x4 speed should work as well.

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 6:10 am
by Rel25917
Do the dips coincide with every 1% or so of the workunit? Could just be a dip while it does its checkpoints.

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 7:32 am
by schapman1978
I paused the CPU entirely in FAH, and also changed it from 28 threads to 26, 16, etc., and it still occurs. If the CPU is fully paused in FAH the valleys are much more shallow. It’s an AMD 3950X fwiw. With CPU folding paused, it usually sits at about 1-3% handling background tasks.

I’m also running 32GB of DDR4-3600 at stock CL16 timings, and the two 2080 Tis run natively at x8/x8 on this X570 board. I have a PCIe 4.0 M.2 drive in slot one, but I’m wondering if having a second M.2 in the second slot might be shorting bandwidth to card 2 for some reason. It’s a Gen 3 PCIe M.2 and it’s run by the X570 chipset, so it shouldn’t, since the CPU dedicates 4 lanes to the chipset, but I’m willing to pop it out and see if you think that makes sense. I keep wondering if the second M.2 is making the second GPU slot drop to x4 or something.

I’ll check in bios if it’s posting 8/8 when I get up.

And unfortunately the constant dips aren’t coinciding with each 1% of progress. I wish I could crunch stuff that fast lol... this is pretty rhythmic, every 5-7 seconds or so I’d guess from memory.

Good ideas - we’re thinking similarly. Open to anything.

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 7:41 am
by bruce
The frequency of the checkpoints is defined by the Project Owner.

FAH runs what is called a "sanity check" which gives the analysis a chance to be aborted if the WU is, in fact, unstable. The actual GPU process is suspended briefly when this is processed. It is probably synchronized with the checkpoints, but I'm not sure if that's always true.

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 7:45 am
by PantherX
Get GPU-Z and see what the PCIe utilization is and also the speed that it is operating at from within the OS.
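For anyone who prefers a command line to GPU-Z, here is a minimal sketch of the same check using `nvidia-smi`, which ships with the NVIDIA driver. It reads the current PCIe link generation and width plus GPU utilization and power draw for each card. The helper names `parse_gpu_csv` and `query_gpus` are made up for this example, and it assumes `nvidia-smi` is on the PATH. Read the values while the card is folding, since the link generation can drop when the GPU is idle.

```python
import subprocess

# Field names accepted by `nvidia-smi --query-gpu` (see `nvidia-smi --help-query-gpu`).
FIELDS = [
    "index",
    "name",
    "pcie.link.gen.current",
    "pcie.link.width.current",
    "utilization.gpu",
    "power.draw",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip any line that does not match the expected layout
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (hypothetical helper)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` on the folding machine should return one entry per card. A `pcie.link.width.current` of 8 on both would match an x8/x8 setup.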

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 7:54 am
by schapman1978
Well, I'm up anyway for other reasons - it's still doing it - but in a twist - it's now doing it on the physical first slot card - not the second one. I also screenshotted my GPUz screens showing them at 8x/8x - I guess that defeats my possible 2nd pcie m.2 bandwidth theft theory. Hmm...

Sorry for the delay. The pic sizes are terrible; I'm still working through the coding.
Image
Image

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 8:01 am
by uyaem
Are the GPUs working on different projects now, could it be a project-specific "glitch" (intentional or not)?

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 8:11 am
by schapman1978
It also appears that both GPUs can do it at the same time. I wonder if it's just the sanity check happening every 5 or so seconds and might be normal? I've always heard the intermittent breaks in the cards' noise, even in single or double configuration, but never investigated because I assumed it was normal behavior. My checkpoints are set at 5m manually but this is like clockwork every 5 seconds. Looks like I finished a unit and picked up another - they're both now folding a piece of 14564 and both are dipping like that. I wonder if it's just the project?

Image
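To put a number on the "like clockwork every 5 seconds" impression, here is a rough sketch that estimates the spacing between dips from logged utilization samples. It assumes you have a list of (time in seconds, utilization %) pairs from any logger, such as a GPU-Z sensor log. The function names and the 80% threshold are assumptions for illustration and are not part of FAH.

```python
from statistics import median


def find_dip_starts(samples, threshold):
    """Return the timestamps where utilization first falls below threshold.

    samples: list of (time_seconds, utilization_percent), ordered by time.
    """
    starts = []
    below = False
    for t, util in samples:
        if util < threshold and not below:
            starts.append(t)  # a new dip begins at this sample
        below = util < threshold
    return starts


def dip_period(samples, threshold=80.0):
    """Median spacing in seconds between dips, or None with fewer than two dips."""
    starts = find_dip_starts(samples, threshold)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return median(gaps) if gaps else None
```

If the spacing stays constant no matter how far along the WU is, that points to something on a fixed timer rather than something tied to progress.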

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 8:15 am
by PantherX
For GPUs, the value you configure for checkpoints is ignored. However, for the CPU, the checkpoint value applies. In the case of GPU checkpoints, the researcher sets the checkpoint interval which can vary from 2% to 5% IIRC.

The drops every 5 seconds are weird. Can you try pausing GPU01 and observing GPU02? Then pause GPU02, unpause GPU01, and observe what happens. I would observe each attempt for 5 minutes.
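A quick back-of-the-envelope check of why 5-second dips are unlikely to be checkpoints, assuming the 2% to 5% checkpoint range mentioned above is right. The 2-hour WU runtime is a made-up figure for illustration only, so plug in your own estimate.

```python
def checkpoint_spacing_seconds(wu_runtime_seconds, checkpoint_percent):
    """Seconds between checkpoints if one is written every checkpoint_percent of the WU."""
    return wu_runtime_seconds * checkpoint_percent / 100.0


def runtime_for_spacing(spacing_seconds, checkpoint_percent):
    """Total WU runtime that would produce the given checkpoint spacing."""
    return spacing_seconds * 100.0 / checkpoint_percent


wu_runtime = 2 * 60 * 60  # hypothetical 2-hour WU, not a measured value
for pct in (2, 5):
    spacing = checkpoint_spacing_seconds(wu_runtime, pct)
    needed = runtime_for_spacing(5, pct)
    print(f"{pct}% checkpoints: one every {spacing:.0f} s; "
          f"a 5 s spacing would need a {needed:.0f} s WU")
```

With those assumed numbers, checkpoints land minutes apart, and a WU would have to finish in a few minutes for checkpoints to show up every 5 seconds.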

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 8:25 am
by schapman1978
Good idea - I did a short 30 seconds of pausing 1 gpu and also pausing the cpu too with it so only 1 gpu was folding at a time (each scenario for about 30 seconds.) Then I repeated that test of the other GPU running with the CPU, then with the cpu paused. It exhibits the same spiking behavior. The only difference, regardless of which GPU is running this unit, is that if the CPU is totally paused (not reduced core usage but fully paused) either or both cards ramp up a few % points for usage and the dips become significantly less severe. But they always happen on time like a metronome - I think it might be programmed to run this way. Not sure. I'll try a longer sample test but I expect this behavior to persist.

I'm also going to reboot everything and see if it replicates. I'm only paying attention because I just jumped to Win10 Pro from Win10 Home tonight before I went to bed. OS swaps always make me paranoid at first. Too bad I can't fire up a VM and run it in Ubuntu or something to see if it's the same.

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 8:35 am
by PantherX
I am running Windows 10 Pro 1909 64-bit and this is what it looks like:
Image

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 8:41 am
by schapman1978
Right on - I'm going to let these two chunks finish, wait until I get a new project, and see if it's just this particular project. Both my GPUs are folding 14564 and still exhibiting this behavior after a reboot. I've had other projects where it was a nice level line (with minor variations like yours). If a different project doesn't replicate this issue, maybe I should post something in that "problems with a particular WU" thread? I don't know if it's a problem or not by normal standards. It appears it's going to finish them, but maybe this optimization is causing it to take a lot longer than if it wasn't faceplanting both GPUs every 5 seconds. I'm not a programmer - I'm just thinking out loud.

Here's a GPUz sensors shot showing info for both cards, with task manager in the same shot - it does this with the GPUs clocked up some or at default settings. They just fold slower and a little quieter at default clocks, but the spikes are still there.
Image

Re: 2nd GPU Spiking Up and Down

Posted: Sat Apr 25, 2020 9:26 am
by schapman1978
Yeah, I realized it's doing the exact same thing on another workstation I'm folding on here that's a single 2080 setup. Identical behavior with or without the CPU running, same as this machine. I went ahead and put a thread up in the WU section for the owner to take a peek at. I just got my 6th chunk of 14564 and it's doing the same thing. So far, it's been on
(1440, 0, 1)
(1251, 0, 2)
(341, 0, 2)
(1318, 0, 1)
(745, 0, 3)
(225, 0, 4)

Link to that thread here viewtopic.php?f=19&t=34797