Power optimization and number of cores

arisu · Post by **arisu** » Mon May 05, 2025 5:46 am

There's a nonlinear increase in power consumption as CPU frequency rises: power grows roughly in proportion to frequency cubed, or faster. To reduce the energy used per CPU instruction, it's generally better to increase the number of CPU cores and decrease their frequency to stay within the package's power limits. In other words, one core at 4 GHz and two cores at 2 GHz will perform the same number of computations in the same period of time (assuming the work parallelizes perfectly), but the former will use significantly more power to do so.
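As a rough illustration of the argument above, here is a minimal sketch. It assumes per-core power scales with the cube of frequency and that throughput scales perfectly with cores times frequency; the function names, the exponent and the arbitrary units are illustrative assumptions, not measurements.

```python
def relative_power(cores, freq_ghz, exponent=3.0):
    """Package power in arbitrary units, assuming per-core power ~ freq**exponent."""
    return cores * freq_ghz ** exponent


def relative_throughput(cores, freq_ghz):
    """Idealised throughput: assumes perfect scaling across cores."""
    return cores * freq_ghz


one_fast = (1, 4.0)   # one core at 4 GHz
two_slow = (2, 2.0)   # two cores at 2 GHz

# Same idealised throughput for both configurations.
print(relative_throughput(*one_fast), relative_throughput(*two_slow))

# Power ratio of the single fast core to the two slow cores.
ratio = relative_power(*one_fast) / relative_power(*two_slow)
print(ratio)  # 4.0 under the cubic assumption
```

Under these assumptions the single 4 GHz core draws four times the power for the same work. Real chips also have static power and voltage floors, so the actual ratio will differ.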

But as we all know, GROMACS does not scale infinitely with the number of cores, as load balancing and reconciling forces between slices add more and more overhead. Also, more cores means smaller per-core caches and more cache misses. I don't know how relevant a hot cache is to FAH.

What is the sweet spot? Assuming no budgetary concerns, what is the approximate ideal number of cores for a HEDT/server processor if folding is the only activity it will be doing? Is there any data on this? I'm not sure how much I should trust lar.systems, and the GROMACS user forum is not particularly representative of the kind of simulations that FAH performs.

muziqaz · Post by **muziqaz** » Mon May 05, 2025 6:49 am

No data, we just fold.
16 threads/8 cores is a sweet spot. Past that, the performance increase is minimal, yet the bonus point formula still shows a points increase.
GROMACS craps out on random projects when running on 64+ threads.
It almost always craps out completely with anything above 96 threads.
CPUs scale better with new generations of products, which have improved FP pipelines. That stuff scales infinitely. For example, Zen 5 has an FPU that compilers need to be tweaked for before it is utilised properly on desktops, yet FAH picked it up and said: no worries there, we will utilise this to the end of time.

GROMACS loves AMD 3D cache. However, the AMD 9000 series came with tweaked FPU and cache subsystems, and that thing blows 3D cache out of the water (or should I say matches and surpasses older 3D cache products).
Still, if there is a chance to buy a 3D cache product vs. a normal one for FAH, go with 3D cache.
And use Linux as the OS, since the Windows scheduler is pants for FAH.
My 9950X3D on Windows is equal in output to my 9950X on Linux. This shouldn't be the case.
FAH, and I suppose GROMACS, hate Intel's big.shittle architecture. You basically must fold only on P-cores, otherwise FAH will pick up the slowest cores and drop P-core performance to match them.

arisu · Post by **arisu** » Mon May 05, 2025 9:58 am

It looks like the 9000 series is Ryzen. I'm looking at Threadripper or EPYC for the PCIe lanes. But I suppose any Zen5 has the architecture changes you speak of.

Then I suppose I was right in getting a Threadripper 7960X with 24c/48t. GROMACS won't shit itself and it should still have superior energy efficiency compared to a system with the same TDP but only 16 threads running at a higher frequency. If 48t is too much, I could always disable SMT.

The Intel big.LITTLE really is horrible, although GROMACS could just detect P-cores and only fold on them if it was programmed to. The GROMACS load balancer is not equipped to handle such drastic differences in performance, so it results in the E-cores all running at top speed while P-cores get only brief bursts of use, spending their time idle waiting on the E-cores otherwise.

I've found that the best way to make use of that architecture is to run one WU on the P-cores and one on the E-cores, assuming the system has sufficient cooling capacity to maintain boost rates on all cores, otherwise the heat from the E-cores causes the P-cores to throttle. But if cooling is sufficient, the E-cores can be used as well on a separate WU.
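The reasoning above can be sketched with a toy model. It assumes a single WU splits work evenly across threads, so every step waits for the slowest core, and it ignores thermal throttling, communication overhead and any dynamic load balancing. The relative core speeds are hypothetical values chosen for illustration, not measurements, and the 6P+8E layout is just an example configuration.

```python
def single_wu_rate(core_speeds):
    """One WU with work split evenly: each step is bounded by the slowest core."""
    return len(core_speeds) * min(core_speeds)


def split_wu_rate(groups):
    """One WU per group of similar cores; the groups progress independently."""
    return sum(single_wu_rate(group) for group in groups)


# Hypothetical relative per-core speeds (assumptions, not benchmarks).
p_cores = [1.0] * 6
e_cores = [0.4] * 8

mixed = single_wu_rate(p_cores + e_cores)    # one WU across all 14 cores
p_only = single_wu_rate(p_cores)             # one WU on the P-cores only
split = split_wu_rate([p_cores, e_cores])    # one WU per core type

print(f"mixed={mixed:.1f} p_only={p_only:.1f} split={split:.1f}")
```

With these made-up speeds, the mixed WU is slower than folding on P-cores alone, while running a separate WU on each core type comes out ahead of both, provided cooling keeps the P-cores from throttling.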

But I don't like Intel anyway. I prefer AMD.

foxpy · Post by **foxpy** » Mon May 05, 2025 1:12 pm

arisu wrote: ↑Mon May 05, 2025 9:58 am I've found that the best way to make use of that architecture is to run one WU on the P-cores and one on the E-cores, assuming the system has sufficient cooling capacity to maintain boost rates on all cores, otherwise the heat from the E-cores causes the P-cores to throttle. But if cooling is sufficient, the E-cores can be used as well on a separate WU.

In my experience, running a second WU on the E-cores of a 6P+8E configuration limited to 80 W reduces P-core frequency by only 300-400 MHz. It is indeed more efficient to utilize more processors, just as you said :^)

BobWilliams757 · Post by **BobWilliams757** » Mon May 05, 2025 4:37 pm

arisu wrote: ↑Mon May 05, 2025 9:58 am It looks like the 9000 series is Ryzen. I'm looking at Threadripper or EPYC for the PCIe lanes. But I suppose any Zen5 has the architecture changes you speak of.

Then I suppose I was right in getting a Threadripper 7960X with 24c/48t. GROMACS won't shit itself and it should still have superior energy efficiency compared to a system with the same TDP but only 16 threads running at a higher frequency. If 48t is too much, I could always disable SMT.

The Intel big.LITTLE really is horrible, although GROMACS could just detect P-cores and only fold on them if it was programmed to. The GROMACS load balancer is not equipped to handle such drastic differences in performance, so it results in the E-cores all running at top speed while P-cores get only brief bursts of use, spending their time idle waiting on the E-cores otherwise.

I've found that the best way to make use of that architecture is to run one WU on the P-cores and one on the E-cores, assuming the system has sufficient cooling capacity to maintain boost rates on all cores, otherwise the heat from the E-cores causes the P-cores to throttle. But if cooling is sufficient, the E-cores can be used as well on a separate WU.

But I don't like Intel anyway. I prefer AMD.

You might find the link below interesting. The blog is written by a fellow folder, Paragon.

https://greenfoldingathome.com/2 ... ation/

arisu · Post by **arisu** » Mon May 05, 2025 10:02 pm

That's a great blog! I've read his post about the 3090 but I forgot about his site. Thank you!

arisu · Post by **arisu** » Mon Jun 02, 2025 3:41 am

What about folding different WUs on each CCD? The 7960X has 4 CCDs each with 6 cores (12 threads) for a total of 24 cores (48 threads). Would there be an advantage to folding 4 WUs, each with 12 threads and pinning each WU to one CCD, or would it be better to have a single 48 thread WU spread across the CCDs, despite them having NUMA-like properties?
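For anyone wanting to try the per-CCD option, here is a minimal sketch that only computes which logical CPUs would belong to each CCD. The numbering scheme is an assumption (cores numbered CCD by CCD, with the SMT sibling of core i at i + total cores) and should be checked against `lscpu` or the `/sys/devices/system/cpu/cpu*/topology/` files on the actual machine. The helper name `ccd_cpu_sets` is hypothetical.

```python
def ccd_cpu_sets(ccds=4, cores_per_ccd=6, smt=True):
    """Return one set of logical CPU ids per CCD.

    ASSUMPTION: physical cores are numbered CCD by CCD, and the SMT
    sibling of core i is logical CPU i + total_cores. Verify this on
    the real system before pinning anything.
    """
    total_cores = ccds * cores_per_ccd
    cpu_sets = []
    for ccd in range(ccds):
        cores = set(range(ccd * cores_per_ccd, (ccd + 1) * cores_per_ccd))
        if smt:
            cores |= {core + total_cores for core in cores}
        cpu_sets.append(cores)
    return cpu_sets


layout = ccd_cpu_sets()
for index, cpus in enumerate(layout):
    print(f"CCD {index}: {sorted(cpus)}")
```

On Linux, each set could then be applied to a running process with `os.sched_setaffinity(pid, cpus)` or with `taskset`. Finding the right process for each WU is left out here.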

foxpy · Post by **foxpy** » Mon Jun 02, 2025 4:50 am

Well, it is not quite like NUMA. Yes, synchronization across CCDs can be expensive, but memory access speed should be uniform, if I understand it correctly?

muziqaz · Post by **muziqaz** » Mon Jun 02, 2025 6:50 am

Optimal would be 2x 24 threads. Not because of CCDs, but because core_a8 does not scale well past 30 threads.
FAH doesn't really suffer too much from cross-CCD sync.
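The arithmetic behind that suggestion can be sketched as follows: use the fewest WUs that keep each one at or below a thread cap, and spread the threads evenly. The cap of 30 comes from the scaling remark above; the helper name `split_threads` and the even-split policy are illustrative assumptions.

```python
import math


def split_threads(total_threads, max_per_wu=30):
    """Split threads into the fewest WUs that each stay at or below max_per_wu."""
    n_wus = math.ceil(total_threads / max_per_wu)
    base, extra = divmod(total_threads, n_wus)
    # The first `extra` WUs get one additional thread so the split stays even.
    return [base + 1] * extra + [base] * (n_wus - extra)


print(split_threads(48))  # [24, 24]
```

Leaving threads free for GPU work just means passing a smaller total to the same function.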

arisu · Post by **arisu** » Tue Jun 03, 2025 5:09 am

Thank you! Will probably do 1x 24 and 1x 22, to leave room for a few GPU threads (I've got them to play nicely on just one physical core).
