Ryzen 9 3950x Benchmark Machine: What should I test for you?

MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by MeeLee »

There's got to be an explanation as to why the 3950x gets lower PPD between 16 and 25 threads. Either it's running at a lower frequency, or something else inside the system is slowing down (Infinity Fabric, ...).
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by bruce »

FAH makes very heavy use of the FPU to do FP32 operations. Making use of HyperThreading/SMT means there will be additional inter-thread competition for FP32 resources, whether or not there is a change in temperature which alters the effective computation rate. The second "half" of an HT pair contributes less throughput than the first "half" because the shared resource is busy much of the time.
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by _r2w_ben »

I don't have a high core count machine to test on, but this is my theory:
At 16 and 32 threads, the OS is preempting FAH more often than when there is 1 physical core free.
In the 17-23 range, 1 of 17 (5.9%) to 7 of 23 (30.4%) of the threads are bouncing around, taking CPU time from the 16 threads that each have a physical core to themselves. Scheduling this efficiently is challenging. Since the threads need to synchronize to exchange data regularly, the slowest thread (the one allocated the least CPU time per second) determines overall performance.
24-31 still has scheduling challenges but since there are more threads, there should be less frequent migration between cores. Most physical cores are juggling two threads so all threads move forward at a more steady pace.
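
To put rough numbers on the "slowest thread sets the pace" idea, here is a minimal sketch of a toy model. It is not how FAH or GROMACS actually schedules work. The function name and the `smt_yield` value of 1.3 (the combined output of two threads sharing one core, relative to one thread alone) are assumptions made purely for illustration.

```python
def relative_throughput(threads, cores=16, smt_yield=1.3):
    """Toy model: all threads synchronize, so work advances at the
    pace of the slowest thread.

    threads   -- number of CPU folding threads (1 .. 2 * cores)
    cores     -- physical cores (16 on a 3950x)
    smt_yield -- ASSUMED combined output of two threads sharing a core,
                 relative to a single thread on that core
    """
    if not 1 <= threads <= 2 * cores:
        raise ValueError("threads must be between 1 and 2 * cores")
    if threads <= cores:
        slowest = 1.0            # every thread has a core to itself
    else:
        slowest = smt_yield / 2  # at least one core is shared by two threads
    # Every thread is held to the pace of the slowest one.
    return threads * slowest


if __name__ == "__main__":
    for n in (15, 16, 17, 23, 24, 25, 32):
        print(n, round(relative_throughput(n), 2))
```

With the assumed yield of 1.3, 17 threads come out well below 16 threads, and the model only passes the 16-thread figure again at 25 threads. A different yield moves the break-even point, so treat the shape of the curve, not the exact numbers, as the takeaway.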
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by MeeLee »

The 3950x has 16 cores arranged as four quad-core CCXs (two per chiplet), and on most motherboards these can run independently of one another in terms of CPU frequency (unless you select all core load in the BIOS).
In my experience, with up to 16 cores loaded, the CPU gets enough power from the VRMs to run high frequencies.
With 16-25 threads, the CPU sees a non-100% load, so it lowers the CPU frequency. I haven't tested it yet on a non PBO enabled setup, but with PBO, that's what my motherboard does.
Then once you surpass 25 threads, the motherboard probably overrules the CPU TDP of 105W and gives the CPU all it's got, so all threads get another frequency boost, thanks to the extra power the motherboard feeds the CPU beyond the AMD specifications.

I think this might be what happens, but I can't confirm, since I don't have an AMD system running right now (offline for maintenance).
ericmonroes

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by ericmonroes »

I can't understand the difference between power and energy in this case. I found only this review https://differencebtwn.com/difference-b ... -vs-energy
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by bruce »

Let's dig a hole. You move one shovel-full of soil from the bottom of the hole to the surface above. You expended some energy.

Now we give you 30 to 60 minutes. If you work too fast, you get tired and slow down. You're expending your energy at a certain rate. That rate, sustained over the 30 minutes or 1 hr, is the power you're providing.
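
In numbers, energy is power multiplied by time. The following is a minimal sketch of that relationship. The function names are made up for illustration, and the 105 W figure is only the 3950x TDP mentioned earlier in the thread, not a measured power draw.

```python
def energy_kwh(power_watts, hours):
    """Energy (kWh) consumed when drawing a constant power for some time."""
    return power_watts * hours / 1000.0


def average_power_watts(energy_kwh_used, hours):
    """Average power (W) implied by an energy total over some time."""
    return energy_kwh_used * 1000.0 / hours


if __name__ == "__main__":
    # ASSUMPTION: a constant 105 W (the nominal TDP); real draw will differ.
    per_day = energy_kwh(105, 24)
    print(f"{per_day:.2f} kWh per day")
    print(f"{average_power_watts(per_day, 24):.0f} W average")
```

The wattage is what the CPU pulls at any instant; the kWh total is what ends up on the electricity bill after folding for a day.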
Starman157
Posts: 30
Joined: Tue Jul 14, 2020 12:55 pm
Hardware configuration: 3950x/5700XT, 2600x/5700XT, 2500/1070ti, 1090T/7950, 3570K/NA

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by Starman157 »

Has the author ever finished the analysis of the 3950x along with GPU folding? Was part 4 of the report ever released?

One thing I didn't see in the reports was a PPD/Watt/Processor (logical or real) analysis. I suspect that the 15 core level is the most efficient and anything in the 17-32 range would show the resource contention for FP32.

Also, speaking from experience with a 3950x myself, along with a Radeon 6900xt, resource contention takes on a whole new meaning when trying to keep the Radeon fed. It gets to the point where the drop in performance on the 6900xt is equivalent to the entire 3950x contribution, so in essence you're fighting yourself.

Also, the 6900xt is sorely underutilized when atom counts are relatively low. As an example, 89,000 atoms on WU 13444 yields about 3.3 million PPD (at an average GPU utilization of ~75%). Bump the atoms up to 200,000 (WU 17323) and the PPD jumps to over 5.1 million, with an average GPU utilization of ~92%.

I guess this is the problem of having 5120 shading units.
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by MeeLee »

Starman157 wrote:Has the author ever finished the analysis of the 3950x along with GPU folding? Was part 4 of the report ever released?

One thing I didn't see in the reports was a PPD/Watt/Processor (logical or real) analysis. I suspect that the 15 core level is the most efficient and anything in the 17-32 range would show the resource contention for FP32.

Also, speaking from experience with a 3950x myself, along with a Radeon 6900xt, resource contention takes on a whole new meaning when trying to keep the Radeon fed. It gets to the point where the drop in performance on the 6900xt is equivalent to the entire 3950x contribution, so in essence you're fighting yourself.

Also, the 6900xt is sorely underutilized when atom counts are relatively low. As an example, 89,000 atoms on WU 13444 yields about 3.3 million PPD (at an average GPU utilization of ~75%). Bump the atoms up to 200,000 (WU 17323) and the PPD jumps to over 5.1 million, with an average GPU utilization of ~92%.

I guess this is the problem of having 5120 shading units.

I can only speak for Intel CPUs; I've never really folded on AMD CPUs (don't have a job now, no income, so I don't have the luxury to run a system non-stop).
Anyway, on Intel CPUs, even with resource contention, FAH seems to earn more bonus points if you run more threads.
On my Xeon 10 core, with 20 threads, the score was higher at 20 threads than at 10 threads.
Same on my 4 core 8 thread Core i5.
And especially on my Core i processors, when starting at 1 out of 12 threads and adding threads one by one, it appears that PPD increases drastically with each thread added (up to about 2 threads short of an all core/thread load).


For AMD, there are many variables.
I'm only speaking compute here, not FAH specific.
I'm guessing (especially for the 5000 series) that memory might become a bottleneck if you are running more than 24 threads, so only on the 16 core 32 thread CPUs.
I've seen charts where the last 4 cores didn't really add much to the performance, but it may have been due to thermal limitations.

Having anywhere from 24 to 32 threads pull data from memory at ~3.8GHz, when the memory itself only runs at 3.6GHz, could in theory cause a bottleneck if FAH crunches through data faster than the CPU cache can be refilled.
Then again, those wait states per core free up power budget, which can be routed towards other cores/threads that are actively crunching data, allowing those threads to run at higher boost frequencies; having a few threads idle at any given time will also help keep the CPU cooler (which in turn allows the CPU to select higher boost frequency tables).
I would say, if you run slower memory and IF frequencies (the first Ryzen 3000 CPUs sometimes couldn't run the IF past 1800MHz), it might be better not to run more than 16-24 threads.
For faster memory (3600MHz or greater) with IF maxed out (especially on 5000 series where the IF reaches higher frequencies), I believe SMT enabled (<32 threads) should be faster than SMT disabled.


So there are 2 theories, with my personal presumption being:
The CPU probably keeps TDP in check, so you'll be running slightly more efficiently with SMT enabled vs disabled, while other parts of the CPU (mainly the Infinity Fabric and memory access) will benefit greatly (and run more efficiently) with SMT disabled.
Disabling SMT will lead to a cooler running CPU, but the CPU will then boost to higher frequencies, negating any power savings you got from disabling SMT; as a result, the CPU will still run at the same TDP whether SMT is disabled or enabled.
Only, with SMT enabled the cores themselves run at lower boost frequencies than with SMT disabled, which by default makes them run more efficiently (faster at ~the same TDP).

On the other hand, if you lock the CPU to a fixed frequency, e.g. 3.8GHz, and disable the auto overclocking feature, the system will most likely run much cooler with SMT disabled, but it'll also run much slower.
So if you set your boost frequency manually, to a fixed speed, it's best to keep SMT enabled.
Until then, we still need the first guy to actually do the test on CPUs.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by bruce »

MeeLee wrote:Until ...
Good theories .... awaiting confirmation from testing.

Regarding RAM speed constraints, does that CPU have faster VRAM somewhere other than the cache or does it always use main RAM [+cache]? (AMD likes to advertise that that's an advantage, and it might not be for FAH.)
Starman157
Posts: 30
Joined: Tue Jul 14, 2020 12:55 pm
Hardware configuration: 3950x/5700XT, 2600x/5700XT, 2500/1070ti, 1090T/7950, 3570K/NA

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by Starman157 »

I've found something interesting.

When using SMT, it would appear that the Windows scheduler, at times, chooses to use the second logical processor on a core rather than an unused core. This effect is noticeable in Task Manager when looking at the individual core loading. I suspect that this is what lowers performance when folding with more than 14 threads or so. The Windows scheduler thinks that assigning the additional threads to already saturated CPUs is a good thing, most likely based on the relative "quality" score of each of the cores in a CCX. I'm thinking that this isn't helpful, as an unused core is far more desirable than one that is already saturated. When running the same workload on two threads (one core), SMT trickery isn't going to fix contention issues. It'll try its best to get things done, but it's NOT the same as an unused core. I don't know if this effect applies to Intel processors, as I don't have one to try it out on. I'm also not sure if this "effect" is also found in the Linux scheduler, as I haven't been successful in getting AMD graphics cards to operate in that OS.

When FahcoreA7 (CPU folding) is forced to operate only on "real" cores, the PPD/LP (points per day per logical processor) goes up, by a lot: 400,000 PPD when forced (or better depending on WU), up from ~275,000 PPD (I've seen it lower depending on WU) when left to the "wild west" of Windows all-"CPU" scheduling. An improvement of roughly 125,000 PPD! Or to put it another way, a jump from 17,500 PPD/LP to 26,000! (I've seen it even higher depending on WU)
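
For anyone wanting to redo that arithmetic on their own numbers, here is a minimal sketch. The function name is made up, and the thread counts (16 logical processors for the unpinned run, 15 for the pinned run) are assumptions, since only the PPD totals are quoted above.

```python
def ppd_per_lp(total_ppd, logical_processors):
    """Points per day per logical processor used for CPU folding."""
    if logical_processors < 1:
        raise ValueError("need at least one logical processor")
    return total_ppd / logical_processors


if __name__ == "__main__":
    # PPD totals are the approximate figures quoted in the post.
    # ASSUMPTION: 16 threads when unpinned, 15 threads when pinned.
    unpinned = ppd_per_lp(275_000, 16)
    pinned = ppd_per_lp(400_000, 15)
    print(f"unpinned: {unpinned:,.0f} PPD/LP")
    print(f"pinned:   {pinned:,.0f} PPD/LP")
    print(f"total gain: {400_000 - 275_000:,} PPD")
```

Under those assumed thread counts the per-LP figures land close to the rounded values in the post; other thread counts will shift them.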

I've done this by forcing the affinity (in Windows Task Manager) of Fahclient to only real cores by removing the second logical processor assigned to the core. In essence, removing all odd numbered "CPU"s does this. Note: I don't want to remove SMT from the system, as it comes in handy fitting in all of the other sundry services that Windows needs.

I suspect this method ensures that the CPU folding isn't fighting for FP32, or any other, resources, since after all, there's only one real execution unit per 2 logical cores and I'm almost ensuring that FAH is running on that execution unit. I'm guessing context switching isn't a good thing either.

So in essence, I'm running my 3950x CPU folding at 15 cores, on even numbered "CPU"s, leaving one core (2 logical processors or "CPU"s) for Fahcontrol and other sundry Windows stuff, just to keep them out of the way of CPU folding. FahcoreA7 (CPU folding) inherits the CPU affinity of Fahclient, so this is where it's done. But here's the problem: Fahcore22 (GPU folding) also inherits the Fahclient affinity and, as you've probably guessed, GPU folding finds those CPUs already busy with CPU folding work. This kills performance of GPU folding (not to mention the minor annoyance to CPU folding).
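
The "even numbered CPUs only, minus one reserved core" selection can also be written as an affinity bitmask, which is the form Task Manager's checkboxes boil down to and which the Windows `start /affinity` command accepts as a hex value. Below is a minimal sketch that only computes the mask. The function name is made up, and it assumes logical CPUs 2k and 2k+1 are the two SMT siblings of physical core k, as described above; check the topology on your own system before relying on it.

```python
def physical_core_mask(logical_cpus=32, reserved_cores=1):
    """Bitmask selecting the first SMT sibling of each physical core,
    leaving the highest-numbered `reserved_cores` cores unselected.

    ASSUMPTION: logical CPUs 2k and 2k+1 share physical core k.
    """
    if logical_cpus < 2 or logical_cpus % 2:
        raise ValueError("expected an even number of logical CPUs")
    usable_cores = logical_cpus // 2 - reserved_cores
    if usable_cores < 1:
        raise ValueError("no cores left after reserving")
    mask = 0
    for core in range(usable_cores):
        mask |= 1 << (2 * core)  # even-numbered logical CPU of this core
    return mask


if __name__ == "__main__":
    mask = physical_core_mask()   # 3950x: 32 logical CPUs, 1 core reserved
    print(hex(mask))              # hex form of the mask
    print(bin(mask).count("1"))   # number of logical CPUs selected
```

Note that this only builds the number. Because the FahCores inherit whatever affinity Fahclient has, applying one mask to Fahclient still affects CPU and GPU folding alike, which is exactly the limitation described above.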

So as a feature request, Bruce (and team): is it possible to add separate CPU and GPU affinity control to FAH? Something buried in Fahcontrol under the configure slots section? Just default to all CPUs for those that don't know anything about affinity. Something where the "CPU"s can be assigned individually would be great. I know this request sounds simple and is most likely fraught with all sorts of Windows complexity.

The CPU runs cooler, and it's far more efficient as it's using less power, which helps with PBO. (Note: I've tried manual OC and found that I have nowhere near the same fine grained control over clocks and voltages as PBO does! OK, PBO it is)

The brute force method isn't always a good thing, which I suspect is why the PPD improves from around 24 threads up to 31. Everything is running in contention by the time 24 threads are used, so why not brute force more contention by consuming more power. That lowers CPU clocks as it tries to deal with the heat. Ever notice in the charts provided by the author of this forum post that the slope of improvement from 16 to 32 threads is far shallower than the climb from 1 to 15? Contention and clock lowering.

Anyway, just a heads up that CPU folding has a lot of additional performance still to be realised, while at the same time efficiently performing GPU work.

Note: I have a screen capture of the various effects but haven't found an easy way of inserting it into this reply. It demonstrates the improvements and problems.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by bruce »

Most of what you documented has appeared before, though it's difficult to find. One FAH task per physical core is more efficient than adding more due to FP32 contention.

Over the short-term, FP32 processing load may seem uniform, but it's not. There's a mixture of non-FP32 workload, too. Over the long-term, there are other nonuniformities, too.

For the most part, we can assume that the pieces of processing work that the FAHCore distributes to the threads are identical ... though there may be exceptions, e.g. bonded and non-bonded calculations are different and may be assigned to different threads.

Assuming uniformity [??], the overall speed would be limited by the slowest thread. If you have N physical cores with 2N threads and you distribute work to N+1 threads, the physical core that's running the extra thread will be processing the two slowest threads. Periodically, GROMACS attempts to balance the workload by shifting some atoms from the slowest threads to faster ones. This can result in unpredictable performance changes as the Windows scheduler decides to move things around too.
Starman157
Posts: 30
Joined: Tue Jul 14, 2020 12:55 pm
Hardware configuration: 3950x/5700XT, 2600x/5700XT, 2500/1070ti, 1090T/7950, 3570K/NA

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by Starman157 »

Yes, I've noticed the shifting too. Even though I've reported the increased PPD per core (or aggregate), the changes in WU calculations as well as the rebalancing can make the total PPD vary by +/- 20,000 PPD from the 400,000 I reported. This variation also affects the non affinity tuned work, but I've never seen Gromacs move its work to an unused core/thread. If 15 threads are used, and Windows in a non affinity tuned environment has decided that using 2 threads on core 1 (CPU 0 and 1) is a good idea, I've never seen the thread on CPU 1 (2nd thread) move over to some unused core. I suspect that Gromacs doesn't have the capability to assign where the threads get used; that's up to the Windows scheduler.

As for other non-uniformities, I've noticed that GPU processing WU 13444 has higher performance for the first 27%, then drops about 200,000 PPD for the next 25%, rises again to starting values for the following 25%, then drops 200,000 PPD again to the end. At least that is on the 6900xt. So it appears non uniformity isn't just a small scale issue, but can appear to be long term and vary depending on where in the WU processing is. What I was trying to figure out was whether simultaneous CPU processing was causing it. It wasn't. I guess it's just a feature of that WU.

Again, is affinity tuning something that can be implemented in the FAHClient configuration?
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by bruce »

Gromacs rebalancing doesn't know when (or why) the Windows scheduler might have made changes. It only knows what imbalance can be seen on its internally benchmarked threads. Since the logic that you're seeing is a result of two different blocks of code, you can wish they communicated with each other but they do not.
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by MeeLee »

bruce wrote:
MeeLee wrote:Until ...
Good theories .... awaiting confirmation from testing.

Regarding RAM speed constraints, does that CPU have faster VRAM somewhere other than the cache or does it always use main RAM [+cache]? (AMD likes to advertise that that's an advantage, and it might not be for FAH.)

As far as I know, only Ryzen 3000-4000 series with a 'G' (or 'GE') on the back have an IGP built in.
The IGPs pull the data straight from the IF (Infinity Fabric) so, it is entirely possible for the IGP to pull some of the data directly from L-Cache when it's loaded in L-cache.
For all other data, it'll use regular VRAM.
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: Ryzen 9 3950x Benchmark Machine: What should I test for

Post by MeeLee »

@Starman157:
As to why Windows would select 2 cores right next to one another:
On the Ryzen CPUs, especially on the 3900x and 3950x, each CPU is divided into chiplets.
Each chiplet (or CCD) is given an L-cache block.
If data that needs processing depends on data already available in the L-cache connected to the first 4 CPU cores (8 threads), the scheduler will try to use an available CPU out of those 4.
Using e.g. threads 9-32 would mean that the data needs to be copied from L-cache 1 to L-cache 2 (or 3 or 4), which would cost extra compute cycles.
This is why threads are assigned in such a way.
If you are only using up to 75-80% of your Ryzen CPU's threads, the CPU will go into a power saving mode, where it tries to shut down cores, but later has to reactivate them, which costs power.
This is why you're seeing normal power consumption levels, but reduced CPU frequency.
Ryzen CPUs run best at either an all thread load, or an all core load.
If you want to test out the performance of core vs thread, it is best to disable SMT in the BIOS rather than go the Task Manager affinity route, and to do the test on as many threads as you can (don't load below 75% of the cores, unless you are running GPUs that will fill up some extra threads on the CPU).

Your results will vary from system to system.
Systems with better cooling and higher quality, higher power VRMs (usually paired with 3x 4-pin CPU power plugs, vs only 1x or 2x 4-pin on the motherboard) will be able to push the frequencies higher with PBO enabled.

What makes this all difficult to measure is:
1- The manual overclock, power and voltage settings in the bios on the CPU
2- The automatic settings ruled by AMD's BIOS software
3- The motherboard manufacturer's tweaks to the board on the CPU
4- The power settings of the OS
5- The cooling capabilities of the CPU
6- Pausing a WU will cause a drop in PPD.
7- WUs in between themselves have different PPD ratings, on the same hardware.
8- Background activity
...
Just to name a few..

Which is why it's better to go with a 3 fan (360mm) water cooler or a high efficiency air cooler, have good case cooling, and do tests on Linux rather than Windows, as Linux has less background activity.
Also, understand how your core and thread changes affect internal power profiles, CPU frequency changes, etc...
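
With that many variables in play, a single PPD reading says little, so it helps to record several readings per configuration and compare averages and spread. Below is a minimal sketch using Python's standard `statistics` module. The function names are made up, the sample readings are hypothetical and not measurements from this thread, and the comparison rule is a crude rule of thumb rather than a proper statistical test.

```python
from statistics import mean, stdev


def summarize_ppd(samples):
    """Return (average, sample standard deviation) of repeated PPD readings."""
    if len(samples) < 2:
        raise ValueError("need at least two readings per configuration")
    return mean(samples), stdev(samples)


def clearly_different(config_a, config_b):
    """Crude rule of thumb: the averages differ by more than the sum of
    the two spreads. This is NOT a formal significance test."""
    mean_a, sd_a = summarize_ppd(config_a)
    mean_b, sd_b = summarize_ppd(config_b)
    return abs(mean_a - mean_b) > sd_a + sd_b


if __name__ == "__main__":
    # HYPOTHETICAL readings, one per WU, for two configurations.
    pinned = [380_000, 400_000, 420_000]
    unpinned = [265_000, 275_000, 285_000]
    print(summarize_ppd(pinned))
    print(summarize_ppd(unpinned))
    print(clearly_different(pinned, unpinned))
```

Readings for the two configurations should ideally come from the same projects, since different WUs score differently on the same hardware.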

Ryzen is by far the most complex CPU to benchmark properly, because many of the BIOS settings counteract other settings (of the manufacturer or AMD's own settings).