Ryzen 9 3950x Benchmark Machine: What should I test for you?

Paragon · Post by **Paragon** » Sun May 17, 2020 1:05 pm

Hi all,

I'm looking for suggestions for things to test out on my new Ryzen 9 build for my blog on folding@home performance and energy efficiency. Currently I am sweeping through the # of threads settings for CPU folding, logging PPD across multiple work units as well as system power consumption. This should provide a lot of insight on how high core count machines can be optimized. Right now, I am making these plots:

PPD, Power Consumption (Wall), and PPD/Watt plots for 1-32 threads (SMT Enabled)
PPD, Power Consumption (Wall), and PPD/Watt plots for 1-16 threads (SMT Disabled)

I am also going to do these with Core Performance Boost disabled, to show the effect of the automatic frequency boosting at lower CPU core loads. I'm trying to get a minimum of 3 work units for each setting, to average out some of the work unit PPD variability.
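
As a minimal sketch of the averaging described above, the snippet below takes several work-unit PPD readings and wall-power readings for one thread setting and reports the mean PPD, mean watts, and PPD/Watt. The function name `ppd_per_watt` and all the numbers are illustrative placeholders, not measurements from this build.

```python
from statistics import mean


def ppd_per_watt(ppd_samples, wall_watts_samples):
    """Average PPD and wall power over several work units, then divide."""
    if not ppd_samples or not wall_watts_samples:
        raise ValueError("need at least one sample of each")
    avg_ppd = mean(ppd_samples)
    avg_watts = mean(wall_watts_samples)
    return avg_ppd, avg_watts, avg_ppd / avg_watts


# Placeholder readings for one thread setting across three work units.
avg_ppd, avg_watts, efficiency = ppd_per_watt(
    [100_000, 110_000, 90_000], [150.0, 155.0, 145.0]
)
print(f"{avg_ppd:.0f} PPD at {avg_watts:.0f} W = {efficiency:.1f} PPD/W")
```

Averaging PPD and power separately before dividing keeps a single outlier work unit from dominating the efficiency figure.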

Other ideas people have pitched so far:
1. AMD Infinity Fabric underclocking / undervolting
2. System memory --> disable XMP profile and run at 1.2 v instead of 1.35 v

Things I have tested so far:

* GPU Folding: Impact of the background system on GPU performance (I ran this new build with the 1650 graphics card I had previously tested in my old PCI-Express 2.0 machine, and saw no difference in production now that it is on PCI-Express 4.0). Note: I want to redo this with a faster card (my 1080 Ti) to see if a faster card benefits from PCI-E 4.0.

* Effect of system power supply on efficiency (got a 4% efficiency improvement by going from 80+ gold to 80+ titanium)

* Effect of Core Performance Boost on GPU folding (no difference in production, but shutting it off shaved 20+ watts off system power consumption and left the CPU running at 55 degrees C vs. 70 while GPU folding).

You can read my initial article here.

https://greenfoldingathome.com/2020/05/ ... yzen-time/

Let me know what other things I should be testing out / can help with.

holstien · Post by **holstien** » Sun May 17, 2020 1:28 pm

Does the 1080Ti support PCIe 4 lanes?

I would be curious how it scales across multiple faster OpenCL cards (both nvidia and amd). My experience with the 3900x is that some types of work units need more than one thread free to keep fast AMD cards fed on linux. Sometimes 1 thread is just fine feeding my gpus, and other times it seems like 3-4 is what it takes for full utilization. This is not a consistent state: some combinations of CPU/GPU work are fine with 1 thread free. I believe this could be a hyperthreading related issue. Maybe there are some interrupt timers that could be optimized? I'll think about it.

The test I am proposing is to try running 2 fast gpus with 1 worker thread, then scale to 4, then turn on hyperthreading and try again. Maybe switch between red and green vendors and try yet again. The hypothesis is that the hyperthreaded gpu workers suffer timing latencies when the cores are loaded, which results in poor gpu output.

("holstien, why don't you test this yourself?" I'm so glad you asked! The weather is getting warm and I only fold when it's cool/at night now. If I do it we'll have to wait until September.)

BobWilliams757 · Post by **BobWilliams757** » Sun May 17, 2020 1:30 pm

This makes me laugh. I was just seconds ago reading this recent blog when researching some of your GPU articles, and wondering why you hadn't posted info on your new rig here.

Having already read about the efficiency improvements you've made to the new rig, I'd suggest staying on that same track. It seems most efforts until now have been in GPU performance and efficiency, while CPU efficiency hasn't been noted as much. Though it seems common for people to reduce power on GPUs, far fewer seem to do the same for CPUs. I suspect that overall the efficiency increases will still be less than what people find with GPUs, but with some of the power hungry CPUs on the market it could all still add up.

It seems like you already have a fairly good lineup in the works for some testing, so I've got nothing to suggest at this point.

BUT... since most of us don't buy a Kill-O-Watt or similar, just an efficiency related question. Do you use any software to monitor your system, and if so, could you post what you are finding on the Ryzen power consumption figures vs at the wall? Since HWMonitor and others report package powers and such incorrectly, I'm trying to get some better numbers of what offset to use. That combined with my MB overhead and PS efficiency at low power draw gives me a fairly good picture of how it's impacting my power consumption.

MeeLee · Post by **MeeLee** » Mon May 18, 2020 12:48 am

What would be interesting is to see how CPU WUs run as one slot versus running each WU in its own thread.

As far as PCIE 4.0, the 1660 only runs at 3.0. To be PCIE bandwidth limited, you'll have to run it from an x1 slot.
You won't notice much performance difference, when running from a faster than 3.0 x1 slot.

Infinity fabric would be interesting, as well as memory, to see how it affects the speed of folding.
I bet the numbers will be affected only a little, as most of the data is read from its rather large L-Cache. But surprise us!
The 3950 should have 1Tflops, which should equal the performance of a GT1030 (65k PPD), but I think some people have reported 100k PPD on their 3900X.
Would be interested to see the ratings!

What I can tell you is that the Ryzen 9 3000 series runs best at a fixed clock setting. You can shave off a few Watts by finding your best clock and then fixing it to that.
(Eg: 3800 MHz, and then disable PBO)

_r2w_ben · Post by **_r2w_ben** » Mon May 18, 2020 5:24 pm

Total output over a 24 hour period of the following configurations would be interesting:

1x32 CPU slot (SMT Enabled)
vs.
2x16 CPU slots (SMT Enabled)
vs.
1x16 CPU slot (SMT Disabled)

A record of the count and project # of completed work units would be good. Then compare the base points completed and average system PPD including the bonus over 24 hours. This should help to test scaling and dividing a CPU into smaller slots vs. the biggest slot possible.

PantherX · Post by **PantherX** » Mon May 18, 2020 7:15 pm

My guess would be from Highest PPD to lowest PPD assuming same Project is used in all the tests and the CPU frequency remains the same:
1x32 CPU slot (SMT Enabled)
1x16 CPU slot (SMT Disabled)
2x16 CPU slots (SMT Enabled)

Reason is that folding is physical core hungry (FPU to be exact IIRC) so SMT does provide some improvement but not as much as adding a physical core. With 2 WUs, you would be struggling to share the physical cores between two of them which will cause a slow-down in folding. I would be keen to see the results

lafrad · Post by **lafrad** » Tue May 19, 2020 1:32 am

My guess would be:
1x27 CPU slot (SMT Enabled)
would be the higher PPD, due to less thread contention with the OS and various other things running.

MeeLee · Post by **MeeLee** » Tue May 19, 2020 3:18 pm

PantherX wrote: ...
1x16 CPU slot (SMT Disabled)
2x16 CPU slots (SMT Enabled)
...

I think the order makes sense if you're getting more PPD from QRB with SMT disabled.
SMT enabled does more work per time frame, but you'll get a reduced Quick Return Bonus.

If SMT disabled gets higher PPD than SMT enabled, then an additional benefit is that you'd get the increase in PPD at much lower wattage.
Yet another inconsistency in the QRB system.
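
A rough sketch of why the bonus makes PPD non-linear in throughput, assuming the published Folding@home bonus formula `final_points = base_points * max(1, sqrt(k * deadline_length / elapsed_time))`. The base points, `k`, and deadline values below are placeholders, not taken from any real project, and the function names are hypothetical.

```python
import math


def wu_points(base_points, k, deadline_days, elapsed_days):
    """Points for one work unit, including the quick return bonus (assumed formula)."""
    bonus = max(1.0, math.sqrt(k * deadline_days / elapsed_days))
    return base_points * bonus


def ppd(base_points, k, deadline_days, elapsed_days):
    """Points per day for a slot that finishes one such work unit every elapsed_days."""
    return wu_points(base_points, k, deadline_days, elapsed_days) / elapsed_days


# Placeholder project parameters (not from a real project).
BASE, K, DEADLINE = 1000, 0.75, 4.0

one_fast_slot = ppd(BASE, K, DEADLINE, elapsed_days=0.5)
two_slow_slots = 2 * ppd(BASE, K, DEADLINE, elapsed_days=1.0)
print(f"one fast slot: {one_fast_slot:.0f} PPD, two half-speed slots: {two_slow_slots:.0f} PPD")
```

Under this assumption, halving the completion time multiplies a slot's PPD by 2 * sqrt(2), so one slot finishing quickly outscores two slots that each take twice as long, even though the raw work done is the same.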

Paragon · Post by **Paragon** » Sat May 23, 2020 1:55 pm

Thanks for all the replies! I can see there are endless possibilities for testing here. One thing I've found is that this is going to be an ongoing adventure, since it takes so long to even get one data set across these 32 threads.

Here is part 1, the simplest test: 1-32 cores, just pulling PPD out of the client. The plot everyone wants to see is at the bottom. Work unit variability messes up the plot towards the high end, but I think the trend is pretty clear. Also, seeing a CPU do over 400K PPD is pretty nuts.

I am currently running this all over again, logging power as well as running multiple work units per core setting. I think some averaging will clear up the trend.

https://greenfoldingathome.com/2020/05/ ... f-threads/

BobWilliams757 · Post by **BobWilliams757** » Sat May 23, 2020 2:48 pm

Really interesting plot. The way the consistency drops off as thread count goes up might just show how complex the various projects are in variance and how a CPU deals with them.

In this instance, I'm glad I'm not you. Looking at that plot I'd have so many combinations to try that I could keep the machine running for months just testing all the suspicions I would see in the plot. I think the first would be configuring 3x slots at 10 threads each to see what happens vs a single slot at 30-32 threads.

It's also amazing how efficient (in PPD) the CPU is at low thread counts, especially considering the boost is off. I would suspect with the boost enabled, it would better my system (2400G) on only a couple of threads. And using 6 (the max on my machine), it would easily generate 4x the points. That's a huge change in generations of CPUs.

_r2w_ben · Post by **_r2w_ben** » Sat May 23, 2020 2:56 pm

Great write up Paragon! The Task Manager screenshot at the bottom is interesting. It's good to see that Windows is scheduling work on every other core so there is less sharing. As the threads reach a synchronization point and those running on a dedicated core finish first, do you see the load spread to different cores?

You can save a bit of time by skipping some CPU counts. FAH automatically reduces the number of CPUs to avoid primes and logs that change. Here's a list you can use as a reference:

Code: Select all

1
2
3
4 = 5
6 = 7
8
9
10 = 11
12 = 13 = 14
15
16 = 17
18 = 19
20
21 = 22 = 23
24
25 = 26
27
28 = 29
30 = 31
32

That leaves 20 out of 32 counts that FAH allows and should cut down your testing.
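
For scripting a sweep, the list above can be turned into a small lookup. This is only a sketch: `ALLOWED` is transcribed from the list in this post, and `effective_threads` is a hypothetical helper name, not anything provided by FAHClient. It does not try to reproduce the rule the client uses internally.

```python
from bisect import bisect_right

# Thread counts that appear as the first number on each line of the list above.
ALLOWED = [1, 2, 3, 4, 6, 8, 9, 10, 12, 15, 16, 18, 20, 21, 24, 25, 27, 28, 30, 32]


def effective_threads(requested):
    """Return the thread count the list says a requested count is reduced to."""
    if not 1 <= requested <= 32:
        raise ValueError("the list only covers 1 to 32 threads")
    # Largest allowed value that does not exceed the request.
    return ALLOWED[bisect_right(ALLOWED, requested) - 1]


for n in (5, 14, 23, 32):
    print(n, "->", effective_threads(n))
```

Iterating over `ALLOWED` directly, instead of `range(1, 33)`, skips the 12 settings that would repeat a neighbouring result.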

MeeLee · Post by **MeeLee** » Sat May 23, 2020 5:13 pm

The problem with the 3900x and the 3950x is that if you have a motherboard with a 10 VRM count, it usually doesn't have enough power to run the CPU at its rated frequencies.
MSI sells boards that have 12 VRMs, and the CPU uses a 3x 4-pin power connector from the PSU.

The second issue is that you'll possibly need to set the CPU frequency to fixed.
If you're using PBO, your frequencies are all over the place.
And quite often you can gain 25-50 MHz just by going fixed frequency.

Third, if you have a 10 VRM board, your board needs a VRM cooling fan if you want to push the CPU beyond 3.9 GHz.

Fourth, Ryzen 9 3000 series CPUs are very speed dependent on the RAM.
If you run them with stock 2133 MHz RAM, they perform much worse than at higher frequencies.
Tests published online state that they run best with DDR4 3700 MHz RAM modules, as the RAM and the Infinity Fabric operate at the same frequency.
Meaning, most older Ryzen 9 3000 CPUs had an Infinity Fabric that could do <1800 MHz.
Newer chiplets have been slightly optimized and can run the Infinity Fabric at <1900 MHz.
Since RAM speed is Double Data Rate, a 3600 MHz module actually operates at 1800 MHz.
Your Infinity Fabric speeds should be linked to the RAM speeds (don't set them to auto).
There is, however, a debate over whether IF set to auto might slow down the bus ring, resulting in less heat and higher core boost frequencies.
The con is that data will be read slower from RAM.
Which is why most BIOS versions allow a 'performance' setting on the IF, which means it'll run at its max speed all the time.
The heat penalty is minor, so long as the system runs stable.
On average IF can be overclocked to 1850 MHz safely (not on older CPUs); paired with DDR4 4000 MHz memory, you can get the most out of your system by setting the memory to 3700 MHz and lowering the CAS latencies.
But if the memory is too costly (they're hard to get nowadays), Amazon sells 3600 MHz modules for $75 (for 2x8GB sticks), which can safely be overclocked to 3700 MHz using the same latencies as at 3600 MHz.
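
The double-data-rate arithmetic above can be sketched in a few lines. The helper names are hypothetical, and the 1850 MHz fabric limit is just the figure quoted in this post, not a guaranteed value for any particular chip.

```python
def memory_clock_mhz(ddr_rating):
    """Actual memory clock for a DDR rating, e.g. DDR4-3600 runs at 1800 MHz."""
    return ddr_rating / 2


def fits_fabric(ddr_rating, fabric_limit_mhz):
    """True if the memory clock can be matched by the Infinity Fabric at the given limit."""
    return memory_clock_mhz(ddr_rating) <= fabric_limit_mhz


# Fabric limit taken from the post above (an assumption for any specific CPU).
FABRIC_LIMIT_MHZ = 1850

for rating in (2133, 3600, 3700, 4000):
    clock = memory_clock_mhz(rating)
    verdict = "matches" if fits_fabric(rating, FABRIC_LIMIT_MHZ) else "exceeds"
    print(f"DDR4-{rating}: {clock:.1f} MHz memory clock, {verdict} a {FABRIC_LIMIT_MHZ} MHz fabric")
```

This is why a DDR4 4000 kit would be set to 3700 in the scenario described: 3700 / 2 = 1850 MHz lands on the fabric limit, while 4000 / 2 = 2000 MHz does not.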

Once the IF is optimized, you can optimize the CPU speed, for a more consistent PPD readout, less affected by ambient temperature fluctuations that affect CPU temperatures and speed.

Once you have the PPD results of all core/threads, you'll have to redo the CPU tuning with SMT disabled, as you probably will be able to overclock to higher CPU speeds.
It's a tedious project that will probably keep you busy for an entire day, just to get the voltages, and PBO settings correctly, without running the CPU in excess of 90C (60-80C preferably).

But (stable) manual overclocking offers much more consistent results than with PBO.
Trying to determine PPD with PBO enabled, will not only lower your PPD, but will also be very inconsistent.
Almost as if you're trying to get an average on a random number generator.

PantherX · Post by **PantherX** » Sat May 23, 2020 7:40 pm

Great write-up and I look forward to the comparison between SMT on and off.

Back when i7-860 was released, I tested it using the same WUs* to see if the HT on/off would make a difference. I discovered that there's a 12% to 25% reduction in TPF when going from 4 threads to 8 threads (IIRC).

I have two suggestions:
1) Can you please compare the TPF from the same WU*? IMO, PPD is great but what I look at is TPF as it provides a better representation. PPD will vary with internet speed and any server issues (fingers crossed, there won't be any outage).
2) You can capture 2 WUs*, one from a small Project (14677) and one from a large Project (14236), to see how well or badly each project scales. Of course, there's nothing that's preventing you from capturing WUs from different atom ranges if you feel like it.
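
A small sketch of how TPF comparisons translate into speedup, using the 12% and 25% reductions mentioned above. It assumes a frame is 1% of a work unit (so 100 frames per WU); the function names and the 100-second baseline are illustrative only.

```python
def speedup_from_tpf(tpf_before_s, tpf_after_s):
    """Throughput ratio implied by two time-per-frame readings on the same WU."""
    return tpf_before_s / tpf_after_s


def wu_duration_hours(tpf_s, frames=100):
    """Time to finish a whole WU, assuming one frame is 1% of the work unit."""
    return tpf_s * frames / 3600


# A 12% and a 25% TPF reduction from a made-up 100 s baseline.
for reduction in (0.12, 0.25):
    after = 100.0 * (1 - reduction)
    print(f"{reduction:.0%} lower TPF -> {speedup_from_tpf(100.0, after):.2f}x throughput")
```

Because both readings come from the same captured WU, the ratio is unaffected by upload time or server delays, which is the advantage over comparing PPD.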

*Generally speaking, you can't predict what WU your system will get. However, once you get a WU from a Project you like, you can capture it for benchmarking to your heart's content without any hindrance to F@H. The method I used was in early V7 release and I assume that it still works:
1) Once you have spotted the elusive WU you want to capture, pause the CPU slot, copy the entire %AppData%\FAHClient folder (let's call it Benchmark)
2) Resume the CPU Slot and set it to Finish
3) After the CPU Slot has finished, disconnect the LAN cable (this is a fail safe step)
4) Exit the FAHClient
5) Copy the contents of %AppData%\FAHClient folder again (let's call this Live)
6) Delete the contents of %AppData%\FAHClient and copy the contents of Benchmark into it
7) Modify the config.xml file as follows (I have made comments next to each setting that I think might help you achieve consistency with the highest optimization):

Code: Select all

<config>
  <!-- Folding Core -->
  <checkpoint v='30'/>                     <!-- Frequent checkpoints will slow down CPU processing. Setting it to the max will ensure the highest performance level -->
  <core-priority v='low'/>                 <!-- Higher priority than idle ensures that idle processes in Windows don't impact CPU performance -->

  <!-- Slot Control -->
  <pause-on-start v='true'/>               <!-- This allows you to start up the client and ready the system before you pull the trigger -->
  <power v='full'/>                        <!-- Just in case, to ensure that the client doesn't do anything funny -->

  <!-- User Information -->
  <user v='Benchmarking_In_Progress'/>     <!-- You can easily identify that this is a benchmark and not a real WU -->

  <!-- Work Unit Control -->
  <dump-after-deadline v='false'/>         <!-- You can now fold this WU way past the deadline if you want to -->
  <next-unit-percentage v='100'/>          <!-- Prevents the FAHClient from disturbing the CPU folding at 99% -->

  <!-- Folding Slots -->
  <slot id='0' type='CPU'>
    <cpus v='4'/>                          <!-- Start with 1 and use _r2w_ben's list of CPU values -->
  </slot>
</config>

8) Once you have finished benchmarking, exit FAHClient, delete the contents of %AppData%\FAHClient and then copy Live
9) Plug the LAN cable in and then start up FAHClient which should resume as if nothing ever happened.
Do note that if the WU was downloaded when the CPU had 16 CPUs, it will not run on any value higher than 16 in the Benchmark phase.
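
The folder copies in steps 1, 5, 6 and 8 could be scripted along these lines. This is a sketch only: `snapshot` and `restore` are hypothetical helper names, it assumes Python 3.8 or newer (for `dirs_exist_ok`), and it assumes FAHClient has been paused or exited as described in the steps before anything is copied.

```python
import shutil
from pathlib import Path


def snapshot(client_dir, backup_dir):
    """Copy the whole client folder to a new backup folder (steps 1 and 5)."""
    shutil.copytree(client_dir, backup_dir)


def restore(backup_dir, client_dir):
    """Empty the client folder, then copy a backup's contents into it (steps 6 and 8)."""
    client_dir = Path(client_dir)
    for entry in client_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    shutil.copytree(backup_dir, client_dir, dirs_exist_ok=True)


# Example on Windows (paths are illustrative; adjust to your own setup):
#   import os
#   client = Path(os.path.expandvars(r"%AppData%\FAHClient"))
#   snapshot(client, Path(r"C:\FAH\Benchmark"))   # step 1
#   snapshot(client, Path(r"C:\FAH\Live"))        # step 5
#   restore(Path(r"C:\FAH\Benchmark"), client)    # step 6
#   restore(Path(r"C:\FAH\Live"), client)         # step 8
```

Restoring the same Benchmark copy before every run means each thread setting starts from an identical WU state, which is what makes the TPF numbers comparable.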

I hope this helps you out and you can speed up the workflow

Paragon · Post by **Paragon** » Sat May 23, 2020 10:35 pm

_r2w_ben wrote:Great write up Paragon! The Task Manager screenshot at the bottom is interesting. It's good to see that Windows is scheduling work on every other core so there is less sharing. As the threads reach a synchronization point and those running on a dedicated core finish first, do you see the load spread to different cores?

You can save a bit of time by skipping some CPU counts. FAH automatically reduces the number of CPUs to avoid primes and logs that change. Here's a list you can use as a reference:
Code: Select all
1
2
3
4 = 5
6 = 7
8
9
10 = 11
12 = 13 = 14
15
16 = 17
18 = 19
20
21 = 22 = 23
24
25 = 26
27
28 = 29
30 = 31
32
That leaves 20 out of 32 counts that FAH allows and should cut down your testing.

Thanks, that's helpful! Now that you mention it, I see it doing this in the log. I've already run a few of the "averaging" tests in this configuration and I'll probably include those on the plot, because with averaging it is very clear that this is what is happening in terms of the average PPD (the PPD at 5 and 4 threads is nearly the same, the only difference being due to some work unit variation).

Paragon · Post by **Paragon** » Sat May 23, 2020 10:38 pm

PantherX wrote:My guess would be from Highest PPD to lowest PPD assuming same Project is used in all the tests and the CPU frequency remains the same:
1x32 CPU slot (SMT Enabled)
1x16 CPU slot (SMT Disabled)
2x16 CPU slots (SMT Enabled)

Reason is that folding is physical core hungry (FPU to be exact IIRC) so SMT does provide some improvement but not as much as adding a physical core. With 2 WUs, you would be struggling to share the physical cores between two of them which will cause a slow-down in folding. I would be keen to see the results

Will surely be doing this (or some variation thereof). Also I might look at multiple work units that are thread locked to one CCX (The processor has two CCDs, and each CCD has two CCXs, and each CCX has four real cores). Keeping a job within one CCX (4 threads mapped to four real cores) might have some latency benefits.
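
As a sketch of the thread-locking idea, the snippet below builds an affinity bitmask covering the four real cores of one CCX, using the layout described above (2 CCDs x 2 CCXs x 4 cores). It assumes logical processors are numbered core by core, CCX by CCX, with the two SMT siblings of core n at logical CPUs 2n and 2n+1; that numbering is an assumption and should be checked on the actual system. The function name is hypothetical.

```python
CORES_PER_CCX = 4
CCX_COUNT = 4  # 2 CCDs x 2 CCXs per CCD, as described above


def ccx_affinity_mask(ccx, smt_enabled):
    """Bitmask selecting one logical CPU on each real core of the given CCX."""
    if not 0 <= ccx < CCX_COUNT:
        raise ValueError("ccx must be 0..3")
    first_core = ccx * CORES_PER_CCX
    mask = 0
    for core in range(first_core, first_core + CORES_PER_CCX):
        # With SMT on, assume core n owns logical CPUs 2n and 2n+1; use the first.
        logical = core * 2 if smt_enabled else core
        mask |= 1 << logical
    return mask


for ccx in range(CCX_COUNT):
    print(f"CCX {ccx}: SMT on 0x{ccx_affinity_mask(ccx, True):08X}, "
          f"SMT off 0x{ccx_affinity_mask(ccx, False):04X}")
```

A mask like this is the general form that OS-level affinity settings expect, so four such masks would give four 4-thread jobs that each stay within a single CCX.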
