Ryzen 9 3950x Benchmark Machine: What should I test for you?

Post by **Paragon** » Sat Jun 20, 2020 11:54 am

PantherX wrote:Great write-up and I look forward to the comparison between SMT on and off.

Back when the i7-860 was released, I tested it using the same WUs* to see whether HT on/off would make a difference. I found a 12% to 25% reduction in TPF when going from 4 threads to 8 threads (IIRC).

I have two suggestions:
1) Can you please compare the TPF from the same WU*? IMO, PPD is great, but what I look at is TPF, as it provides a better representation. PPD will vary with internet speed and any server issues (fingers crossed, there won't be any outage).
2) You can capture 2 WUs*, one from a small Project (14677) and one from a large Project (14236), to see how well or badly each project scales. Of course, nothing prevents you from capturing WUs from different atom ranges if you feel like it.
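
To make the TPF comparison concrete, here is a minimal Python sketch of how the percentage reduction could be computed from two TPF readings of the same WU. The helper names and the TPF values are hypothetical placeholders, not measurements from this thread.

```python
def tpf_to_seconds(tpf: str) -> int:
    """Convert a time-per-frame string such as '2:40' or '1:02:40' to seconds."""
    seconds = 0
    for part in tpf.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def tpf_reduction_percent(before: str, after: str) -> float:
    """Percentage drop in TPF from `before` to `after` (positive means faster)."""
    b = tpf_to_seconds(before)
    a = tpf_to_seconds(after)
    return (b - a) / b * 100.0


# Hypothetical readings for the same captured WU at 4 and 8 threads.
print(f"{tpf_reduction_percent('5:00', '4:00'):.1f}% lower TPF")
```

Because TPF is measured on the folding machine itself, this number is unaffected by upload time or server delays, which is the reason given above for preferring it over PPD.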

*Generally speaking, you can't predict what WU your system will get. However, once you get a WU from a Project you like, you can capture it for benchmarking to your heart's content without any hindrance to F@H. The method I used dates from an early V7 release, and I assume that it still works:
1) Once you have spotted the elusive WU you want to capture, pause the CPU slot and copy the entire %AppData%\FAHClient folder (let's call the copy Benchmark)
2) Resume the CPU Slot and set it to Finish
3) After the CPU Slot has finished, disconnect the LAN cable (this is a fail-safe step)
4) Exit the FAHClient
5) Copy the contents of the %AppData%\FAHClient folder again (let's call this copy Live)
6) Delete the contents of %AppData%\FAHClient and copy the contents of Benchmark into it
7) Modify the config.xml file as follows (I have made comments next to each setting that I think might help you achieve consistency with the highest optimization):
Code:
<config>
  <checkpoint v='30'/>                     <!-- Frequent checkpoints slow down CPU processing. Setting this to the max ensures the highest performance level -->
  <core-priority v='low'/>                 <!-- A priority higher than idle ensures that idle-priority processes in Windows don't impact CPU performance -->

  <pause-on-start v='true'/>               <!-- Lets you start the client and ready the system before you pull the trigger -->
  <power v='full'/>                        <!-- Just in case, to ensure that the client doesn't do anything funny -->

  <user v='Benchmarking_In_Progress'/>     <!-- Makes it easy to identify that this is a benchmark and not a real WU -->

  <dump-after-deadline v='false'/>         <!-- You can now fold this WU way past the deadline if you want to -->
  <next-unit-percentage v='100'/>          <!-- Prevents the FAHClient from disturbing the CPU folding at 99% -->

  <slot id='0' type='CPU'>
    <cpus v='4'/>                          <!-- Start with 1 and use _r2w_ben's list of CPU values -->
  </slot>
</config>
8) Once you have finished benchmarking, exit FAHClient, delete the contents of %AppData%\FAHClient and then copy the contents of Live back into it
9) Plug the LAN cable in and then start up FAHClient, which should resume as if nothing ever happened.
Do note that if the WU was downloaded when the CPU slot was set to 16 CPUs, it will not run on any value higher than 16 in the Benchmark phase.

I hope this helps you out and that you can speed up the workflow.
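
The folder shuffling in steps 1, 5, 6 and 8 could be scripted. What follows is only a sketch under stated assumptions: the function names are hypothetical, FAHClient must already be stopped (or the slot paused) when they are called, and nothing here is part of the FAHClient software itself.

```python
import shutil
from pathlib import Path


def snapshot(client_dir: Path, dest: Path) -> None:
    """Copy the whole client data folder to `dest` (steps 1 and 5)."""
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; refusing to overwrite it")
    shutil.copytree(client_dir, dest)


def restore(snapshot_dir: Path, client_dir: Path) -> None:
    """Replace the contents of `client_dir` with a snapshot (steps 6 and 8)."""
    if not snapshot_dir.is_dir():
        raise FileNotFoundError(f"no snapshot at {snapshot_dir}")
    # Empty the client folder first so no stale files survive the swap.
    for entry in client_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(snapshot_dir, client_dir, dirs_exist_ok=True)
```

On Windows the client folder would be something like Path(os.environ["APPDATA"]) / "FAHClient", with Benchmark and Live as two separate snapshot destinations. Editing config.xml (step 7) is left as a manual step.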

This is an excellent suggestion, and it is exactly what I'd like to do. However, I tried it a few times, and on client restart (I relaunch after editing the config file to set the new CPU count), it loads the benchmark work unit with whatever the CPU setting happened to be when that work unit was downloaded. For example, if the work unit was downloaded with 16 threads, it doesn't matter what the cpus setting is in the config file: on relaunch, it folds with 16 threads. I tried this with work units downloaded at 1, 2, 3, and 16 threads. Each time I let the unit finish, closed the client, deleted the FAHClient folder contents, pasted the files from the benchmark folder, edited the config file to specify a different # of CPUs, and relaunched, only to find it running with the previous # of threads!

Any thoughts?

Post by **Paragon** » Sat Jun 20, 2020 11:55 am

I'll add that once the benchmark unit finishes (I let one finish to see what happens, even though the work unit had already been completed before I copied the guts of the folder), the client pulls down a new work unit and solves it with the intended thread count.

Post by **PantherX** » Tue Jun 30, 2020 8:36 pm

Humm... I would have expected that to work, since you're using CPU values less than 16, so there shouldn't be any issues. Can you please post the log file?

Alternatively, you can capture the WU and FahCore_a7 and create a folder called FahCore_Testing. In that folder, place FahCore_a7 and a folder called "Work" containing wudata_01.dat (the captured WU), then start it from the CMD prompt with .\FahCore_a7 -suffix 01 -np X, where X can be any value. I tested that on my system; it took a bit of trial and error to get the directory sorted out, but after that it just runs without any issues and without the client. This could be a more portable way of benchmarking for you.
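
For stepping through several thread counts, the command line above could be generated rather than typed. This is a minimal sketch that only builds the argument list quoted in the post (-suffix and -np); the function name and the list of thread counts are hypothetical, and no other core options are assumed.

```python
from pathlib import Path
from typing import List


def core_command(core_exe: Path, suffix: str, threads: int) -> List[str]:
    """Build the argument list for one standalone core run."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [str(core_exe), "-suffix", suffix, "-np", str(threads)]


# Hypothetical sweep over thread counts for the captured wudata_01.dat.
for n in (1, 2, 4, 8, 16):
    print(" ".join(core_command(Path("FahCore_a7"), "01", n)))
```

Each list could then be passed to subprocess.run with cwd set to the FahCore_Testing folder, one run at a time.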

Post by **MeeLee** » Thu Jul 02, 2020 8:08 am

Do we have some initial PPD values already? Interested to see how much the QRB affects these cpus.

Post by **Paragon** » Sun Aug 02, 2020 2:07 pm

MeeLee wrote:Do we have some initial PPD values already? Interested to see how much the QRB affects these cpus.

Sorry all, I've been busy with the day job and then had a vacation thrown in the mix. But yes, I have results! Still working on the TPF stuff, but here is part 1 (from before) and part 2 (new) of the article. Part 2 has some interesting plots of the work unit variation seen at each thread setting of the CPU. I did 5 tests at each # of threads setting to make sure I'm not getting too thrown off by work unit variation.

Part 1: https://greenfoldingathome.com/2020/05/ ... f-threads/

Part 2: https://greenfoldingathome.com/2020/08/ ... variation/
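
For anyone repeating the five-runs-per-setting approach, a minimal sketch of the averaging step is shown here. The PPD readings are made-up placeholders (not results from the articles), and the function name is hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize(runs: Dict[int, List[float]]) -> Dict[int, Tuple[float, float]]:
    """Map each thread count to (mean PPD, sample standard deviation)."""
    return {threads: (mean(ppd), stdev(ppd)) for threads, ppd in runs.items()}


# Made-up PPD readings, five work units per thread setting.
example_runs = {
    1: [8800.0, 9100.0, 8600.0, 8900.0, 8750.0],
    2: [17000.0, 17800.0, 16500.0, 17300.0, 17100.0],
}

for threads, (avg, spread) in sorted(summarize(example_runs).items()):
    print(f"{threads:>2} threads: {avg:,.0f} PPD (+/- {spread:,.0f})")
```

Reporting the spread next to the mean makes it easier to judge whether a difference between two thread settings is larger than ordinary work unit variation.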

All things considered, the 3950X is a beast! I am currently re-doing all the tests with SMT off to understand the effect of hyperthreading on the results. That will be part 3 of the review. Part 4 will look at the effect of Core Performance Boost (turbo boost) on efficiency and PPD.

Post by **MeeLee** » Sun Aug 02, 2020 3:44 pm

Also make sure you record the average CPU frequency and wattage.
It'll be important for some people when deciding whether SMT on vs off is worth the extra power consumption.
The average CPU frequency might rise with SMT off, and it would be interesting to see how much performance is affected by running half the thread count, with double the cache per thread and a higher boost frequency...

Post by **MeeLee** » Sun Aug 02, 2020 4:17 pm

I forgot to ask:
I saw that you tested with PBO disabled.
The real benefits of disabling SMT come with PBO enabled.
Simply disabling SMT with PBO off means you effectively halve your CPU thread count.
However, with PBO enabled on a motherboard that can boost the CPU to at least 3.8 GHz on all cores (150 W, or an 8-pin CPU plug), in theory it should boost the CPU to 4.1 GHz with SMT disabled.
It's the difference between running a higher core count at a lower frequency and a lower core count at a higher frequency.

Then there's the question of how much performance is affected by RAM.
We know FAH doesn't need very fast RAM (unless you have an IGP using shared RAM).
However, it's still interesting to know what the average PPD is with memory at stock (2133 MHz, XMP disabled) and with XMP enabled.
If you have one of the first batch of Ryzen 9 3900-series CPUs, from when they just came out, your Infinity Fabric speed might max out at 1600 MHz (max RAM 3200 MHz).
If you have a more recent one, released somewhere between half a year and a year ago, the peak Infinity Fabric speed is around 1800 MHz (max RAM 3600 MHz). Any faster RAM would not significantly impact performance.
Some of the newest tested 3900-series CPUs actually have an Infinity Fabric that can run past 1866 MHz (3733 MHz RAM).

The settings are endless, but if you have a way to test the CPU at stock 2133 MHz vs 3600 MHz (make sure the Infinity Fabric is set to half the RAM value, meaning 1800 MHz), that would cover the settings most people run their CPUs at.

To do this, you'll need to disable 'spread spectrum' and, in some cases, set the FSB speed to 100.00 MHz manually, as some motherboards overclock this from the factory by (a fraction of) a MHz.
Normally, motherboard manufacturers should not play with this value, and leaving it on 'auto' might adjust it and skew your results.

To summarize, set values to most common settings:
FSB = 100.00 MHz,
Spread spectrum = OFF

Then record power values in the settings below, with SMT on/off:

- XMP off = 2133 MHz RAM, IF = 1066 MHz
- XMP on = 3600 MHz RAM, IF = 1800 MHz

and:

- PBO = off (CPU frequency = 3.50 GHz)
- PBO = on (CPU frequency depends on the motherboard and needs to be recorded for both XMP off and XMP on).

Those should be 8 charts, and could become more if you want to test these values with different WU counts...
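
The count of 8 follows from three on/off choices (SMT, XMP, PBO). A small sketch that enumerates the combinations is shown here; the variable and function names are hypothetical, and the Infinity Fabric clock is taken as half the RAM figure, rounded down, as in the list above.

```python
from itertools import product

# RAM speeds from the list above; the fabric clock is derived as half of each.
RAM_SPEEDS_MHZ = {"XMP off": 2133, "XMP on": 3600}


def fabric_clock_mhz(ram_mhz: int) -> int:
    """Infinity Fabric clock, taken as half the RAM figure (rounded down)."""
    return ram_mhz // 2


def build_matrix():
    """Return one row per (SMT, XMP, PBO) combination."""
    rows = []
    for smt, (xmp, ram), pbo in product(
        ("SMT on", "SMT off"), RAM_SPEEDS_MHZ.items(), ("PBO off", "PBO on")
    ):
        rows.append((smt, xmp, ram, fabric_clock_mhz(ram), pbo))
    return rows


for row in build_matrix():
    print(row)
```

Adding a fourth axis, such as a list of thread counts, only needs one more argument to product.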

Post by **Paragon** » Sun Aug 02, 2020 8:14 pm

Sounds like a plan. I've already started running with the BIOS-controlled boost on. I'm just going to do it all over again and make two more curves (SMT on and off with boost enabled). Right now, with SMT on, running the 1st of 32 settings (CPU threads = 1), I am seeing around 12,500 PPD vs. the roughly 8,800 PPD it got before with 1 thread. Clocks on the active core are hovering around 4.35 GHz, and CPU temp is around 72 C (as opposed to 55 C with boost off). Power is up to 106 watts (vs 75 watts with boost off).

So that means, for 1 thread, with SMT enabled:

Boost off (3.5 GHz) efficiency = 8840 PPD / 75 watts = 118 PPD/Watt
Boost on (4.35 GHz) efficiency = 12500 PPD / 106 watts = 118 PPD/Watt

So in other words, production is up by about 40% and the efficiency is the same. I wonder if this trend will hold all the way to 32 threads? It'll probably take me about a month to find out...
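
The arithmetic above can be checked with a few lines of Python. This sketch uses only the PPD and wattage figures quoted in this post; the function name is hypothetical.

```python
def efficiency(ppd: float, watts: float) -> float:
    """Folding efficiency in PPD per watt."""
    return ppd / watts


boost_off = efficiency(8840, 75)      # 1 thread, boost off (3.5 GHz)
boost_on = efficiency(12500, 106)     # 1 thread, boost on (about 4.35 GHz)
uplift_percent = (12500 / 8840 - 1) * 100

print(f"Boost off: {boost_off:.0f} PPD/Watt")
print(f"Boost on:  {boost_on:.0f} PPD/Watt")
print(f"PPD uplift: {uplift_percent:.0f}%")
```

Both efficiencies round to 118 PPD/Watt, and the uplift works out to roughly 41%, consistent with the "about 40%" stated above.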

I can definitely run the memory / Infinity Fabric thing as well. I've got it all running linked at 3600 MHz / 1800 MHz (RAM / fabric) at the moment. It should be relatively simple to go back to baseline. I'll save that nugget for the end (will probably do it with whatever thread / boost / SMT setting is the best).

Post by **Paragon** » Mon Aug 03, 2020 12:24 am

Ok, Part 3 of the review is up. This contains the plots of SMT (Hyperthreading) on vs. off for the Ryzen 9 3950x.

https://greenfoldingathome.com/2020/08/ ... threading/

The big takeaway here is that SMT's virtual cores (Hyperthreading) really help folding on this Ryzen processor. Going from 16 threads (all physical cores cranking away) to 32 threads (all cores cranking with 2 threads per core) resulted in a 30% performance improvement. That's pretty significant. Even more interesting was that the efficiency went up as well (PPD/Watt @ the wall).

Finally, I found that running the CPU client with one physical core unloaded (i.e. 15 threads for SMT Off, 30 threads for SMT On) offered noticeably better performance and efficiency than fully maxing the processor out (16 / 32). Has anyone else noticed this? I'd normally say this was just due to work unit variation, but I ran two independent tests with 5 averages per test, and the trend is very clear.

Post by **gunnarre** » Mon Aug 03, 2020 2:24 am

MeeLee wrote: However, with PBO enabled on a motherboard that can boost the CPU to at least 3.8 GHz on all cores (150 W, or an 8-pin CPU plug), in theory it should boost the CPU to 4.1 GHz with SMT disabled.

PBO (Precision Boost Overdrive) = Increasing the power limits to the CPU socket beyond the regular AM4 spec, but still within the settings supplied by the motherboard manufacturer. It might not have a measurable effect on a CPU that has plenty of power headroom (like a Ryzen 5 3600 on a budget B450 board), but might give some more multicore performance where the CPU is close to the nominal power of the socket (like a Ryzen 9 3950x on an X570 board with good VRMs). PBO is not enabled by default.

Performance Boost = Dynamic frequency adjustment of individual cores based on load. Asus calls it "EZ Tuning: Normal mode" or "TPU off", MSI calls it "Cool'n'quiet", I think. It can give much higher single-core (and few-core) frequencies and better efficiency than running all cores at the same frequency. Performance Boost is enabled by default.

We're talking about the second thing here, right, since it's about frequency?

Post by **Paragon** » Tue Aug 04, 2020 1:59 am

gunnarre wrote:
MeeLee wrote: However, with PBO enabled on a motherboard that can boost the CPU to at least 3.8 GHz on all cores (150 W, or an 8-pin CPU plug), in theory it should boost the CPU to 4.1 GHz with SMT disabled.
PBO (Precision Boost Overdrive) = Increasing the power limits to the CPU socket beyond the regular AM4 spec, but still within the settings supplied by the motherboard manufacturer. It might not have a measurable effect on a CPU that has plenty of power headroom (like a Ryzen 5 3600 on a budget B450 board), but might give some more multicore performance where the CPU is close to the nominal power of the socket (like a Ryzen 9 3950x on an X570 board with good VRMs). PBO is not enabled by default.

Performance Boost = Dynamic frequency adjustment of individual cores based on load. Asus calls it "EZ Tuning: Normal mode" or "TPU off", MSI calls it "Cool'n'quiet", I think. It can give much higher single-core (and few-core) frequencies and better efficiency than running all cores at the same frequency. Performance Boost is enabled by default.

We're talking about the second thing here, right, since it's about frequency?

That's correct. I disabled Core Performance Boost (frequency scaling on a per-core basis) to lock the processor at 3.5 GHz for all of my SMT on vs SMT off testing. I just turned it back on to let the clock rate climb and see the effect (rerunning all SMT on vs off tests). Another thing I can do after all of this testing is done is enable PBO on my X570 board, which does have the sweet VRMs and 8 + 4 pin CPU power. I've already played a few games with PBO on and some manual frequency offsets, and was able to sustain a 4.5 GHz all-core Prime95 run (power consumption was nearly twice stock, and even my Noctua dual-tower cooler was hitting 95 C). I expect that if I tried folding like that, the efficiency would be horrible.

Post by **MeeLee** » Tue Aug 04, 2020 2:23 am

It's kind of to be expected that running more cores at a lower frequency is going to be more efficient.
My Atomic Pi has an Atom x5-Z8350 and runs 4 cores at 1.69 GHz.
Pairing 16 of them costs about the same as a Ryzen 9 3900-3950x CPU, is about as fast, and consumes about the same as well.
It's all about core efficiency.

Post by **MeeLee** » Tue Aug 04, 2020 2:37 am

Also,
You mentioned PBO was disabled, right?
By any chance, did you record the CPU frequency (not from Task Manager, but from a program like CPU-Z or something) when enabling SMT?
I know that with PBO, when there's between 50-75% load on the CPU, it cuts down the CPU frequency to save power.
Just wanting to know if the core frequency stays at 3.5 GHz in these moments.
Second thing to note:
If you get a WU while your client is set to 15 cores and you then up the core count, there's a chance that the client needs to finish the WU on those 15 cores and only really uses the remaining cores when a new WU is downloaded.

These are the 2 possible reasons I can think of why 16 cores could be faster than 16-28 threads for folding.
In theory it wouldn't make sense that, at a fixed 3.5 GHz, more cores would be slower than fewer, unless the client keeps using only 15/16 threads until the WU is processed, and the loss of speed is due to moving cache data around between Core Chiplet Dies.

Post by **Neil-B** » Tue Aug 04, 2020 6:05 am

One thing to consider (and I am in no way an expert in this): I believe the way GROMACS works with higher-count slots and the use of PME means that, for instance, on some WUs my 24-thread slot is actually folding as a 20-thread slot with 4 PME threads. I don't quite know what this does for throughput or how this layer of complexity changes the workload on the processor, but it may explain some of the odder benchmarks I have spotted over time.

Post by **Paragon** » Tue Aug 04, 2020 11:15 am

MeeLee wrote: Also,
You mentioned PBO was disabled, right?
By any chance, did you record the CPU frequency (not from Task Manager, but from a program like CPU-Z or something) when enabling SMT?
I know that with PBO, when there's between 50-75% load on the CPU, it cuts down the CPU frequency to save power.
Just wanting to know if the core frequency stays at 3.5 GHz in these moments.
Second thing to note:
If you get a WU while your client is set to 15 cores and you then up the core count, there's a chance that the client needs to finish the WU on those 15 cores and only really uses the remaining cores when a new WU is downloaded.

These are the 2 possible reasons I can think of why 16 cores could be faster than 16-28 threads for folding.
In theory it wouldn't make sense that, at a fixed 3.5 GHz, more cores would be slower than fewer, unless the client keeps using only 15/16 threads until the WU is processed, and the loss of speed is due to moving cache data around between Core Chiplet Dies.

I watched the CPU frequency in AMD's built-in monitoring tool (Ryzen Master) and confirmed it stayed at 3.5 GHz for all tests, SMT on and off. When folding is not running, the frequency does drop down, but it stays up at 3.5 GHz for cores loaded over 75%. So basically any cores with a Folding job were at 3.5 GHz, regardless of SMT setting.

Also for all tests, I don't record results after I switch a thread setting until the next work unit shows up. Most work units don't change the # of threads being used after changing the CPU slot config. So, all results reported are "fresh" work units that were downloaded with the new CPU thread setting.
