Does OpenMM take full advantage of rDNA for Navi GPUs?

NoMoreQuarantine · Post by **NoMoreQuarantine** » Tue Apr 14, 2020 5:31 pm

Several people have pointed out in my thread on "Top GPUs for Folding@Home" (viewtopic.php?f=38&t=34240) that the 2060 Super performs better than the 5700 XT, despite the 5700 XT having more FP32 cores. Juggy and Tohya posted benchmarks using the new FAHBench compiled by foldy; their results showed that the 2060 Super outperformed the 5700 XT by 15%, despite the 5700 XT also operating at a higher clock rate. My calculations tell me that the 5700 XT should have outperformed the 2060 Super by 23% at those frequencies. gordonbb and foldinghomealone2 both suggested that the AMD drivers are not as optimized as NVIDIA drivers. While that may be the whole story, another possibility could be that some tasks in OpenMM are using GCN instead of rDNA on Navi GPUs. Any chance that could be the case? Thanks for any information!

JimboPalmer · Post by **JimboPalmer** » Tue Apr 14, 2020 7:43 pm

As the title suggests, OpenMM is open source, you are free to read it yourself.

http://openmm.org/

https://en.wikipedia.org/wiki/RDNA_(microarchitecture) As I read this, much of the performance gain is in the render pipeline. Folding does not render.

OTOH, AMD has invested in 'primitive shaders' (FP16, half precision) and just like Nvidia's FP16 shaders, they do not help F@H as it needs more precision. In both cases they are idle.

By some definition, F@H does not use rDNA, it uses OpenCL. If AMD's OpenCL code improves, then F@H will speed up without any change on its side.

foldy · Post by **foldy** » Tue Apr 14, 2020 7:51 pm

The 5700 XT's lower-than-expected performance may be the result of a low atom count work unit. It needs to be retested using FAHBench and a high atom count work unit.

NoMoreQuarantine · Post by **NoMoreQuarantine** » Tue Apr 14, 2020 8:11 pm

JimboPalmer wrote:As the title suggests, OpenMM is open source, you are free to read it yourself.

http://openmm.org/

https://en.wikipedia.org/wiki/RDNA_(microarchitecture) As I read this, much of the performance gain is in the render pipeline. Folding does not render.

OTOH, AMD has invested in 'primitive shaders' (FP16, half precision) and just like Nvidia's FP16 shaders, they do not help F@H as it needs more precision. In both cases they are idle.

I was hoping there might be someone familiar with OpenMM here. I have been looking at it, but I am not a computer scientist, much less one trained for GPU optimization. Navi can compute using either GCN or rDNA, not just for rendering. GCN can perform an FP16 operation in 1 clock cycle, an FP32 op in 2 clock cycles, and an FP64 op in 4 clock cycles. rDNA changes the design of their execution units so they can perform an FP16 op in one cycle in GCN mode, an FP32 op in one clock cycle in Wave32 mode, or an FP64 op in one clock cycle in Wave64 mode.
https://www.amd.com/system/files/docume ... epaper.pdf

Joe_H · Post by **Joe_H** » Tue Apr 14, 2020 8:27 pm

Not sure of the details, but I do know they had to update code in OpenMM to support the RDNA based cards. Before that they were not usable for F@h. The GPU folding core that supported them was released for beta testing in late December, and not released to full use until the end of January.

NoMoreQuarantine · Post by **NoMoreQuarantine** » Tue Apr 14, 2020 8:39 pm

foldy wrote:The 5700 XT's lower-than-expected performance may be the result of a low atom count work unit. It needs to be retested using FAHBench and a high atom count work unit.

They both ran a simulation of 64614 atoms. Maybe that wasn't large enough to fully load the processors? If that turns out to be the case, then we'll want to recommend longer test times when benchmarking with FAHBench.

foldinghomealone2 · Post by **foldinghomealone2** » Tue Apr 14, 2020 9:28 pm

NoMoreQuarantine wrote:They both ran a simulation of 64614 atoms. Maybe that wasn't large enough to fully load the processors? If that turns out to be the case, then we'll want to recommend longer test times when benchmarking with FAHBench.

Testing as long as you get results you want to see?
And it's not about the time (although that matters to see the effect of the cooling solution) but about atom count.

Better go to AMD and complain about their crappy OpenCL-implementation

NoMoreQuarantine · Post by **NoMoreQuarantine** » Tue Apr 14, 2020 9:39 pm

foldinghomealone2 wrote:Testing as long as you get results you want to see?

Better go to AMD and complain about their crappy OpenCL-implementation

People simulate large atoms for FAH. If the atom size at 1 minute is too small to give an accurate representation of performance for FAH, then it's not very helpful to the people trying to benchmark. That said, I doubt that is the issue. I also doubt it's the OpenCL implementation.

bruce · Post by **bruce** » Tue Apr 14, 2020 10:03 pm

The issue of protein atom-count is prominent in discussions of the NVidia GPUs, too. For GPUs with large numbers of shaders, their performance also drops for small proteins but is acceptable for large proteins. As far as I know, the OpenCL 1.2 API and the FAHCore itself being used are identical.

Back to nV: I consider the 2060 and above examples of the same issue and apparently we're talking about the same order of shader counts. For those GPUs which support half precision, it's also a wasted feature for them.

If you post the project number(s) that you're testing and the specific Navi model, I'll ask around but I don't have the equipment to be able to personally compare results.

As was mentioned earlier, it is probably best to ignore the rDNA benchmarks. Comparing the FP32 FLOPS plus a small percentage of FP64 FLOPS is a reasonable approximation for FAH benchmarks, but of course the actual benchmark is better than that sort of approximation.
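
A minimal Python sketch of that kind of approximation follows. The function name, the 5% FP64 weight, and the example throughput figures are all hypothetical placeholders chosen for illustration; nobody in this thread has stated what the actual percentage should be.

```python
def approx_fah_score(fp32_tflops: float, fp64_tflops: float,
                     fp64_weight: float = 0.05) -> float:
    """Rough throughput estimate: FP32 FLOPS plus a small share of FP64 FLOPS.

    fp64_weight is an assumed value, not a figure published by FAH or OpenMM.
    """
    if not 0.0 <= fp64_weight <= 1.0:
        raise ValueError("fp64_weight must be between 0 and 1")
    return fp32_tflops + fp64_weight * fp64_tflops


# Hypothetical card: 10.0 FP32 TFLOPS and 0.5 FP64 TFLOPS (made-up numbers).
example_score = approx_fah_score(10.0, 0.5)
print(f"approximate score: {example_score:.3f} TFLOPS-equivalent")
```

An estimate like this only ranks cards on paper; it cannot capture driver quality or the atom-count effects discussed in this thread.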

JimboPalmer · Post by **JimboPalmer** » Tue Apr 14, 2020 10:05 pm

NoMoreQuarantine wrote: rDNA changes the design of their execution units so they can perform an FP16 op in one cycle in GCN mode, an FP32 op in one clock cycle in Wave32 mode, or an FP64 op in one clock cycle in Wave64 mode.

I also doubt it's the OpenCL implementation.

If the OpenCL implementation is correctly setting all these modes, you may be right. But it would be interesting to watch a trace to see how often they are in the 'wrong' mode and what the overhead of swapping modes is.

We can see that the OpenMM programmer was tuning at the level he controls but it would not surprise me if AMD is staying in GCN mode more often than is optimal. "we've always done it that way" is never a good excuse.

The comments about Very Long Instruction Word refer to the even older Terascale2 and 3 GPUs.

NoMoreQuarantine · Post by **NoMoreQuarantine** » Tue Apr 14, 2020 10:25 pm

bruce wrote:If you post the project number(s) that you're testing and the specific Navi model, I'll ask around but I don't have the equipment to be able to personally compare results.

I don't own a Navi GPU, so we would have to get someone else to assist.

bruce wrote:As was mentioned earlier, it is probably best to ignore the rDNA benchmarks. Comparing the FP32 FLOPS plus a small percentage of FP64 FLOPS is a reasonable approximation for FAH benchmarks, but of course the actual benchmark is better than that sort of approximation.

The 5700 XT that was tested should have 10.5 FP32 TFLOPS and the 2060 Super should have 8.1 FP32 TFLOPS at the frequencies posted. The reason I made this thread was that I was making a list of the current generation of AMD & NVIDIA GPU specs and found that discrepancy when comparing actual performance.
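
For reference, theoretical FP32 figures like these are usually derived as 2 FLOPs (one fused multiply-add) per shader per clock. The sketch below is an illustration under that assumption, not the calculation used in the post: the shader counts are the public spec-sheet values (2560 for the 5700 XT, 2176 for the 2060 Super), and the helper names are hypothetical. It works backwards from the quoted TFLOPS to the clock each figure would imply, since the posted frequencies are not repeated in this thread.

```python
RX_5700_XT_SHADERS = 2560      # stream processors (public spec sheet)
RTX_2060_SUPER_SHADERS = 2176  # CUDA cores (public spec sheet)


def fp32_tflops(shader_count: int, clock_mhz: float) -> float:
    """Theoretical FP32 throughput, assuming 2 FLOPs per shader per clock."""
    return 2 * shader_count * clock_mhz * 1e6 / 1e12


def clock_mhz_for(shader_count: int, tflops: float) -> float:
    """Clock (MHz) implied by a theoretical FP32 TFLOPS figure."""
    return tflops * 1e12 / (2 * shader_count) / 1e6


for name, shaders, tflops in [
    ("5700 XT", RX_5700_XT_SHADERS, 10.5),
    ("2060 Super", RTX_2060_SUPER_SHADERS, 8.1),
]:
    print(f"{name}: {tflops} TFLOPS implies about "
          f"{clock_mhz_for(shaders, tflops):.0f} MHz")
```

These are peak numbers; they say nothing about how well a given OpenCL kernel keeps the shaders busy.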

bruce · Post by **bruce** » Tue Apr 14, 2020 10:34 pm

Understood but the small protein problem does reduce the performance measurably below the large protein performance.
(Which is another way of saying that GPU folding performance is NOT linear.)

Somebody who knows the internals of OpenMM may have some cogent comments.

foldinghomealone2 · Post by **foldinghomealone2** » Wed Apr 15, 2020 1:02 am

NoMoreQuarantine wrote:
foldinghomealone2 wrote:Testing as long as you get results you want to see?

Better go to AMD and complain about their crappy OpenCL-implementation
People simulate large atoms for FAH. If the atom size at 1 minute is too small to give an accurate representation of performance for FAH, then it's not very helpful to the people trying to benchmark. That said, I doubt that is the issue. I also doubt it's the OpenCL implementation.

Believe what you want to believe. That'll make your theoretical approaches not a bit better.

But you don't have to believe me, you can test it yourself.
FahBench's run length has nothing to do with atom count. It's the same WU with 64k atoms. You just run the benchmark longer if you increase the time.
With increased times you can test your system when it's heated through and see if it's stable then.

And why should a higher atom count be 'better'?
It should only be higher if all current WUs have more atoms to see realistic results.
But atom counts differ from project to project. Like currently released projects p14549 (28k atoms) and p14415 (290k atoms).

I don't know what the best number of atoms to bench would be.
Maybe 64k resembles a good average of current projects, maybe it should be higher.

But just saying it should be higher because you think the 5700XT would score higher is the wrong approach.
And what about all the 'slower' GPUs that can't handle many more atoms well? I guess then the score would under-represent their value to folding.

To be on the safe side it would be necessary to run several benchmarks with low atom count, average/medium atom count and high atom count.
And run 1min tests and 15min tests (to even out starting conditions and to reflect a real-world folding scenario with 'hot' GPUs)
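
The test matrix described above could be laid out like this. It is only a planning sketch: it does not call FAHBench, and the low/medium/high labels are an assumption that maps onto atom counts already mentioned in this thread (p14549 at about 28k, the 64614-atom FAHBench WU, and p14415 at about 290k).

```python
from itertools import product

# Atom counts taken from figures quoted in this thread; the labels are assumptions.
ATOM_COUNTS = {"low": 28_000, "medium": 64_614, "high": 290_000}
RUN_MINUTES = (1, 15)  # short run vs. heat-soaked run


def benchmark_plan() -> list[dict]:
    """Return every (atom count, run length) combination to benchmark."""
    return [
        {"size": label, "atoms": atoms, "minutes": minutes}
        for (label, atoms), minutes in product(ATOM_COUNTS.items(), RUN_MINUTES)
    ]


for run in benchmark_plan():
    print(f"{run['size']:>6}: {run['atoms']:>7} atoms for {run['minutes']} min")
```

That gives six runs per GPU, which keeps atom count and run length as separate variables instead of mixing the two.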

NoMoreQuarantine · Post by **NoMoreQuarantine** » Wed Apr 15, 2020 2:35 am

foldinghomealone2 wrote:Believe what you want to believe. That'll make your theoretical approaches not a bit better.

Did I say something to offend you at some point? I think you've been insulting my effort since the first time I saw your username.

foldinghomealone2 wrote:But you don't have to believe me, you can test it yourself.
FahBench's run length has nothing to do with atom count. It's the same WU with 64k atoms. You just run the benchmark longer if you increase the time.
With increased times you can test your system when it's heated through and see if it's stable then.

Yep, you have to change the WU to get a different atom count. I wasn't thinking when I wrote about run length.

foldinghomealone2 wrote:And why should a higher atom count be 'better'?
It should only be higher if all current WUs have more atoms to see realistic results.
But atom counts differ from project to project. Like currently released projects p14549 (28k atoms) and p14415 (290k atoms).

I don't think it would be better, but foldy proposed that a low atom count may be the reason for the performance difference.

foldinghomealone2 wrote:I don't know what the best number of atoms to bench would be.
Maybe 64k resembles a good average of current projects, maybe it should be higher.

We'd have to look at what kind of distribution the projects have. Likely all over the place.

foldinghomealone2 wrote:But just saying it should be higher because you think the 5700XT would score higher is the wrong approach.
And what about all the 'slower' GPUs that can't handle many more atoms well? I guess then the score would under-represent their value to folding.

I'm glad nobody said that then. Good point with the slower GPUs. I don't know, it could accurately represent their value, depends on how FAH distributes WUs to slower GPUs.

foldinghomealone2 wrote:To be on the safe side it would be necessary to run several benchmarks with low atom count, average/medium atom count and high atom count.
And run 1min tests and 15min tests (to even out starting conditions and to reflect a real-world folding scenario with 'hot' GPUs)

Sounds reasonable.

PantherX · Post by **PantherX** » Wed Apr 15, 2020 3:05 am

NoMoreQuarantine wrote:...We'd have to look at what kind of distribution the projects have. Likely all over the place...

The smallest GPU Project (14321) right now has 13,252 atoms
The largest GPU Project (14416) right now has 307,167 atoms
https://apps.foldingathome.org/psummary

If/when new GPU Projects are released, the above values may change.
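
As a rough sanity check on where the 64614-atom FAHBench WU sits in that range, one option is to compare it with the geometric midpoint of the two extremes. This is a sketch under a strong assumption: it uses only the smallest and largest projects listed above and says nothing about how project sizes are actually distributed. The function name is hypothetical.

```python
import math

SMALLEST_ATOMS = 13_252   # Project 14321, from the psummary figures above
LARGEST_ATOMS = 307_167   # Project 14416, from the psummary figures above
FAHBENCH_ATOMS = 64_614   # WU size used in the benchmarks discussed here


def geometric_midpoint(low: int, high: int) -> float:
    """Midpoint of a range on a logarithmic scale."""
    return math.sqrt(low * high)


midpoint = geometric_midpoint(SMALLEST_ATOMS, LARGEST_ATOMS)
print(f"log-scale midpoint: {midpoint:,.0f} atoms")
print(f"FAHBench WU / midpoint: {FAHBENCH_ATOMS / midpoint:.2f}")
```

On a log scale the midpoint comes out at roughly 64k atoms, so the benchmark WU is near the middle of the current range, though a real answer would need the full project list.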
