avx2 and avx512 support in next core

Post by **Frisa** » Wed Feb 06, 2019 4:13 pm

I don't know whether this is the right board to post on, as I can't find another board about core development.
Following the discussion in this thread (viewtopic.php?f=38&t=31377), I'd like to know whether the next-generation CPU core will be more flexible in its SIMD support, for example a single binary covering all SIMD instruction sets (the SSE and AVX variants). More importantly, will the new core support AVX2 and/or AVX-512?
Thanks

Post by **JimboPalmer** » Wed Feb 06, 2019 7:17 pm

[I am not an authority, this is just my silly, wild assed guess]

It is my belief that core_a7 does support AVX2, as well as SSE2. GROMACS 2018 does support the following new CPU enhancements:

Achieved speedup on Intel KNL processors of around 11% for PME spread/gather on typical simulation systems.

In the simple case of leap-frog without pressure coupling and with at most one temperature-coupling group, the update of velocities and coordinates is now implemented with SIMD intrinsics for improved simulation rate.

AMD Ryzen appears to always perform slightly better with OpenMP than MPI, up to using all 16 threads on the 8-core die.

While Ryzen supports 256-bit AVX2, the internal units are organized to execute either a single 256-bit instruction or two 128-bit SIMD instructions per cycle. Since most of our kernels are slightly less efficient for wider SIMD, this improves performance by roughly 10%.

On AMD Zen, tabulated Ewald kernels are always faster than analytical. And with AVX2_256 2xNN kernels are faster than 4xN. These faster choices are now made based on CpuInfo at run time.

The group-scheme kernels can use AVX instructions from either the AVX_128_FMA or the AVX_256 extensions. But hardware that supports the new AVX2_128 extensions also supports AVX_256, so we enable such support for the group-scheme kernels.

Recent Intel x86 hardware can have multiple AVX-512 FMA units, and the number of those units and the way their use interacts with the way the CPU chooses its clock speed mean that it can be advantageous to avoid using AVX-512 SIMD support in GROMACS if there is only one such unit. Because there is no way to query the hardware to count the number of such units, we run code at CMake and mdrun time to compare the performance from using such units, and recommend the version that is best. This may mean that building GROMACS on the front-end node of the cluster might not suit the compute nodes, even when they are all from the same generation of Intel’s hardware. - http://manual.gromacs.org/documentation ... mance.html
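The timing-based choice described in the last quoted paragraph can be illustrated with a small sketch. This is Python for illustration only, not GROMACS code: `pick_faster` and `_time_once` are hypothetical names, and the candidates are assumed to be plain callables standing in for the same kernel built with different SIMD widths.

```python
import time


def _time_once(fn):
    """Return the wall-clock time of one call to fn, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def pick_faster(candidates, repeats=3):
    """Return the name of the candidate with the lowest best-of-N time.

    candidates: dict mapping a label (hypothetical, e.g. "AVX2_256")
    to a zero-argument callable that runs a representative workload.
    """
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        # Take the minimum over several runs to reduce timing noise.
        elapsed = min(_time_once(fn) for _ in range(repeats))
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

The result is only valid for the machine the measurement ran on, which is why the quoted text warns that a choice made on a front-end node might not suit the compute nodes.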

Notice that some improvements can look at CpuInfo at run time and make good choices, which is easy for F@H. Others have to be fixed at compile time (CMake), so they can make bad choices if executed on any CPU other than the one they were compiled on. F@H will not easily take advantage of those optimizations.
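A minimal sketch of the run-time side of that distinction, in Python for illustration only. It assumes a Linux-style `/proc/cpuinfo` with a `flags` line; the function names are hypothetical, and the level labels are borrowed from the GROMACS SIMD names for readability rather than taken from any F@H code.

```python
# Preference order, widest first. The second item in each pair is the
# flag name as it appears on the "flags" line of Linux /proc/cpuinfo.
SIMD_LEVELS = [
    ("AVX_512", "avx512f"),
    ("AVX2_256", "avx2"),
    ("AVX_256", "avx"),
    ("SSE2", "sse2"),
]


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU flags, or an empty set if unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


def best_simd_level(cpu_flags):
    """Return the widest level whose flag is present, else "None"."""
    for level, flag in SIMD_LEVELS:
        if flag in cpu_flags:
            return level
    return "None"
```

A choice like this can be made every time the program starts. A choice baked in at compile time cannot, which is the case that is hard for a binary distributed to many different CPUs.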

I do not believe F@H changes GROMACS version within a core, but I do think they always use the latest stable version for a new core. (I do not know if any new CPU core is far enough along to have locked which version of GROMACS to use)

Post by **bruce** » Wed Feb 06, 2019 9:01 pm

The FAHCore_a7 that I have uses GROMACS, VERSION 5.0.4, which was the stable version at the time A7 went through development. At that time, SSE2 and AVX were not supported in the same binary, so there are two versions of FAHCore_a7, and the FAHClient selects between them. From what I read on gromacs.org, the various AVX versions would yield very similar performance levels, so developing a new FAHCore_a* to incorporate a later stable version of GROMACS would be a new development effort for only minor improvements.

At some time in the future, a later version of GROMACS may be incorporated into a new FAHCore, but only when it's worth the extra developmental effort -- as opposed to spending that same developmental effort on some other aspect of FAH.

For additional information, consult http://gromacs.org

Post by **toTOW** » Wed Feb 06, 2019 11:10 pm

As I remember the technical explanations, the issue is that GROMACS doesn't handle dynamic selection of optimized code well (at least it didn't when Core A7 was built). So the choice was made to hardcode the instruction set when the core is compiled. Of course, an SSE2 version has been built for older hardware, because newer SSEx versions don't help in FAH and are not supported on all CPUs.

Also, an AVX core was chosen because, when the decision was made, AVX was supported on all CPUs, but newer iterations were not (AVX2 was only on a few CPUs).

If a GROMACS version that is able to dynamically select optimized code without issue is released, I'm pretty sure that a future version of FAHCore will use it ...

Post by **bruce** » Thu Feb 07, 2019 3:49 am

Right. GROMACS VERSION 5.0.4 (single precision) could be compiled for SSE2 or for AVX, but the AVX version could not down-select to SSE, so two versions were compiled and built into two versions of FAHCore_A7. As I remember, it made a significant difference when AVX was added, but support for more than one AVX version made little difference (as I said above).
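The selection between separately compiled builds might look roughly like the sketch below. It is a hedged illustration in Python, not the FAHClient's actual logic: `pick_build`, the level labels and the build names are all hypothetical, and it assumes that a CPU supporting one level also supports every narrower level in the list.

```python
# Widest first. Assumption: a CPU at a given level can also run
# every level that comes after it in this list.
LEVEL_ORDER = ["AVX_512", "AVX2_256", "AVX_256", "SSE2"]


def pick_build(cpu_level, available_builds):
    """Return the widest available build the CPU can run.

    cpu_level: best level the CPU supports, one of LEVEL_ORDER.
    available_builds: dict mapping level -> build name (hypothetical).
    """
    start = LEVEL_ORDER.index(cpu_level)
    for level in LEVEL_ORDER[start:]:
        if level in available_builds:
            return available_builds[level]
    raise LookupError("no compatible build for " + cpu_level)


# Only two builds exist, as with the two versions described above.
builds = {"AVX_256": "core_avx", "SSE2": "core_sse2"}
print(pick_build("AVX2_256", builds))  # prints "core_avx"
```

Because the fallback happens outside the compiled binary, each build can stay hardcoded to a single instruction set, at the cost of shipping one binary per level.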

The original SSE code contained ALC code which would pack two SP data words into a DP register, so that the SSE instructions that operate on both halves of the register could be used, and then unpack the results. My guess is that the same logic would be used for AVX512, whether the packing/unpacking was done by GROMACS code or by AVX* firmware.

Post by **foldy** » Fri Feb 08, 2019 9:50 am

I hope AVX2 and AVX-512 stay disabled if they do not bring a big performance boost, as CPUs get very hot with these units and throttle their clocks down.

Folding Forum