Detected instruction sets incorrect?

Posted: Wed May 06, 2020 4:34 am
by puuteknikko
I happened to take a look at the files written by FAH and noticed this strange message in md.log:

Code:

Detecting CPU SIMD instructions.
Present hardware specification:
Vendor: AuthenticAMD
Brand:  AMD Ryzen 9 3900X 12-Core Processor            
Family: 23  Model: 113  Stepping:  0
Features: aes apic avx clfsh cmov cx8 cx16 f16c fma htt lahf_lm misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4a sse4.1 sse4.2 ssse3
SIMD instructions most likely to fit this hardware: AVX_128_FMA
SIMD instructions selected at GROMACS compile time: AVX_256


Binary not matching hardware - you might be losing performance.
SIMD instructions most likely to fit this hardware: AVX_128_FMA
SIMD instructions selected at GROMACS compile time: AVX_256
Is it really not detecting the new Zen's AVX unit properly? It is AVX-256.
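
For anyone who wants to check their own md.log for the same mismatch, here is a minimal sketch. The helper name `simd_from_log` is hypothetical (it is not part of FAH or GROMACS), and the two line prefixes are simply copied from the log excerpt above.

```python
# Line prefixes copied verbatim from the md.log excerpt above.
SUGGESTED = "SIMD instructions most likely to fit this hardware:"
COMPILED = "SIMD instructions selected at GROMACS compile time:"


def simd_from_log(text):
    """Return (suggested, compiled) SIMD names found in md.log text.

    Hypothetical helper, not part of FAH or GROMACS. Either value is
    None when the corresponding line is missing from the log.
    """
    found = {SUGGESTED: None, COMPILED: None}
    for line in text.splitlines():
        line = line.strip()
        for prefix in found:
            if line.startswith(prefix):
                # Keep the last occurrence; the log repeats both lines.
                found[prefix] = line[len(prefix):].strip()
    return found[SUGGESTED], found[COMPILED]


sample = """\
SIMD instructions most likely to fit this hardware: AVX_128_FMA
SIMD instructions selected at GROMACS compile time: AVX_256
"""
suggested, compiled = simd_from_log(sample)
print(suggested, compiled, "mismatch" if suggested != compiled else "match")
```

To use it on a real file, pass the file contents, for example `simd_from_log(open("md.log").read())`.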

Re: Detected instruction sets incorrect?

Posted: Wed May 06, 2020 7:33 am
by JimboPalmer
This is what I think I know.
All Zen chips can do AVX_256.
Zen 1 and Zen+ executed AVX 128 bits at a time, so each 256-bit AVX instruction was split into two 128-bit operations. Zen 2 does 256 bits at a time, so it takes half as long.

"Like the Zen/Zen+ microarchitecture, the Zen 2 floating point unit utilizes a coprocessor architectural model comprising a dedicated rename unit, a single 4-issue, out-of-order scheduler, a 160-entry physical register file (PRF), and four execution pipelines. The in-order retire queue is shared with the integer unit. The FPU handles x87, MMX, SSE, and AVX instructions. FP loads and stores co-opt the EX unit for address calculations and the LS unit for memory accesses.

In the Zen/Zen+ microarchitecture the floating point physical registers, execution units, and data paths are 128 bits wide. For efficiency AVX-256 instructions which perform the same operation on the 128-bit upper and lower half of a YMM register are decoded into two macro-ops which pass through the FPU individually as execution resources become available and retire together. Accordingly the peak throughput is four SSE/AVX-128 instructions or two AVX-256 instructions per cycle.

Zen 2 doubles the width of the physical registers, execution units, and data paths to 256 bits. The L1 data cache bandwidth was doubled to match. The number of micro-ops issued by the FP scheduler remains four, implying most AVX-256 instructions decode to a single macro-op which conserves queue entries and reduces pressure on RCU and scheduling resources. AMD did not disclose how the FPU was restructured. Die shots suggest two execution blocks splitting the PRF and FP ALUs, one operating on the lower 128 bits of a YMM register, executing x87, MMX, SSE, and AVX instructions, the other on the upper 128 bits for AVX-256 instructions. This improvement doubles the peak throughput of AVX-256 instructions to four per cycle, or in other words, up to 32 FLOPs/cycle in single precision or up to 16 FLOPs/cycle in double precision. Another improvement reduces the latency of double-precision vector multiplications from 4 to 3 cycles, equal to the latency of single-precision multiplications. The latency of fused multiply-add (FMA) instructions remains 5 cycles." - https://en.wikichip.org/wiki/amd/microa ... Point_Unit
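
The FLOPs figures in the quote can be reproduced with a little arithmetic. This is only a sketch under a simplifying assumption that is mine, not the article's: each AVX-256 instruction counts as one floating-point operation per lane. The instruction rates (two per cycle on Zen/Zen+, four on Zen 2) come from the quoted text; the Zen/Zen+ FLOPs values are derived under that same assumption and are not stated in the quote.

```python
def peak_flops_per_cycle(avx256_per_cycle, element_bits):
    """Peak FLOPs per cycle, assuming one FLOP per lane per instruction.

    The one-FLOP-per-lane rule is a simplifying assumption made for
    this sketch; it happens to match the figures in the quote.
    """
    lanes = 256 // element_bits  # 8 lanes for 32-bit floats, 4 for 64-bit
    return avx256_per_cycle * lanes


# AVX-256 instruction rates per cycle, taken from the quoted text.
for name, rate in (("Zen/Zen+", 2), ("Zen 2", 4)):
    single = peak_flops_per_cycle(rate, 32)
    double = peak_flops_per_cycle(rate, 64)
    print(f"{name}: {single} SP FLOPs/cycle, {double} DP FLOPs/cycle")
```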

What I do not know is whether the version of GROMACS in current use can tell Zen 2 from Zen 1/+. You would get the same answers even if it couldn't, but it might not have the same optimization. I would guess the change log over at GROMACS.org would show when/if that optimization was added.

Re: Detected instruction sets incorrect?

Posted: Wed May 06, 2020 7:49 am
by puuteknikko
http://manual.gromacs.org/documentation ... hlight=zen

Looks like the core is not using a recent enough source to cover Zen 2. There would be quite a nice speed boost right there if it did.

EDIT: or hopefully it's the other way around -- that is just a warning and 256-bit instructions are used nevertheless.

Re: Detected instruction sets incorrect?

Posted: Wed May 06, 2020 8:17 am
by PantherX
I know that currently, FahCore_a7 has two "paths" to choose from:
SSE
AVX_256

The question is whether, if FahCore_a7 is upgraded to support Zen 2, there will be an additional choice:
SSE (for CPUs without AVX support)
AVX_256 (for CPUs with AVX support)
AVX_128_FMA (for Zen 2)

Having multiple versions of FahCore_a7 to support might not be ideal if the gains aren't scientifically justifiable based on the resources available.

Re: Detected instruction sets incorrect?

Posted: Wed May 06, 2020 8:22 am
by JimboPalmer
I believe Core_a7 uses AVX_256 on any Zen.

Changes in opcodes per instruction and cycles per opcode may mean the code is not as fast as it might be if it were tuned specifically for Zen 2.
"Also the non-bonded kernel parameters have been tuned for Zen 2. This has a significant impact on performance."
So far as I know, Core_a7 is not this new; it dates back to 2017.

Re: Detected instruction sets incorrect?

Posted: Wed May 06, 2020 12:13 pm
by _r2w_ben
FAH uses GROMACS 5.0.4, which was released in 2014 and predates Zen. The message about AVX_128_FMA is relevant to the Bulldozer/Piledriver CPUs that were available at the time. For those architectures, AVX_128_FMA > AVX_256 > SSE2.

With Zen 2 the message doesn't appear to be correct. The code probably checks if the CPU supports AVX_128_FMA and outputs the message because that was an accurate test at the time.
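
To illustrate the kind of check I mean, here is a rough sketch. It is NOT the GROMACS source: `suggest_simd` and `binary_warning` are hypothetical names, and the rule "AMD with avx and fma gets AVX_128_FMA" is my guess at a pre-Zen heuristic. The feature list and the warning text are copied from the md.log in the first post.

```python
def suggest_simd(vendor, features):
    """Hypothetical pre-Zen style heuristic; NOT the actual GROMACS code.

    Assumption: an AMD CPU reporting both avx and fma is treated as
    Bulldozer/Piledriver class and is steered to AVX_128_FMA.
    """
    if "avx" in features:
        if vendor == "AuthenticAMD" and "fma" in features:
            return "AVX_128_FMA"
        return "AVX_256"
    if "sse2" in features:
        return "SSE2"
    return "None"


def binary_warning(suggested, compiled):
    """Return the warning text from the log on a mismatch, else None."""
    if suggested != compiled:
        return "Binary not matching hardware - you might be losing performance."
    return None


# Feature flags copied from the Ryzen 9 3900X log in the first post.
zen2_features = set(
    "aes apic avx clfsh cmov cx8 cx16 f16c fma htt lahf_lm misalignsse "
    "mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sse2 "
    "sse3 sse4a sse4.1 sse4.2 ssse3".split()
)

suggested = suggest_simd("AuthenticAMD", zen2_features)
print(suggested, "|", binary_warning(suggested, "AVX_256"))
```

Under that assumed rule a Zen 2 chip gets the Bulldozer-era suggestion, which would explain the message even though AVX_256 is a sensible choice for it.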

A researcher using GROMACS might compile the code on their computer and then run it on a cluster with a different CPU architecture. This message is a friendly reminder to choose compiler flags for optimum performance.

Re: Detected instruction sets incorrect?

Posted: Wed May 06, 2020 3:00 pm
by Joe_H
Basically, the version of Gromacs in use by the A7 core does not include the ability to select SIMD code paths at run time, so a separate core executable is needed for each different architecture. The decision was made to create one that uses SSE2 to support older CPUs that do not have AVX, and a second, generic AVX_256 core to be used by newer processors that support it.
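
In other words, the choice between the two builds boils down to something like the sketch below. The function name `pick_core_variant` is hypothetical and this is not the actual client code, only an illustration of the two-build scheme described above.

```python
def pick_core_variant(features):
    """Pick between the two core builds described above.

    Hypothetical illustration only, not the FAH client's real logic:
    any CPU reporting avx gets the generic AVX_256 build, and
    everything else falls back to the SSE2 build.
    """
    return "AVX_256" if "avx" in features else "SSE2"


print(pick_core_variant({"sse2", "avx", "fma"}))  # newer CPU with AVX
print(pick_core_variant({"sse2", "sse3"}))        # older CPU without AVX
```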

This may mean that on a particular system there is some loss of efficiency, but from tests done at the time it was in the range of a few percent. What was gained was an easier-to-support distribution system with just two different folding cores to be kept in development and synchronized.

I haven't followed the later versions of Gromacs to see if they added the ability to include run time code selection back in.