This is what i I think I know.
All Zen chips can do avx_256
Zen 1 and Zen + did avx 128 bits at a time, so it took two instructions to do 256 worth of avx. Zen 2 does 256 bits at a time so takes half as long.
"Like the Zen/Zen+ microarchitecture, the Zen 2 floating point unit utilizes a coprocessor architectural model comprising a dedicated rename unit, a single 4-issue, out-of-order scheduler, a 160-entry physical register file (PRF), and four execution pipelines. The in-order retire queue is shared with the integer unit. The FPU handles x87, MMX, SSE, and AVX instructions. FP loads and stores co-opt the EX unit for address calculations and the LS unit for memory accesses.
In the Zen/Zen+ microarchitecture the floating point physical registers, execution units, and data paths are 128 bits wide. For efficiency AVX-256 instructions which perform the same operation on the 128-bit upper and lower half of a YMM register are decoded into two macro-ops which pass through the FPU individually as execution resources become available and retire together. Accordingly the peak throughput is four SSE/AVX-128 instructions or two AVX-256 instructions per cycle.
Zen 2 doubles the width of the physical registers, execution units, and data paths to 256 bits. The L1 data cache bandwidth was doubled to match. The number of micro-ops issued by the FP scheduler remains four, implying most AVX-256 instructions decode to a single macro-op which conserves queue entries and reduces pressure on RCU and scheduling resources. AMD did not disclose how the FPU was restructured. Die shots suggest two execution blocks splitting the PRF and FP ALUs, one operating on the lower 128 bits of a YMM register, executing x87, MMX, SSE, and AVX instructions, the other on the upper 128 bits for AVX-256 instructions. This improvement doubles the peak throughput of AVX-256 instructions to four per cycle, or in other words, up to 32 FLOPs/cycle in single precision or up to 16 FLOPs/cycle in double precision. Another improvement reduces the latency of double-precision vector multiplications from 4 to 3 cycles, equal to the latency of single-precision multiplications. The latency of fused multiply-add (FMA) instructions remains 5 cycles." -
https://en.wikichip.org/wiki/amd/microa ... Point_Unit
What i do not know is if the version of GROMACS in current use can tell Zen 2 from Zen 1/+. You would get the same answers even if you couldn't, but it might not have the same optimization. I would guess the change log over at GROMACS.org would show when/if that optimization was added.