AMD Zen 4 Tuning Patches Begin Landing In GCC 13

Written by Michael Larabel in AMD on 22 December 2022 at 08:43 AM EST.

Following the basic AMD Zen 4 "znver4" target enablement that was merged for the GCC 13 compiler in October, patches providing tuned support have begun merging for this next GNU Compiler Collection release.

As noted in prior Phoronix articles, the initial Znver4 enablement in GCC 13 flipped on the new instructions supported by the Ryzen 7000 series and EPYC 9004 series processors but copied over the existing tuning from Zen 3. Earlier this month a SUSE engineer began working on a proper Zen 4 cost table and tuning, given how these processors' characteristics differ from Zen 3. It's those patches from SUSE that have been merged into GCC 13 since yesterday.

Jan Hubicka's patch updating the compiler's instruction cost tables for Znver4 has landed:

"Update cost of znver4 mostly based on data measured by Agner Fog. Compared to previous generations x87 became bit slower which is probably not big deal (and we have minimal benchmarking coverage for it). One interesting improvement is reduction of FMA cost. I also updated costs of AVX256 loads/stores based on latencies (not throughput which is twice of avx256). Overall AVX512 vectorization seems to improve noticeably some of TSVC benchmarks but since internally 512 vectors are split to 256 vectors it is somewhat risky and does not win in SPEC scores (mostly by regressing benchmarks with loop that have small trip count like x264 and exchange), so for now I am going to set AVX256_OPTIMAL tune but I am still playing with it. We improved since ZNVER1 on choosing vectorization size and also have vectorized prologues/epilogues so it may be possible to make avx512 small win overall."
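
To illustrate the small-trip-count concern, here is a minimal C sketch of the kind of short loop the commit message has in mind. The function name `sad_row` and the 16-element row are assumptions made for illustration, and the code is not taken from x264. With 8-bit elements, a 512-bit vector would hold 64 of them and a 256-bit vector 32, so a loop that runs only 16 times cannot complete even one full-width iteration at either size and ends up in prologue/epilogue code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical kernel in the style of a video encoder's
   sum-of-absolute-differences routine; not taken from x264. */
unsigned sad_row(const uint8_t *a, const uint8_t *b, int n)
{
    assert(n >= 0);
    unsigned sum = 0;
    /* When callers pass a small n (for example 16), the trip count is
       shorter than one full 256-bit or 512-bit vector of bytes. */
    for (int i = 0; i < n; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}
```

On a GCC build with znver4 support, one way to compare the two choices would be to compile such a kernel with `-O3 -march=znver4` and again with `-mprefer-vector-width=512` added, then inspect the generated assembly. What the vectorizer actually emits depends on the compiler version and its cost model.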

And then a second set of tuning for Zen 4 has also been merged:

"Adds tunes needed for zen4 microarchitecture. I added two new knobs. TARGET_AVX512_SPLIT_REGS which is used to specify that internally 512 vectors are split to 256 vectors. This affects vectorization costs and reassociation width. It probably should also affect RTX costs however I doubt it is very useful since RTL optimizers are usually not judging between 256 and 512 vectors.

I also added X86_TUNE_AVOID_256FMA_CHAINS. Since fma has improved in zen4 this flag may not be a win except for very specific benchmarks. I am still doing some more detailed testing here.

Otherwise I disabled gathers on zen4 for 2 parts and 4 parts. We can open code them and since the latencies has only increased since zen3 opencoding is better than actual instruction. This shows at 4 tsvc benchmarks.

I ended up setting AVX256_OPTIMAL. This is a compromise. There are some tsvc benchmarks that increase noticeably (up to 250%) however there are also few regressions. Most of these can be solved by increasing vec_perm cost in the vectorizer. However this does not cure about 14% regression on x264 that is quite important. Here we produce vectorized loops for avx512 that probably would be faster if the loops in question had high enough iteration count. We hit this problem with avx256 too: since the loop iterates few times, only prologues/epilogues are used. Adding another round of prologue/epilogue code does not make it better.

Finally I enabled avx stores for constant sized memcpy and memset. I am not sure why this is an opt-in feature. I think for most hardware this is a win."
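
For readers unfamiliar with what "open coding" a gather means, the following C sketch shows the idea under stated assumptions. The function names are hypothetical. `gather_loop` is the indexed-load pattern that a vectorizer can implement either with a hardware gather instruction (such as `vgatherdpd` on x86) or with separate loads, and `gather4_open_coded` spells out by hand what the open-coded form of one 4-element gather amounts to: four independent scalar loads.

```c
#include <assert.h>
#include <stddef.h>

/* Indexed-load loop: each output element is fetched through an index.
   A vectorizer may use a gather instruction here or open-code the loads. */
void gather_loop(double *out, const double *table, const int *idx, size_t n)
{
    assert(out != NULL && table != NULL && idx != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = table[idx[i]];
}

/* Hand-written illustration of an open-coded 4-element gather:
   four plain scalar loads instead of one gather instruction. */
void gather4_open_coded(double out[4], const double *table, const int idx[4])
{
    out[0] = table[idx[0]];
    out[1] = table[idx[1]];
    out[2] = table[idx[2]];
    out[3] = table[idx[3]];
}
```

Both functions produce the same values. The tuning change only affects which instruction sequence the compiler picks for loops like the first one, and per the commit message the separate loads come out ahead on Zen 4 for the 2-element and 4-element cases.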

These patches with the initial AMD Zen 4 tuning will be part of the GCC 13.1 stable compiler release that should be out in March or April. We'll see if any further Znver4 optimizations are readied for GCC 13. AMD meanwhile is offering the AOCC 4.0 compiler for those wanting a production-ready Zen 4 optimized compiler right now.

I'll be working on some new GCC Git benchmarks on Zen 4 over Christmas.
