Benchmarking OpenMandriva 4.0 Alpha - The First Linux OS With An AMD Zen Optimized Build


  • Benchmarking OpenMandriva 4.0 Alpha - The First Linux OS With An AMD Zen Optimized Build

    Phoronix: Benchmarking OpenMandriva 4.0 Alpha - The First Linux OS With An AMD Zen Optimized Build

    Christmas Eve marked the long-awaited release of the OpenMandriva Lx 4.0 Alpha, and with that new version of the Mandrake/Mandriva-derived operating system came an AMD Zen "Znver1" optimized Linux build. Of course that caught my interest, and I was quickly downloading this first Linux distribution to ship AMD Ryzen/EPYC optimized binaries to see how it compares to its generic x86_64 installation.


  • #2
    Recently I got into building the kernel myself. I know how to remove/add drivers and features but have never tried these build parameters. Let's say I downloaded the source and got the .config file set up correctly, and I'm ready to hit make: how do I add the "Znver1" optimization? Would appreciate any input :-)



    • #3
      Results are inconsistent... One would expect the worst case to be the same performance as the generic build, but here it is hit or miss; on average the znver1 optimizations nullify the potential gains. Is this another GCC quirk?



      • #4
        Originally posted by Anty View Post
        Results are inconsistent... One would expect the worst case to be the same performance as the generic build, but here it is hit or miss; on average the znver1 optimizations nullify the potential gains. Is this another GCC quirk?

        These are surprising results, but this set of benchmarks does underline why simply blanket-applying optimizations (e.g. -march=native and such) can actually create performance regressions rather than improvements, or no change either way.

        Without deep dives into what's going on in the code paths, all we can do is speculate about what's causing the regressions. But for some real examples of such things in the past, you can look at the AVX-512 history in Intel CPUs. The generation in which AVX-512 was introduced only ran those instructions at half the processor speed (for example, a 3 GHz CPU could only run AVX-512 instructions at 1.5 GHz), so any time that code path was taken there was the potential that you were slowing things down versus taking other potential instruction paths instead. There are other gotchas that can be hit with overly aggressive optimizations, like unwanted code mangling and the like.

        For practical matters though: don't just assume aggressive optimizations are faster even if they seem so. Test, test, test, using empirical results.
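
        To make that concrete, here is a minimal sketch (mine, not from the article) of checking one workload empirically instead of assuming the tuned build wins; bench.c and the output names are placeholders:
        Code:
                    # Build the same source twice: generic x86-64 baseline vs. Zen-tuned.
                    gcc -O2 -march=x86-64 -o bench-generic bench.c
                    gcc -O2 -march=znver1 -o bench-znver1 bench.c

                    # Run each several times and compare the reported counters and wall-clock times.
                    perf stat -r 5 ./bench-generic
                    perf stat -r 5 ./bench-znver1
        Only if the znver1 binary wins consistently, and on the workloads you actually care about, is the flag worth keeping.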



        • #5
          Originally posted by jojo7887 View Post
          Recently I got into building the kernel myself. I know how to remove/add drivers and features but have never tried these build parameters. Let's say I downloaded the source and got the .config file set up correctly, and I'm ready to hit make: how do I add the "Znver1" optimization? Would appreciate any input :-)
          I recommend using a ramdisk to speed up the process: sudo mount -t tmpfs -o size=2048m tmpfs /mnt (then copy your kernel source and config file to /mnt/linux-***)

          Search for HOSTCFLAGS / HOSTCXXFLAGS / HOSTLDFLAGS in the top-level Makefile and adjust them as follows (apart from the optimization flags, I recommend keeping the original flags):
          HOSTCFLAGS = -mtune=native -march=native -O3 -fasynchronous-unwind-tables -feliminate-unused-debug-types -ftree-loop-distribution -floop-nest-optimize -fgraphite-identity -floop-parallelize-all -std=gnu89 (if you are brave you could try gnu11, which usually works nowadays)
          HOSTCXXFLAGS = -mtune=native -march=native -O3 -fasynchronous-unwind-tables -feliminate-unused-debug-types -ftree-loop-distribution -floop-nest-optimize -fgraphite-identity -floop-parallelize-all
          HOSTLDFLAGS = -O3 -mtune=native -march=native -fgraphite-identity -floop-nest-optimize -Wl,--as-needed -fopenmp

          Then start the compilation by typing, from within the kernel directory: sudo make -j12 rpm-pkg (-j12 is for a 6-core CPU; use deb-pkg instead if you are on Debian/Ubuntu). This will produce three packages; where they end up depends on your distro: on openSUSE Tumbleweed they are in /usr/src/packages/X86_64/, on Ubuntu they are in the same directory as, or one above, the top-level directory.

          Make sure that you have all the build dependencies installed.
          Last edited by ms178; 26 December 2018, 11:03 AM.
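
          Worth noting (this is not in the post above, just a common alternative): the HOST*FLAGS variables apply to the host-side helper tools built during the kernel build, while the kernel code itself picks up extra flags from the KCFLAGS variable, which the top-level Makefile appends to KBUILD_CFLAGS. A minimal sketch, assuming a compiler new enough to know znver1:
          Code:
                      # Pass Zen-tuned codegen flags to the kernel build without editing the Makefile.
                      # Whether -march=znver1 actually helps a given kernel is something to measure, not assume.
                      make -j12 KCFLAGS="-march=znver1" rpm-pkg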



          • #6
            Originally posted by stormcrow View Post


            These are surprising results, but this set of benchmarks does underline why simply blanket-applying optimizations (e.g. -march=native and such) can actually create performance regressions rather than improvements, or no change either way.

            Without deep dives into what's going on in the code paths, all we can do is speculate about what's causing the regressions. But for some real examples of such things in the past, you can look at the AVX-512 history in Intel CPUs. The generation in which AVX-512 was introduced only ran those instructions at half the processor speed (for example, a 3 GHz CPU could only run AVX-512 instructions at 1.5 GHz), so any time that code path was taken there was the potential that you were slowing things down versus taking other potential instruction paths instead. There are other gotchas that can be hit with overly aggressive optimizations, like unwanted code mangling and the like.

            For practical matters though: don't just assume aggressive optimizations are faster even if they seem so. Test, test, test, using empirical results.
            I sometimes long for the old days when I knew the exact number of CPU cycles a piece of code would consume just by looking at the mnemonics (M68K and MOS6510). These days the inner workings of the CPU are so complex that even other programs running at the same time on the same caches make your code behave differently.

            It also sometimes feels like either the CPU or the compilers today optimize better for "worse" code, and that obvious optimizations are not always that obvious at all.

            For instance, I once created a source code generator for a particular protocol used in my industry (the layout of the protocol is defined in XML, so it's known beforehand exactly what the on-the-wire protocol will be for each specific implementation). One of the features of this protocol is a cache (called a dictionary in the protocol lingo) used to avoid sending unchanged data on the wire. I noticed that for specific combinations in the specification certain circumstances could not happen in a well-behaved implementation, so I made an obvious optimization in those cases:

            The original generated code:
            Code:
                        if ((*pmap & 32) == 32) {
                            msg->MsgSeqNum = 0;
            
                            do {
                                msg->MsgSeqNum = (msg->MsgSeqNum << 7) | (*message & 127);
                            } while ((*message++ & 128) == 0 && message < endptr);
            
                            ctx->dictionary[0].type = DICT_ASSIGNED;
                            ctx->dictionary[0].u_integer = msg->MsgSeqNum;
                        } else {
                            if (ctx->dictionary[0].type == DICT_ASSIGNED) {
                                msg->MsgSeqNum = ++ctx->dictionary[0].u_integer;
                            } else if (ctx->dictionary[0].type == DICT_UNDEFINED) {
                                msg->MsgSeqNum = ctx->dictionary[0].u_integer = 0;
                                ctx->dictionary[0].type = DICT_ASSIGNED;
                            }
            
                        }
            The obvious optimized code (since in this case the operator used in the XML template specifies that the cache/dictionary cannot be in an undefined state when the bit in pmap is not set):
            Code:
                        if ((*pmap & 32) == 32) {
                            msg->MsgSeqNum = 0;
            
                            do {
                                msg->MsgSeqNum = (msg->MsgSeqNum << 7) | (*message & 127);
                            } while ((*message++ & 128) == 0 && message < endptr);
            
                            ctx->decoder_dict[0].u_integer = msg->MsgSeqNum;
                        } else {
                            msg->MsgSeqNum = ++ctx->decoder_dict[0].u_integer;
                        }
            As one can see, the optimized version removes a lot of conditionals from the common code path (the common path here being that the bit in pmap is not set). Yet the optimized version is a tiny, tiny bit slower, for some reason, in every benchmark I've run on it.
            Last edited by F.Ultra; 26 December 2018, 12:10 PM.
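
            A hedged way to start digging into a result like that (not something mentioned in the post; the binary and input names below are made up) is to compare the branch behaviour of the two variants directly:
            Code:
                        # Compare branch counts and mispredictions of the two generated decoders on the same input.
                        # If the "optimized" else-branch is mispredicted often, that alone can outweigh the removed conditionals.
                        perf stat -e instructions,branches,branch-misses ./decoder_original < sample_messages.bin
                        perf stat -e instructions,branches,branch-misses ./decoder_optimized < sample_messages.bin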



            • #7
              Originally posted by Anty View Post
              Results are inconsistent... One would expect the worst case to be the same performance as the generic build, but here it is hit or miss; on average the znver1 optimizations nullify the potential gains. Is this another GCC quirk?
              Inconsistent, yes; a surprise, no.

              First, this is release one. It will take a while to implement all the optimizations.

              Second, GCC might not be the best compiler to build an optimized distro with. It would be interesting to see more testing here.

              This is all very interesting, but I suspect performance testing will become far more interesting when software starts to take C++17 features into account, especially with Intel's latest software donation.



              • #8
                Originally posted by F.Ultra View Post

                I sometimes long for the old days when I knew the exact number of CPU cycles a piece of code would consume just by looking at the mnemonics (M68K and MOS6510). [...]

                As one can see, the optimized version removes a lot of conditionals from the common code path (the common path here being that the bit in pmap is not set). Yet the optimized version is a tiny, tiny bit slower, for some reason, in every benchmark I've run on it.
                That's a very good example, thank you for sharing. Have you tried the code with different compilers to see whether they all produce slower results for the second version, or perhaps on different architectures (ARMv6/7 vs. x86, for example), merely for curiosity's sake? Conventional wisdom suggests that removing the conditionals should speed up execution; at least that's what one prof used to harp on back in the day. It seems that's not strictly the case these days with some modern processors and compilers.



                • #9
                  Well, that was a letdown.



                  • #10
                    This is an alpha release -- obviously it isn't as fine-tuned as it could be. (Hint: Volunteers wanted -- join us in #openmandriva-cooker on freenode or on http://forum.openmandriva.org/)

                    For the initial build, we're essentially relying on clang and gcc to do their job, along with making sure some instructions that aren't in generic x86_64 (SSE*, 3dnow, AVX2, ...) are used in applications that have asm implementations of relevant code. (And also making sure Intel specific drivers don't get built into the znver1 kernel, but that's more about saving space than about performance).
                    There's obviously more that can be done (and will be done in the future).
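
                    As a rough illustration of the gap being targeted (this is not a command from the OpenMandriva team, just one way to inspect it with GCC in a bash shell), you can diff the target features enabled for the generic x86-64 baseline against those for znver1:
                    Code:
                                # List the ISA features GCC turns on for znver1 but not for baseline x86-64
                                # (AVX2, BMI1/2, SHA, AES, FMA and friends show up on the znver1 side).
                                diff <(gcc -march=x86-64 -Q --help=target) \
                                     <(gcc -march=znver1 -Q --help=target) | grep -i enabled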

