With A Few Lines Of Code, AMD's Nice Performance Optimization For Linux 5.20

    Phoronix: With A Few Lines Of Code, AMD's Nice Performance Optimization For Linux 5.20

    A patch from AMD to further tune the Linux kernel's scheduler around NUMA imbalancing has been queued up and slated for introduction in Linux 5.20. For some workloads this scheduler tuning can help significantly on AMD Zen-based systems, and it may help on Intel Xeon servers too...

  • #2
    Is this strictly related to multi-socket systems (>= 2 sockets), or can single-socket desktop systems with a multi-LLC CPU (Zen 5950X, 5900X, 3950X, 3900X) also benefit from this?

    • #3
      Originally posted by bezirg
      Is this strictly related to multi-socket systems (>= 2 sockets), or can single-socket desktop systems with a multi-LLC CPU (Zen 5950X, 5900X, 3950X, 3900X) also benefit from this?
      The way I read it: any CPU with multiple chiplets. I believe some of the lower-end Ryzen CPUs have only a single chiplet.

      • #4
        Originally posted by Vorpal

        The way I read it: any CPU with multiple chiplets. I believe some of the lower-end Ryzen CPUs have only a single chiplet.
        If this is true, could Michael add a plain desktop Zen 3 5950X to the benchmarks to see the impact? That would interest far more Phoronix readers.

        • #5
          Originally posted by bezirg
          Is this strictly related to multi-socket systems (>= 2 sockets), or can single-socket desktop systems with a multi-LLC CPU (Zen 5950X, 5900X, 3950X, 3900X) also benefit from this?
          As I understand it, it won't help Ryzen 3000 and up, as such CPUs don't have any NUMA domains internally. All RAM accesses go through the I/O die and are therefore equally distant from all chiplets. Threadripper 1000 and 2000 worked as NUMA: each die in the package had its own RAM channels, so although data could be transferred between the dies, access to a die's local channels was faster. However, whether this helps TR 1000 and 2000 depends on how current scheduling treats them, which I don't know.
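
Whether a given machine exposes more than one NUMA node to the kernel can be checked rather than guessed. The following is a minimal sketch, assuming a Linux system with the usual sysfs layout under /sys/devices/system/node. The helper names parse_cpu_list and numa_nodes are made up for illustration and are not part of any library. A single-socket desktop Ryzen would normally be expected to report only node0.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Expand a kernel cpulist string such as '0-3,8,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]} read from sysfs; empty dict if nothing is found."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        node_id = int(re.search(r"node(\d+)$", path).group(1))
        with open(os.path.join(path, "cpulist")) as f:
            nodes[node_id] = parse_cpu_list(f.read())
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node{node_id}: {len(cpus)} CPUs -> {cpus}")
```

If this prints a single node, the kernel sees the machine as one NUMA domain, whatever the chiplet count.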

          • #6
            Originally posted by Vorpal

            The way I read it: any CPU with multiple chiplets. I believe some of the lower-end Ryzen CPUs have only a single chiplet.
            Yeah. On our EPYC servers I see a big boost from enabling NUMA emulation per chiplet/CCX.

            • #7
              Important to remember this is for the fair scheduler. Most people will be using schedutil

              • #8
                Originally posted by scottishduck
                Important to remember this is for the fair scheduler. Most people will be using schedutil
                Completely different things... This isn't a CPU frequency driver / governor optimization.
                Michael Larabel
                https://www.michaellarabel.com/

                • #9
                  Originally posted by ET3D

                  As I understand it, it won't help Ryzen 3000 and up, as such CPUs don't have any NUMA domains internally. All RAM accesses go through the I/O die and are therefore equally distant from all chiplets. Threadripper 1000 and 2000 worked as NUMA: each die in the package had its own RAM channels, so although data could be transferred between the dies, access to a die's local channels was faster. However, whether this helps TR 1000 and 2000 depends on how current scheduling treats them, which I don't know.

                  What ?!?
                  How is this about RAM access?
                  This is about cache access!

                  In the Ryzen 5000 series, the CPUs up to the Ryzen 5800X use only 1 chiplet with 8 cores and 32 MB of L3 cache and thus won't benefit,
                  but the Ryzen 5900X and 5950X use 2 chiplets, with 12 or 16 cores respectively and 2 x 32 MB of L3 cache.

                  You will absolutely see a NUMA optimization benefit on the 5900X and 5950X, since a process can be pegged to the chiplet that has its data cached in L3.
                  There is a significant latency penalty for accessing the L3 cache of a different chiplet, so you always want the process running on the chiplet whose cache holds its data.

                  The bandwidth of the Infinity Fabric is also limited, so it is better to avoid crossing it when you can.
                  Simply keep the process on the same chiplet as its cached data.
                  Last edited by Degra; 13 June 2022, 09:50 AM.
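
The idea of keeping a process on the chiplet that holds its cached data can be tried by hand with CPU affinity. This is manual pinning, a different technique from the scheduler change in the patch, and is only a sketch under stated assumptions: a Linux system, and that cache/index3 in sysfs is the L3 cache (usual on x86, but the neighbouring level file is the authoritative check). The helper names l3_domains, read_l3_shared_lists and pin_to_l3_domain are hypothetical.

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a kernel cpulist string such as '0-7,16-23' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if part:
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def l3_domains(shared_lists):
    """Collapse per-CPU 'shared_cpu_list' strings into the distinct L3 groups."""
    groups = {frozenset(parse_cpu_list(s)) for s in shared_lists}
    return sorted((sorted(g) for g in groups if g), key=lambda g: g[0])


def read_l3_shared_lists(
    pattern="/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list",
):
    """Read one shared_cpu_list string per CPU (assumes index3 is the L3 cache)."""
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists


def pin_to_l3_domain(domain_index, pid=0):
    """Restrict pid (0 = calling process) to the CPUs of one L3 domain. Linux only."""
    domains = l3_domains(read_l3_shared_lists())
    os.sched_setaffinity(pid, domains[domain_index])
    return domains[domain_index]


if __name__ == "__main__":
    for i, cpus in enumerate(l3_domains(read_l3_shared_lists())):
        print(f"L3 domain {i}: CPUs {cpus}")
```

Calling pin_to_l3_domain(0) would restrict the calling process to the first L3 group, and child processes started afterwards inherit that affinity. Whether this helps depends on the workload, since pinning also takes the other chiplet's cores away from the process.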

                  • #10
                    Originally posted by Degra


                    What ?!?
                    How is this about RAM access?
                    This is about cache access!

                    In the Ryzen 5000 series, the CPUs up to the Ryzen 5800X use only 1 chiplet with 8 cores and 32 MB of L3 cache and thus won't benefit,
                    but the Ryzen 5900X and 5950X use 2 chiplets, with 12 or 16 cores respectively and 2 x 32 MB of L3 cache.

                    You will absolutely see a NUMA optimization benefit on the 5900X and 5950X, since a process can be pegged to the chiplet that has its data cached in L3.
                    There is a significant latency penalty for accessing the L3 cache of a different chiplet, so you always want the process running on the chiplet whose cache holds its data.

                    The bandwidth of the Infinity Fabric is also limited, so it is better to avoid crossing it when you can.
                    Simply keep the process on the same chiplet as its cached data.
                    True, that's why RPCS3 (PS3 emulator) tends to perform significantly worse on these multi-CCX Ryzens & Threadrippers, because this software actually makes proper use of all available cores & threads while doing real-time 3D rendering, unlike most "modern" AAA games.
