
SNC/NPS Tuning For Ryzen Threadripper 7000 Series To Further Boost Performance


  • SNC/NPS Tuning For Ryzen Threadripper 7000 Series To Further Boost Performance

    Phoronix: SNC/NPS Tuning For Ryzen Threadripper 7000 Series To Further Boost Performance

    The AMD Ryzen Threadripper 7000 series offers great performance out-of-the-box for Linux desktop/workstation users, as shown in my Ryzen Threadripper 7970X and 7980X benchmarks along with the Threadripper PRO 7995WX. While it is a more common tunable on the EPYC side, the Threadripper 7000 series can also benefit from Nodes Per Socket (NPS) / Sub-NUMA Clustering (SNC) tuning to enhance the performance of some workloads. This article looks at dozens of benchmarks and the performance impact of SNC2/SNC4 adjustments for the Zen 4 Threadripper.


  • #2
    Michael

    Thanks for running these benchmarks. This matches my experience on the Zen 3 Threadripper Pro part that I use at work. NPS4 is a good option if doing a lot of compilation workloads.
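
    For anyone trying this at home: after changing the NPS setting in firmware, it can help to confirm how many NUMA nodes the kernel actually sees. Below is a minimal sketch that reads the standard Linux sysfs layout under /sys/devices/system/node. The helper names parse_cpulist and read_numa_nodes are made up for illustration; they are not part of any tool mentioned in the article.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel cpulist string such as '0-3,8-11' into CPU numbers."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_nodes(base: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map NUMA node id -> CPU numbers; returns {} if the path is absent."""
    nodes: dict[int, list[int]] = {}
    root = Path(base)
    if not root.is_dir():
        return nodes
    for entry in root.glob("node[0-9]*"):
        cpulist = entry / "cpulist"
        if cpulist.is_file():
            # Directory names look like 'node0', 'node1', ...
            nodes[int(entry.name[4:])] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(read_numa_nodes().items()):
        print(f"node {node}: {len(cpus)} CPUs")
```

    With NPS1 you would expect a single node, and with NPS4 four nodes, each listing roughly a quarter of the CPUs.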



    • #3
      Thank you Michael, very useful information!
      It would be interesting to look at the geomean.
      While NPS2 never wins in these benchmarks (beyond the margin of error), it looks like it's often closer to the winner than to the slowest setting and thus would make a good default.



      • #4
        What's the downside(s) of enabling this? Seems like there isn't a consistent pattern between latency and throughput
        Last edited by Kjell; 01 December 2023, 09:45 AM.



        • #5
          Interesting how in the past week, there are now two examples (including this article) of TR showing performance deficits for reasons that I hypothesized here. But what do I know, I was just spouting random nonsense!



          • #6
            Originally posted by schmidtbag View Post
            Interesting how in the past week, there are now two examples (including this article) of TR showing performance deficits for reasons that I hypothesized here. But what do I know, I was just spouting random nonsense!
            You give yourself too much credit. A hypothesis makes a testable prediction, based on a sound theory.

            Furthermore, I did not say your ideas were nonsense, just that they were random and not supported by a specific fact pattern. If you list enough ideas, you're liable to get lucky and have the right answer among them. That's obvious, and not terribly helpful.

            Last I checked, checking multiple boxes to a multiple-choice question on a school exam is still wrong. If you go to a doctor with a symptom and you'd like to receive a remedy, it doesn't help if they simply give you back a long list of possible causes and send you on your way. You want them to narrow it down quickly and precisely, based on a sound methodology, so that you can receive an effective treatment based on the right diagnosis.

            Similarly, the reason people are interested in scalability problems is usually because they want to address the bottleneck, if possible. That's why it's important to have the correct diagnosis, and not just a litany of excuses.

            In your rush to pat yourself on the back, you seem not to have noticed some key things differentiating this test from the previous one we discussed:
            1. This used a different storage configuration, consisting of a RAID-0 of Samsung datacenter drives, whereas the previous test used a single client SSD.
            2. This test involves a 96-core Threadripper Pro, whereas the previous test involved non-Pro, 32-core and 64-core Threadrippers. In particular, I was focused on scaling between the 16-core 7950X and the 32-core 7970X, where it's doubtful the latter would be affected much by the NPS scalability issues seen here.
            Last edited by coder; 01 December 2023, 02:05 PM.



            • #7
              Originally posted by Kjell View Post
              What's the downside(s) of enabling this? Seems like there isn't a consistent pattern between latency and throughput
              My understanding is that NPS=2 gives you two domains of quad-channel memory, while NPS=4 gives you four domains of dual-channel memory. The benefit should be greater parallelism and less contention by splitting the domains.

              The main drawback would be that each domain has less peak bandwidth available. So, if your workload isn't evenly balanced across the NUMA domains, then you could actually see a performance regression with higher NPS numbers. I expect this is the reason there's typically not much gained by going from NPS=2 to NPS=4.

              schmidtbag, this last point is relevant to the non-Pro Threadripper discussion, as those CPUs start out with a quad-channel memory configuration more similar to what NPS=2 delivers here.
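
              To put rough numbers on the trade-off described above, here is a back-of-the-envelope sketch. It assumes 8 memory channels in total (which is what two quad-channel or four dual-channel domains add up to) and DDR5-5200 with 8 bytes per transfer per channel. The memory speed is my assumption, not something taken from the article, and these are theoretical peaks, not measured figures.

```python
def channel_peak_gbs(mt_per_s: int = 5200, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak of one memory channel in GB/s (assumed DDR5-5200)."""
    return mt_per_s * bytes_per_transfer / 1000.0


def domain_layout(total_channels: int, nps: int, per_channel_gbs: float) -> tuple[int, float]:
    """Return (channels per NUMA domain, peak local GB/s per domain)."""
    if nps <= 0 or total_channels % nps != 0:
        raise ValueError("channels must divide evenly across NUMA domains")
    channels = total_channels // nps
    return channels, channels * per_channel_gbs


if __name__ == "__main__":
    per_channel = channel_peak_gbs()
    for nps in (1, 2, 4):
        channels, peak = domain_layout(8, nps, per_channel)
        print(f"NPS={nps}: {nps} domain(s) x {channels} channels, "
              f"~{peak:.1f} GB/s local peak per domain")
```

              The total stays the same in every mode; what shrinks is the local peak any single domain can offer, which is why a workload that piles onto one domain can regress at higher NPS values.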



              • #8
                Originally posted by coder View Post
                A hypothesis makes a testable prediction, based on a sound theory.
                Agreed - my hypotheses were based on previous similar instances. They were testable, as these two articles have shown. It isn't conclusive but it narrows down the possible causes. Complex issues require you to rule out various possibilities.
                Furthermore, I did not say your ideas were nonsense, just that they were random and not supported by a specific fact pattern. If you list enough ideas, you're liable to get lucky and have the right answer among them. That's obvious, and not terribly helpful.
                It was from a specific pattern but because my example was from a previous architecture, you thought that was worth dismissing.
                Not everyone is aware of what historically has caused issues with such many-core CPUs. The intelligent thing to do is to brainstorm all possibilities and prioritize which ones to test that are either more likely to give conclusive data or are cheap to test.
                Last I checked, checking multiple boxes to a multiple-choice question on a school exam is still wrong. If you go to a doctor with a symptom and you'd like to receive a remedy, it doesn't help if they simply give you back a long list of possible causes and send you on your way. You want them to narrow it down quickly and precisely, based on a sound methodology, so that you can receive an effective treatment based on the right diagnosis.
                Last I checked, this isn't a school exam. It is apparent you don't know how a complex problem is diagnosed: when you see a doctor for common negative symptoms, the doctor runs many different tests to narrow down the possible causes. If you exhibit "flu-like symptoms", the cause could be anything from bacteria, fungi, autoimmune disorders, radiation poisoning, chemical poisoning, dehydration, or malnutrition, etc. You don't just do a blood test and think that's enough; you test for many things. Threadripper 7000 shows performance discrepancies in ways that aren't necessarily linked to a single problem. The fact you thought testing for I/O bottlenecks was good enough shows rather poor deductive skills on your part. It's absolutely worthwhile to test, but as these last two articles suggest, it's not the only problem.
                1. This used a different storage configuration, consisting of a RAID-0 of Samsung datacenter drives, whereas the previous test used a single client SSD.
                And not all of these tests are disk heavy, so what's your point?
                2. This test involves a 96-core ThreadRipper Pro, whereas the previous test involved non-Pro, 32-core and 64-core ThreadRippers. In particular, I was focused on scaling between the 16-core 7950X and the 32-core 7970X, where it's doubtful the latter would be affected much by NPS scalability issues seen here.
                Uh... if scaling is the issue, then increasing the core count ought to exacerbate the symptoms, so.... why would increasing the core count invalidate anything? Nobody is saying NPS is the key to the scaling issues, it's just one of many things that may be contributing toward it. Besides, results like this are bound to scale to a 7970X:

                It's just one example, but the point you continue to not understand is that there doesn't appear to be a single cause to the symptoms.



                • #9
                  Originally posted by schmidtbag View Post
                  Agreed - my hypotheses were based on previous similar instances. They were testable, as these two articles have shown.
                  Not at all. First, you made no predictions that could be tested to disprove a hypothesis. Second, the two previous articles you cite involve a different CPU with a different memory configuration and different storage than the original article. The test involving multiple operating systems introduces a multitude of variables that are neither accounted for nor controlled. If these were experiments to test any theory, their design would be considered atrocious.

                  Originally posted by schmidtbag View Post
                  It was from a specific pattern but because my example was from a previous architecture, you thought that was worth dismissing.
                  I didn't say it was wrong, just that you hadn't done the work to show that it applies. You didn't even bother to quote or cite any specific part of that review you linked. Because of that, it seemed more like a misdirection tactic than an actual attempt at an explanation.

                  Originally posted by schmidtbag View Post
                  Last I checked, this isn't a school exam.
                  You're missing the point. I'll repeat:

                  "the reason people are interested in scalability problems is usually because they want to address the bottleneck, if possible. That's why it's important to have the correct diagnosis, and not just a litany of excuses."


                  It's clear you're just here to snark and not to solve any real problems. If you don't want to be treated like a clown, stop acting like one.

                  Originally posted by schmidtbag View Post
                  It is apparent you don't know how a complex problem is diagnosed
                  I've probably diagnosed more complex systems issues than you've even thought about.

                  Originally posted by schmidtbag View Post
                  The fact you thought testing for I/O bottlenecks was good enough shows rather poor deductive skills on your part.
                  You've repeatedly mischaracterized my statements and positions. This shows you're not posting in good faith and just want to "win" arguments at any cost (even if you're wrong).

                  I never made a conclusive diagnosis. I merely highlighted a possibility that seemed to fit the data and was fairly easy to eliminate. If that didn't resolve the scaling problems, we could move on down the list.

                  Originally posted by schmidtbag View Post
                  And not all of these tests are disk heavy, so what's your point?
                  Many/most of the tests in the 7970X + 7980X review that did exhibit good scaling weren't I/O heavy. That supported my conjecture that I/O could be the dominant bottleneck in the compilation benchmarks.

                  Originally posted by schmidtbag View Post
                  Uh... if scaling is the issue, then increasing the core count ought to exacerbate the symptoms, so.... why would increasing the core count invalidate anything?
                  It affects too many variables. Base clocks, memory configuration, number of CCDs, etc. Did you even notice that the entire chassis, motherboard, and BIOS are different? Not to mention storage.

                  Again, your idea of experiment design is absolutely dismal.

                  Originally posted by schmidtbag View Post
                  Nobody is saying NPS is the key to the scaling issues, it's just one of many things that may be contributing toward it.
                  Wow, so you're still stuck in an "excuses" mindset.

                  Originally posted by schmidtbag View Post
                  Besides, results like this are bound to scale to a 7970X:
                  What does that have to do with compilation benchmarks? What does it have to do with anything I said?

                  What sets apart the compilation benchmarks is that they didn't scale as well as most of the other benchmarks in that suite. The key question is why? If you were solutions-focused, you'd have engaged with that question. Instead, you just went into ego-defense mode and spazzed out.
                  Last edited by coder; 01 December 2023, 03:30 PM.



                  • #10
                    I think the obvious question is what impact do these settings have on Windows 11 performance? Would Michael's shootout between Linux distros and Win 11 have had different results?

                    We will probably never know.

                    What I do know is that the people who write the open source software are definitely not seasoned Windows programmers. They are Linux programmers, they write their code with the primary objective of having it run on Linux, and then they cross-compile it so that it can run on Windows.

                    But as anyone who has ever read through any Windows-specific NUMA documentation and then looked through the code of various open source projects can tell, open source programmers do not look at any Windows documentation, at least not normally.

