Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids

  • #1

    Phoronix: Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids

    One of the most exciting features of Intel's 4th Gen Xeon Scalable "Sapphire Rapids" processors is the introduction of Advanced Matrix Extensions (AMX). The Intel AMX ISA extensions are intended for speeding up AI and machine learning workloads. This article looks at machine learning performance on the Xeon Platinum 8490H processors with AMX toggled on and off.
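    For those wanting to try it, here is a minimal sketch of exercising the AMX path from PyTorch, assuming a build with the oneDNN backend (which dispatches to AMX tile instructions on Sapphire Rapids); the matrix sizes are arbitrary:

    Code:
    import torch

    # On Linux, AMX support shows up as the "amx_tile" / "amx_bf16" /
    # "amx_int8" flags in /proc/cpuinfo.
    with open("/proc/cpuinfo") as f:
        has_amx = "amx_tile" in f.read()
    print("AMX tile support:", has_amx)

    a = torch.randn(1024, 1024)
    b = torch.randn(1024, 1024)

    # bfloat16 autocast on the CPU is the usual way to hit the AMX-BF16 path
    # through oneDNN; without AMX this falls back to AVX-512/AVX2 kernels.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        c = a @ b
    print(c.dtype)  # torch.bfloat16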


  • #2
    Cool, I guess. But the numbers won't look as good if you also compare to running the same problem on the GPU, which most "serious" ML research does, especially on servers!

    Maybe it could be useful for inference at the edge, but there you won't have server processors; you'll mostly have a mix of some mobile x86 and a lot of ARM, and maybe some IoT RISC-V these days too. None of which have AMX.

    So what is the actual use case?
    And where are the GPU comparisons?
    Last edited by Vorpal; 16 January 2023, 02:39 PM. Reason: Fix typo



    • #3
      Originally posted by Vorpal View Post
      So what is the actual use case?
      It's likely they finally reached the SIMD width where proper vector and matrix instructions, like RISC-V's RVV, are actually cheaper than the added decoder-pathway complexity.



      • #4
        Originally posted by Vorpal View Post
        And where are the GPU comparisons?
        That's because I don't have review samples of any of the professional cards to test...
        Michael Larabel
        https://www.michaellarabel.com/



        • #5
          Originally posted by Michael View Post

          That's because I don't have review samples of any of the professional cards to test...
          ML benchmarks should run "out of the box" on regular RTX cards, or even on AMD cards with ROCm PyTorch builds.

          These don't look like they will overflow the VRAM pool.
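          For what it's worth, ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda API, so a benchmark script needs no card-specific changes. A minimal sketch, with the device pick being an assumption about whatever the test box has:

          Code:
          import torch

          # ROCm builds of PyTorch reuse the torch.cuda namespace for AMD
          # GPUs, so the same script covers RTX and Radeon cards alike.
          device = "cuda" if torch.cuda.is_available() else "cpu"
          x = torch.randn(32, 3, 224, 224, device=device)
          print(torch.cuda.get_device_name(0) if device == "cuda" else "CPU fallback")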


          Originally posted by Vorpal View Post
          Cool, I guess. But the numbers won't look as good if you also compare to running the same problem on the GPU, which most "serious" ML research does, especially on servers!
          Some pre/post processing or utility scripts run on the CPU, and sometimes users run the less expensive bits of a GPU model on CPU to save VRAM.

          There are even some work-in-progress frameworks that do this "automatically" on existing codebases, like Microsoft DeepSpeed and ColossalAI.

          DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective (https://github.com/microsoft/DeepSpeed).
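          As a rough illustration of that CPU-offload idea, here is a sketch of a ZeRO-style DeepSpeed setup that parks optimizer state on the host; the config keys follow the project's documentation, and the tiny Linear model is just a stand-in:

          Code:
          import torch
          import deepspeed

          model = torch.nn.Linear(4096, 4096)  # stand-in for a real network

          # ZeRO stage-2 config with optimizer state offloaded to CPU memory,
          # so the host cores share the load with the GPU.
          ds_config = {
              "train_micro_batch_size_per_gpu": 8,
              "bf16": {"enabled": True},
              "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
              "zero_optimization": {
                  "stage": 2,
                  "offload_optimizer": {"device": "cpu"},
              },
          }

          engine, optimizer, _, _ = deepspeed.initialize(
              model=model,
              model_parameters=model.parameters(),
              config=ds_config,
          )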




          Facebook is kinda infamous for historically running models too big to fit on GPUs, hence Intel has basically been making custom Facebook SKUs like Cooper Lake for years.

          In other cases something simple and/or infrequent like face detection is just not worth buying a GPU instance for if the CPU will get it done.
          Last edited by brucethemoose; 16 January 2023, 03:45 PM.



          • #6
            Originally posted by brucethemoose View Post

            ML benchmarks should run "out of the box" on regular RTX cards, or even on AMD cards with ROCm PyTorch builds.

            These don't look like they will overflow the VRAM pool.
            Right, but I'll obviously get criticism for comparing a $17k processor against a couple-hundred-dollar consumer card, which ultimately is likely not a very relevant comparison in practice...
            Michael Larabel
            https://www.michaellarabel.com/



            • #7
              They would be relevant, I guess, if they outperformed the CPU, which is certainly possible, even on consumer cards.



              • #8
                Originally posted by Michael View Post

                Right, but I'll obviously get criticism for comparing a $17k processor against a couple-hundred-dollar consumer card, which ultimately is likely not a very relevant comparison in practice...
                TBH an RTX 2080 (the equivalent of the still very common Nvidia Tesla T4) or an RTX 3060 will probably smoke Sapphire Rapids in these benchmarks.

                In the CUDA ML world, there is basically zero difference between a high-end RTX gaming card and the Quadro/server cards other than VRAM size. They perform the same, barring the top-end HBM cards like the A100, which have no desktop equivalent, and they are the same compilation target on the software side of things. So maybe readers will complain, but those complaints will just be nonsense.
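                One way to sanity-check the "same compilation target" point, assuming a CUDA build of PyTorch: a GeForce card reports the same compute capability as its datacenter sibling of the same generation.

                Code:
                import torch

                # GeForce cards and their workstation/server siblings of the
                # same generation report the same compute capability, i.e. the
                # same CUDA compilation target (RTX 2080 and Tesla T4 are both
                # sm_75).
                if torch.cuda.is_available():
                    major, minor = torch.cuda.get_device_capability(0)
                    print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")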



                What I would really be interested in is a "hybrid" benchmark with ColossalAI or DeepSpeed, where some GPU is doing the heavy lifting and the CPU is handling everything that doesn't fit into VRAM. Other than the aforementioned light-workload scenario, this is where Sapphire Rapids could really shine over EPYC as an ML server host. But I realize this is a tall ask.
                Last edited by brucethemoose; 16 January 2023, 04:01 PM.



                • #9
                  I am still in the dark about who these instructions are targeted at. AI/ML workloads run much better on a dedicated AI accelerator or GPU, and if you are in the market for one of these chips, a low-end compute engine of some sort would also be in consideration. It would probably make more sense to just put an ML accelerator unit on the chip, plus instructions to move data over to it more efficiently. Then again, that approach would be better for consumer chips, not enterprise. It is really surprising that neither AMD nor Intel has released chips with this in mind, as it has been done for years on ARM-based chips.

                  It would be nice to see more AI/ML that benefits end users and doesn't run in the cloud. Most of what I see is media processing, image categorization (at least on Apple), and some work in content creation (Photoshop).



                  • #10
                    Originally posted by jeisom View Post
                    I am still in the dark about who these instructions are targeted at. AI/ML workloads run much better on a dedicated AI accelerator or GPU, and if you are in the market for one of these chips, a low-end compute engine of some sort would also be in consideration. It would probably make more sense to just put an ML accelerator unit on the chip, plus instructions to move data over to it more efficiently. Then again, that approach would be better for consumer chips, not enterprise. It is really surprising that neither AMD nor Intel has released chips with this in mind, as it has been done for years on ARM-based chips.

                    It would be nice to see more AI/ML that benefits end users and doesn't run in the cloud. Most of what I see is media processing, image categorization (at least on Apple), and some work in content creation (Photoshop).
                    See the above posts, but basically it's for hilariously large models like Facebook runs, or for light-use instances where a dedicated PCIe accelerator just isn't worth the extra cost but a lesser CPU doesn't quite cut the mustard. AMX/OpenVINO is also extremely easy to "switch on" in PyTorch and such, which is not the case for some other accelerators.
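                    On the "easy to switch on" point, here is a sketch using Intel's extension for PyTorch; the ipex.optimize call is taken from that project's documentation rather than from the article's test setup:

                    Code:
                    import torch
                    import intel_extension_for_pytorch as ipex

                    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
                    model.eval()

                    # One call prepares the model for bf16 inference; on
                    # Sapphire Rapids the oneDNN kernels underneath dispatch
                    # matmuls to the AMX tile units.
                    model = ipex.optimize(model, dtype=torch.bfloat16)

                    with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
                        out = model(torch.randn(8, 1024))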

                    Centaur made an x86 CPU that is precisely what you are describing, and it was excellent on paper, but it never caught on: https://fuse.wikichip.org/news/3256/...s-an-ai-punch/

                    Intel laptop CPUs have had proprietary AI accelerators for years, but they probably need beefier, less niche designs for more "general" use. And AMD is now shipping an AI accelerator in their laptop chips, but it too needs some software enablement. The only one that "just works" right now is Apple CoreML, and it's actually quite good.



                    And yeah, the content-creation AI is coming like a tidal wave. There is already a brewing controversy over text-to-image and the accompanying Photoshop/GIMP/Krita plugins; it's just not mainstream yet because it's so finicky to set up (largely thanks to Nvidia :/).
                    Last edited by brucethemoose; 16 January 2023, 05:32 PM.

