Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids

  • #1

    Phoronix: Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids

    One of the most exciting features of Intel's 4th Gen Xeon Scalable "Sapphire Rapids" processors is the introduction of Advanced Matrix Extensions (AMX). The Intel AMX ISA extensions are intended for speeding up AI and machine learning workloads. This article looks at machine learning performance on the Xeon Platinum 8490H processors with AMX toggled on and off.
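    For those wanting to try it, here is a minimal sketch of exercising the AMX path from PyTorch, assuming a build with the oneDNN backend (which dispatches to AMX tile instructions on Sapphire Rapids); the matrix sizes are arbitrary:

    Code:
    import torch

    # On Linux, AMX support shows up as the "amx_tile" / "amx_bf16" /
    # "amx_int8" flags in /proc/cpuinfo.
    with open("/proc/cpuinfo") as f:
        has_amx = "amx_tile" in f.read()
    print("AMX tile support:", has_amx)

    a = torch.randn(1024, 1024)
    b = torch.randn(1024, 1024)

    # bfloat16 autocast on the CPU is the usual way to hit the AMX-BF16 path
    # through oneDNN; without AMX this falls back to AVX-512/AVX2 kernels.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        c = a @ b
    print(c.dtype)  # torch.bfloat16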


  • #2
    Cool, I guess. But the numbers won't look as good if you also compare to running the same problem on the GPU, which most "serious" ML research does, especially on servers!

    Maybe it could be useful for inference at the edge, but there you won't have server processors; you'll mostly have a mix of some mobile x86 and a lot of ARM, and maybe some IoT RISC-V these days too. None of which have AMX.

    So what is the actual use case?
    And where are the GPU comparisons?
    Last edited by Vorpal; 16 January 2023, 02:39 PM. Reason: Fix typo



    • #3
      Originally posted by Vorpal View Post
      So what is the actual use case?
      It's likely they finally reached the SIMD width where proper vector and matrix instructions, like RISC-V's RVV, are actually cheaper than the added decoder-pathway complexity.



      • #4
        Originally posted by Vorpal View Post
        And where are the GPU comparisons?
        That's because I don't have review samples of any of the professional cards to test...
        Michael Larabel
        https://www.michaellarabel.com/



        • #5
          Originally posted by Michael View Post

          That's because I don't have review samples of any of the professional cards to test...
          ML benchmarks should run "out of the box" on regular RTX cards, or even on AMD cards with ROCm PyTorch builds.

          These don't look like they will overflow the VRAM pool.
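          For what it's worth, ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda API, so a benchmark script needs no card-specific changes. A minimal sketch, with the device pick being an assumption about whatever the test box has:

          Code:
          import torch

          # ROCm builds of PyTorch reuse the torch.cuda namespace for AMD
          # GPUs, so the same script covers RTX and Radeon cards alike.
          device = "cuda" if torch.cuda.is_available() else "cpu"
          x = torch.randn(32, 3, 224, 224, device=device)
          print(torch.cuda.get_device_name(0) if device == "cuda" else "CPU fallback")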


          Originally posted by Vorpal View Post
          Cool, I guess. But the numbers won't look as good if you also compare to running the same problem on the GPU, which most "serious" ML research does, especially on servers!
          Some pre/post processing or utility scripts run on the CPU, and sometimes users run the less expensive bits of a GPU model on CPU to save VRAM.

          There are even some work-in-progress frameworks that do this "automatically" on existing codebases, like Microsoft DeepSpeed and ColossalAI.

          DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective (https://github.com/microsoft/DeepSpeed).
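          As a rough illustration of that CPU-offload idea, here is a sketch of a ZeRO-style DeepSpeed setup that parks optimizer state on the host; the config keys follow the project's documentation, and the tiny Linear model is just a stand-in:

          Code:
          import torch
          import deepspeed

          model = torch.nn.Linear(4096, 4096)  # stand-in for a real network

          # ZeRO stage-2 config with optimizer state offloaded to CPU memory,
          # so the host cores share the load with the GPU.
          ds_config = {
              "train_micro_batch_size_per_gpu": 8,
              "bf16": {"enabled": True},
              "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
              "zero_optimization": {
                  "stage": 2,
                  "offload_optimizer": {"device": "cpu"},
              },
          }

          engine, optimizer, _, _ = deepspeed.initialize(
              model=model,
              model_parameters=model.parameters(),
              config=ds_config,
          )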




          Facebook is kinda infamous for historically running models too big to fit on GPUs, hence Intel has basically been making custom Facebook SKUs like Cooper Lake for years.

          In other cases something simple and/or infrequent like face detection is just not worth buying a GPU instance for if the CPU will get it done.
          Last edited by brucethemoose; 16 January 2023, 03:45 PM.



          • #6
            Originally posted by brucethemoose View Post

            ML benchmarks should run "out of the box" on regular RTX cards, or even on AMD cards with ROCm PyTorch builds.

            These don't look like they will overflow the VRAM pool.
            Right, but I'll obviously get criticism for comparing a $17k processor against a couple-hundred-dollar consumer card, which ultimately is likely not a very relevant comparison in practice...
            Michael Larabel
            https://www.michaellarabel.com/



            • #7
              They would be relevant, I guess, if they outperformed the CPU, which is certainly possible, even on consumer cards.



              • #8
                Originally posted by Michael View Post

                Right, but I'll obviously get criticism for comparing a $17k processor against a couple-hundred-dollar consumer card, which ultimately is likely not a very relevant comparison in practice...
                TBH an RTX 2080 (the equivalent of the still very common Nvidia Tesla T4) or an RTX 3060 will probably smoke Sapphire Rapids in these benchmarks.

                In the CUDA ML world, there is basically zero difference between a high-end RTX gaming card and the Quadro/server cards other than VRAM size. They perform the same, barring the top-end HBM cards like the A100, which have no desktop equivalent, and they are the same compilation target on the software side of things. So maybe readers will complain, but those complaints will just be nonsense.
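                One way to sanity-check the "same compilation target" point, assuming a CUDA build of PyTorch: a GeForce card reports the same compute capability as its datacenter sibling of the same generation.

                Code:
                import torch

                # GeForce cards and their workstation/server siblings of the
                # same generation report the same compute capability, i.e. the
                # same CUDA compilation target (RTX 2080 and Tesla T4 are both
                # sm_75).
                if torch.cuda.is_available():
                    major, minor = torch.cuda.get_device_capability(0)
                    print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")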



                What I would really be interested in is a "hybrid" benchmark with ColossalAI or DeepSpeed, where some GPU is doing the heavy lifting and the CPU is handling everything that doesn't fit into VRAM. Other than the aforementioned light-workload scenario, this is where Sapphire Rapids could really shine over EPYC as an ML server host. But I realize this is a tall ask.
                Last edited by brucethemoose; 16 January 2023, 04:01 PM.



                • #9
                  I am still in the dark about who these instructions are targeted at. AI/ML workloads run much better on a dedicated AI accelerator or GPU, and if you are in the market for one of these chips, a low-end compute engine of some sort would also be in consideration. It would probably make more sense to just put an ML accelerator unit on the chip, plus instructions to move data over to it more efficiently. Then again, that approach would be better for consumer chips, not enterprise. It is really surprising that neither AMD nor Intel has released chips with this in mind, as it has been done for years on ARM-based chips.

                  It would be nice to see more AI/ML that benefits end users and doesn't run in the cloud. Most of what I see is media processing, image categorization (at least on Apple), and some work in content creation (Photoshop).



                  • #10
                    Originally posted by jeisom View Post
                    I am still in the dark about who these instructions are targeted at. AI/ML workloads run much better on a dedicated AI accelerator or GPU, and if you are in the market for one of these chips, a low-end compute engine of some sort would also be in consideration. It would probably make more sense to just put an ML accelerator unit on the chip, plus instructions to move data over to it more efficiently. Then again, that approach would be better for consumer chips, not enterprise. It is really surprising that neither AMD nor Intel has released chips with this in mind, as it has been done for years on ARM-based chips.

                    It would be nice to see more AI/ML that benefits end users and doesn't run in the cloud. Most of what I see is media processing, image categorization (at least on Apple), and some work in content creation (Photoshop).
                    See the above posts, but basically it's for hilariously large models like Facebook runs, or for light-use instances where a dedicated PCIe accelerator just isn't worth the extra cost but a lesser CPU doesn't quite cut the mustard. AMX/OpenVINO is also extremely easy to "switch on" in PyTorch and such, which is not the case for some other accelerators.
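                    On the "easy to switch on" point, here is a sketch using Intel's extension for PyTorch; the ipex.optimize call is taken from that project's documentation rather than from the article's test setup:

                    Code:
                    import torch
                    import intel_extension_for_pytorch as ipex

                    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
                    model.eval()

                    # One call prepares the model for bf16 inference; on
                    # Sapphire Rapids the oneDNN kernels underneath dispatch
                    # matmuls to the AMX tile units.
                    model = ipex.optimize(model, dtype=torch.bfloat16)

                    with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
                        out = model(torch.randn(8, 1024))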

                    Centaur made an x86 CPU that is precisely what you are describing, and it was excellent on paper, but it never caught on: https://fuse.wikichip.org/news/3256/...s-an-ai-punch/

                    Intel laptop CPUs have had proprietary AI accelerators for years, but they probably need beefier, less niche designs for more "general" use. And AMD is now shipping an AI accelerator in their laptop chips, but it too needs some software enablement. The only one that "just works" right now is Apple CoreML, and it's actually quite good.



                    And yeah, the content-creation AI is coming like a tidal wave. There is already a brewing controversy over text-to-image and the accompanying Photoshop/GIMP/Krita plugins; it's just not mainstream yet because it's so finicky to set up (largely thanks to Nvidia :/).
                    Last edited by brucethemoose; 16 January 2023, 05:32 PM.

