Intel Advanced Matrix Extensions [AMX] Performance With Xeon Scalable Sapphire Rapids

Written by Michael Larabel in Processors on 16 January 2023 at 04:00 PM EST. Page 2 of 6.

At this early stage of software maturity around Advanced Matrix Extensions, the most notable user is Intel's own oneDNN library. Intel's oneAPI Deep Neural Network Library (oneDNN) provides optimized implementations of "deep learning building blocks" that are in turn used by various deep learning applications. Applications relying on oneDNN include Intel's OpenVINO along with other prominent software like TensorFlow, PyTorch, MATLAB, Microsoft's ONNX Runtime, PaddlePaddle, Deeplearning4j, Apache MXNet, and others.

In its Xeon Scalable Sapphire Rapids launch material, Intel cited "up to 8.6x" and "up to 10x" higher performance with Advanced Matrix Extensions. It's important to note, though, that those figures compare FP32 against BF16 with AMX, not BF16 with and without AMX enabled. Today's article compares like-for-like data types and only varies the exposed CPU ISA extensions.

Intel engineers have been working on oneDNN support for 4th Gen Xeon Scalable for a while and have had Advanced Matrix Extensions support in place for quite some time, going back to the 2.x releases. Last month marked the oneDNN 3.0 release, which improved performance for Sapphire Rapids and added early support for Granite Rapids, including the AMX-FP16 support being introduced there. The oneDNN library also continues to receive optimizations for the Intel Data Center GPU Max Series and Arc Graphics.

So while software support for AMX is still limited in scope, its presence in oneDNN means it can already be leveraged by the range of deep learning software built atop this open-source Intel oneAPI library. Intel's OpenVINO is already updated with AMX / Sapphire Rapids support, while Microsoft's ONNX Runtime 1.14, coming out soon, will also pull in the AMX support. OpenVINO stands for the "Open Visual Inference and Neural network Optimization" toolkit and contains a model optimizer and inference engine suited for maximizing deep learning performance on Intel hardware.

Making use of oneDNN has been my primary focus so far for benchmarking the AMX performance impact. Besides already having optimized AMX support (and the fact I've been using it for benchmarking for years...), what makes oneDNN ideal for this comparison is its JIT CPU dispatcher control. At run-time the oneDNN CPU dispatcher dynamically selects the best / most optimal ISA for a given processor, and the "ONEDNN_MAX_CPU_ISA" environment variable allows overriding the maximum ISA level oneDNN will use. Thus it's easy with oneDNN to evaluate performance with AMX and varying AVX-512 levels. The relevant ONEDNN_MAX_CPU_ISA levels for Xeon Scalable Sapphire Rapids benchmarking are:

- AVX512_CORE_AMX for the processor's full capabilities.
- AVX512_CORE_FP16 for AVX-512 FP16, which is also new to Sapphire Rapids.
- AVX512_CORE_BF16 for AVX-512 BF16, which was added with Cooper Lake, missed Ice Lake, and is now present with Sapphire Rapids.
- AVX512_CORE_VNNI for AVX-512 with VNNI, which has been around since Cascade Lake.
- AVX512_CORE for standard AVX-512 with its original core extensions.

The oneDNN dispatcher control also allows falling back to AVX/AVX2, but at least in my testing the benchmarks failed when forced to run without AVX-512. With OpenVINO it's likewise possible to force the maximum CPU ISA level for oneDNN via this same dispatcher control.

Thus today's article focuses on the performance of the Xeon Platinum 8490H processors with AMX compared to restricting oneDNN/OpenVINO to varying levels of AVX-512 support without AMX. While doing this AMX "on/off" comparison, the CPU power consumption of both processors was monitored via the RAPL interface, the CPU core temperatures were monitored, and the peak CPU frequency was recorded every second as the highest clock frequency observed across any of the 120 CPU cores in this dual-socket Intel Eagle Stream server.
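As a rough illustration of the RAPL-based power monitoring, average power can be derived from two samples of the cumulative energy counter that the Linux powercap interface exposes per socket. The energy values below are hard-coded hypothetical samples rather than live readings, so the arithmetic stays self-contained:

```shell
#!/bin/sh
# Sketch: average package power in watts from two RAPL energy samples.
# On a live system the counter would be read from e.g.
#   /sys/class/powercap/intel-rapl:0/energy_uj   (one node per socket),
# noting that the counter wraps at max_energy_range_uj.
rapl_watts() {
    # $1 = energy_uj at t1, $2 = energy_uj at t2, $3 = seconds elapsed
    awk -v e1="$1" -v e2="$2" -v dt="$3" \
        'BEGIN { printf "%.1f\n", (e2 - e1) / 1000000 / dt }'
}
rapl_watts 1000000000 1350000000 1   # hypothetical 1-second sample pair
```

With these sample values the joules-per-second calculation yields 350.0 watts; a monitoring loop would simply repeat this once per second per socket.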

The oneDNN 3.0 and OpenVINO 2022.3 benchmarking took place on Ubuntu 22.04 LTS with its GCC 11 compiler, upgraded to the Linux 6.1.4 kernel as the latest upstream stable release at the time of testing on this Xeon Platinum 8490H Sapphire Rapids server.

