Intel oneDNN 3.3 Brings More Performance Optimizations For Sapphire Rapids / AMX

Written by Michael Larabel in Intel on 7 October 2023 at 12:51 PM EDT.
In addition to x86-simd-sort 3.0 being released for speedy AVX-512 sorting, Friday also brought the release of oneDNN 3.3, the deep neural network library that is part of oneAPI and focused on helping developers build deep learning applications.

Intel oneDNN continues to support CPU-based execution not only on x86_64 but also on AArch64, POWER, and RISC-V, while also supporting AMD and NVIDIA GPU execution in addition to its Intel graphics support. The oneDNN library is heavily tuned for making the most of Intel hardware, and with oneDNN 3.3 there is more Advanced Matrix Extensions (AMX) tuning along with other alterations to benefit the latest-generation Xeon Scalable "Sapphire Rapids" processors. Plus oneDNN 3.3 rolls out more early optimization work for the next-generation Granite Rapids and Sierra Forest processors coming in 2024.

[Image: Intel Xeon Max CPUs]

The oneDNN 3.3 performance optimization work includes:
Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
- Improved performance for future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control (see the sketch after this list).
- Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
- Improved s32 binary primitive performance.
- Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instruction set support.
- Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
- Improved performance of convolution for depthwise cases with Graph API.
- [experimental] Improved performance of the LLAMA2 MLP block with Graph Compiler.

Intel Graphics Products:
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Reduced RNN primitive initialization time on Intel GPUs.

AArch64-based Processors:
- Improved fp32 to bf16 reorder performance.
- Improved max pooling performance with Arm Compute Library (ACL).
- Improved dilated convolution performance for depthwise cases with ACL.

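The "CPU dispatcher control" the release notes point to is oneDNN's mechanism for capping or opting into ISA code paths, exposed as the ONEDNN_MAX_CPU_ISA environment variable and a matching runtime function. Here is a minimal sketch of the programmatic route; the specific enum values shown (avx2_vnni_2 for Sierra Forest-class and avx512_core_amx_fp16 for Granite Rapids-class paths) are my assumption about which values map to these future processors, so check the dnnl::cpu_isa enum in your build:

// Sketch: opting in to ISA code paths that oneDNN keeps off by default.
// Assumes the standard dnnl.hpp C++ API; the chosen enum value is an
// assumption about which entry maps to the Granite Rapids paths.
#include <iostream>
#include <dnnl.hpp>

int main() {
    // Must run before any other oneDNN routine; once a primitive has been
    // created, the dispatcher selection is frozen.
    if (dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx512_core_amx_fp16)
            != dnnl::status::success)
        std::cerr << "ISA setting rejected (already locked or unsupported)\n";

    // Report what the dispatcher actually settled on for this machine.
    std::cout << "effective cpu_isa: "
              << static_cast<int>(dnnl::get_effective_cpu_isa()) << "\n";
    return 0;
}

The same effect is available without recompiling by setting the environment variable, e.g. ONEDNN_MAX_CPU_ISA=AVX512_CORE_AMX_FP16 before launching the application.
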
The oneDNN 3.3 release also adds group normalization primitive support, extended verbose mode output, new examples for the oneDNN Graph API, and other changes.
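
For those wanting to see what the "extended verbose mode output" looks like in practice, oneDNN's verbose tracing can be switched on either through the ONEDNN_VERBOSE environment variable or from code. A minimal sketch, assuming the standard dnnl.hpp C++ API:

// Sketch: enabling oneDNN verbose tracing programmatically, equivalent
// to running the application with ONEDNN_VERBOSE=1 in the environment.
#include <dnnl.hpp>

int main() {
    // Level 1 traces primitive execution; level 2 also traces creation.
    dnnl::set_verbose(1);

    // Any primitives executed after this point print one line per call
    // (implementation chosen, shapes, timing) to stdout.
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    return 0;
}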

Downloads and more details on the oneDNN 3.3 release are available via GitHub.