LLVM Clang Shows Off Great Performance Advantage On NVIDIA GH200's Neoverse-V2 Cores

Written by Michael Larabel in Software on 18 March 2024 at 11:20 AM EDT. Page 2 of 4.
QuantLib benchmark with settings of Configuration: Multi-Threaded. Clang 17 was the fastest.
QuantLib benchmark with settings of Configuration: Single-Threaded. Clang 17 was the fastest.

Right away, Clang showed it could outperform the GCC-built benchmarks/workloads on this NVIDIA GH200 server.

miniBUDE benchmark with settings of Implementation: OpenMP, Input Deck: BM1. Clang 17 was the fastest.
miniBUDE benchmark with settings of Implementation: OpenMP, Input Deck: BM2. Clang 17 was the fastest.
LULESH benchmark. GCC 13 was the fastest.
LAMMPS Molecular Dynamics Simulator benchmark with settings of Model: 20k Atoms. Clang 17 was the fastest.
LAMMPS Molecular Dynamics Simulator benchmark with settings of Model: Rhodopsin Protein. Clang 17 was the fastest.

Across various HPC workloads, the Clang-built AArch64 binaries were significantly faster than those built with the current GCC 13 stable series. Then again, Clang has been competitive with GCC on both x86_64 and AArch64 for a while, though typically not to the extremes seen in this round of testing. Given that Clang is more common on AArch64 due to its use by Apple, Android, etc., the nice performance wins aren't too surprising.

Zstd Compression benchmark with settings of Compression Level: 19, Compression Speed. Clang 17 was the fastest.
Zstd Compression benchmark with settings of Compression Level: 19, Decompression Speed. GCC 13 was the fastest.
Zstd Compression benchmark with settings of Compression Level: 19, Long Mode, Compression Speed. Clang 17 was the fastest.
Zstd Compression benchmark with settings of Compression Level: 19, Long Mode, Decompression Speed. GCC 13 was the fastest.

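GCC 13 did pick up a few wins in the Zstd benchmarks: it was faster on decompression speed at compression level 19, with and without long mode, while Clang 17 led on compression speed.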

WebP Image Encode benchmark with settings of Encode Settings: Default. Clang 17 was the fastest.
WebP Image Encode benchmark with settings of Encode Settings: Quality 100. Clang 17 was the fastest.
WebP Image Encode benchmark with settings of Encode Settings: Quality 100, Lossless. Clang 17 was the fastest.
WebP Image Encode benchmark with settings of Encode Settings: Quality 100, Highest Compression. Clang 17 was the fastest.
WebP Image Encode benchmark with settings of Encode Settings: Quality 100, Lossless, Highest Compression. Clang 17 was the fastest.

Even for workloads like WebP image encoding, the Clang-built binaries were faster on the Neoverse-V2 CPUs.

