**carewolf** · 03 April 2018, 10:56 AM

The CacheBench and C-ray benchmarks are both off by a factor of 4 in opposite directions, so that could be a missed autovectorization of the innermost loop. The LAME encoding result is weirder.
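
One way to check whether a compiler actually vectorized an innermost loop is to ask it for an optimization report: GCC prints one with `-fopt-info-vec` (plus `-fopt-info-vec-missed` for loops it gave up on), and Clang with `-Rpass=loop-vectorize` (plus `-Rpass-missed=loop-vectorize`). The kernel below is a minimal sketch, assuming a CacheBench-style write loop; `write_fill` is a hypothetical name, not CacheBench's actual code.

```c
#include <stddef.h>

/* Hypothetical CacheBench-style "write" kernel (not the real benchmark code):
 * store one value into every element. Iterations are independent, so the
 * loop is a candidate for autovectorization.
 *
 * Ask the compiler whether it vectorized the loop:
 *   gcc   -O3 -fopt-info-vec -fopt-info-vec-missed -c write_fill.c
 *   clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c write_fill.c
 */
void write_fill(double *restrict buf, size_t n, double value)
{
    for (size_t i = 0; i < n; i++) /* the innermost loop */
        buf[i] = value;
}
```

Comparing the two reports for the same source file shows quickly whether one compiler skipped a loop that the other vectorized.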

**edwaleni** · 03 April 2018, 02:01 PM

Originally posted by carewolf

The CacheBench and C-ray benchmarks are both off by a factor of 4 in opposite directions, so that could be a missed autovectorization of the innermost loop. The LAME encoding result is weirder.

I was guessing that the OpenMP directive for parallel SIMD was missing, but the LAME MP3 results say otherwise. Very odd.
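
For reference, the OpenMP directive in question is `#pragma omp simd`, which asks the compiler to vectorize a loop and can declare reduction variables. A minimal sketch, assuming a hypothetical dot-product kernel (`dot` is illustrative, not taken from any of the benchmarks):

```c
#include <stddef.h>

/* Hypothetical kernel: `#pragma omp simd` requests vectorization and marks
 * `sum` as a reduction, so per-lane partial sums are combined at the end.
 *
 * The pragma only takes effect with -fopenmp or -fopenmp-simd (GCC, Clang).
 * Without either flag it is ignored and the loop compiles as plain scalar C.
 * Because the reduction reorders additions, results may differ from the
 * scalar loop in the last bits of floating-point rounding.
 */
double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
#pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Building with `-fopenmp-simd` enables only the SIMD directives, without pulling in the OpenMP threading runtime.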

**acsawdey** · 03 April 2018, 06:18 PM

GCC 7 and 8 both vectorize the loop in the CacheBench write test. What they do not do (and LLVM apparently does) is unroll it. If you build with -funroll-all-loops, the CacheBench write result improves by 3x on a POWER8; it should be similar on POWER9.
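
Unrolling by 4 replicates the loop body so that each trip performs four stores and pays the compare, branch, and index-update cost only once. A minimal hand-written sketch, assuming a simple write loop (`write_fill_unrolled4` is hypothetical); GCC's -funroll-all-loops applies this to its own internal representation, chooses the unroll factor itself, and in the case described above would unroll the already-vectorized loop rather than the scalar one shown here.

```c
#include <stddef.h>

/* Hypothetical hand-unrolled write loop, illustrating the shape of the
 * transformation only; it is not the code GCC emits for -funroll-all-loops.
 * Loop-control overhead is paid once per four stores instead of once per
 * store.
 */
void write_fill_unrolled4(double *restrict buf, size_t n, double value)
{
    size_t i = 0;

    /* Main body: four stores per iteration. */
    for (; i + 4 <= n; i += 4) {
        buf[i]     = value;
        buf[i + 1] = value;
        buf[i + 2] = value;
        buf[i + 3] = value;
    }

    /* Remainder: the last n % 4 elements. */
    for (; i < n; i++)
        buf[i] = value;
}
```

The remainder loop is what keeps the transformation correct when `n` is not a multiple of the unroll factor.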

**ThinkOpenly** · 04 April 2018, 11:51 AM

Is optimization enabled for CacheBench, 7-Zip Compression, LAME, or Redis? The "gcc options" footnote doesn't show that it is.

**eSyr** · 06 April 2018, 09:25 PM

Have you tried using -O3? I've heard it makes a bigger difference on POWER.

GCC 7.3 vs. GCC 8.0 vs. LLVM Clang 6.0 On The POWER9 Raptor Talos II