IBM Scores More POWER Open-Source Performance Optimizations

Written by Michael Larabel in Hardware on 19 August 2018 at 08:31 AM EDT.
Following our POWER9 Linux benchmarks earlier this year, IBM POWER engineers have continued exploring areas for optimization within the open-source workloads tested. Another batch of optimizations is pending for various projects.

Recently there was the 3.3x performance improvement for FLAC on the POWER architecture, and they have had a number of other optimization victories too. Thanks to the Phoronix Test Suite being open-source and freely available, it was easy for the IBM engineers to dig in and analyze the various workloads under test to scout out optimizations. Some of their recent optimizations and other discoveries include:

- For the Parboil scientific benchmark, POWER performs better with a larger data-set, since POWER9 CPUs have many CPU threads available. Compiler tuning also improves performance compared to just using the upstream program's stock flags.

- With x264 there is slightly better performance when using auto-vectorization, but generally speaking there is more POWER hand-tuning that can be done, whereas the x86_64 code has already been hand-tuned. x264 also doesn't scale well to the 100+ threads commonly found on POWER9 systems.

- They have a patch pending for Primesieve to consider SMT threads when blocking for L1 cache. This patch can improve the POWER9 performance by about 12%.

- With LAME MP3 encoding they found the program's configure script results in no optimizations being applied. They have a patch pending upstream that enables the optimizations and achieves 6~8% better performance.

- For FLAC audio encoding they have patch work pending to add POWER-specific vector instructions to the FLAC encoder to yield around a 3x performance increase.

- With the latest OpenSSL code in 1.1.1-pre1 or applying patches to 1.1.0f there are some POWER-specific optimizations.

- For the SciKit-Learn benchmark, there is room for significantly faster performance by tuning the distribution's BLAS library or by swapping out the BLAS library in use for libatlas or libopenblas.

- When doing our POWER benchmarking comparisons we use the system/blender test profile, which uses the distribution's default Blender installation, rather than pts/blender, which downloads the (x86/x86_64) Blender binaries for the platform. (Thanks to PTS versioning, it's always the same test profile version/copy being used and compared with what's in the result file.) With the system/blender test there was an oversight causing the GPU blend files to always be used rather than the CPU blend files. That has now been fixed, so system/blender better leverages CPU threads/cores across all platforms.
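To illustrate the idea behind the Primesieve item above, here is a minimal sketch of sizing a sieve segment so that all SMT threads sharing a core fit in that core's L1 data cache together. This is not the pending patch itself: the function name `segment_bytes`, the rounding to a power of two, and the example figures (a 32 KiB L1 data cache and four SMT threads per core) are assumptions made purely for illustration.

```python
def segment_bytes(l1d_bytes: int, smt_threads_per_core: int) -> int:
    """Return a per-thread sieve segment size in bytes (hypothetical helper).

    The L1 data cache of a core is shared by its SMT threads, so each
    thread is given an equal share, rounded down to a power of two.
    """
    if l1d_bytes <= 0 or smt_threads_per_core <= 0:
        raise ValueError("cache size and thread count must be positive")
    # Equal share of the cache per hardware thread, at least one byte.
    share = max(l1d_bytes // smt_threads_per_core, 1)
    # Round down to the nearest power of two.
    return 1 << (share.bit_length() - 1)


if __name__ == "__main__":
    l1d = 32 * 1024  # assumed L1 data cache size, for illustration only
    for smt in (1, 2, 4):
        print(f"SMT{smt}: {segment_bytes(l1d, smt)} bytes per thread")
```

Ignoring SMT corresponds to passing `smt_threads_per_core=1`, in which case every thread would size its segment for the whole cache and sibling threads would evict each other's data.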

So overall there is more POWER tuning happening, plus for a few programs they found some generic changes/improvements that will benefit all architectures. Great to see this happening all around. I'm still waiting on the IBM POWER hardware, but once it arrives I'll certainly be running some fresh tests, and I look forward to their POWER patches working their way upstream into the various projects. All the details can be found in this blog post.