AMD Bulldozer "bdver1" Compiler Performance
Phoronix: AMD Bulldozer "bdver1" Compiler Performance
It is time for another round of compiler benchmarks on AMD's latest FX-8150 Bulldozer processor. This article compares the GCC 4.6.1, GCC 4.7 development, Open64 4.2.4, and AMD Open64 126.96.36.199 compilers in three configurations: stock; rebuilt with the march/mtune flags set to "bdver1"; and rebuilt with the "bdver1" architecture and tuning flags along with "-Ofast" for the highest level of compiler optimizations.
Again, these types of benchmarks just make me wonder if Michael knows anything about compiler flags.
We see HUGE differences between stock/bdver1+Ofast and bdver1 on some tests. I keep thinking that Michael just can't be so stupid that 'bdver1' means he is ONLY setting "-march=bdver1" and not setting an -O flag, but I fear that this is exactly what's happening on those tests.
Stock settings I assume are the CFLAGS/CXXFLAGS which the test packages ship with, which likely are -O2 or -O3. Setting -Ofast on GCC equals '-O3 -ffast-math'. When NOT setting a -O flag on GCC, it defaults to NO optimization (-O0).
So in the tests where (I assume) Michael only sets CFLAGS to '-march=bdver1', GCC falls back to its default of -O0, which is NO optimization, making that benchmark totally pointless. '-march=' only tells the compiler which CPU target to generate code for; -O is what actually turns on optimization at different levels of strength.
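To make the point concrete, here is a minimal Python sketch of how the effective -O level falls out of a CFLAGS string. It only models two GCC behaviours: no -O flag means -O0, and when several -O flags are given the last one wins. The function name `effective_opt_level` is made up for illustration, and the two flag strings are the ones I am assuming were used, not anything confirmed by the article.

```python
import shlex


def effective_opt_level(cflags):
    """Return the -O level GCC would end up using for a CFLAGS string.

    Hypothetical helper: it models only "no -O flag means -O0" and
    "the last -O flag on the command line wins".
    """
    level = "-O0"  # GCC's default when no -O flag is present
    for flag in shlex.split(cflags):
        if flag.startswith("-O"):
            # A bare "-O" is equivalent to "-O1"
            level = "-O1" if flag == "-O" else flag
    return level


# Assumed flag sets for two of the article's configurations
for label, cflags in [
    ("bdver1", "-march=bdver1"),
    ("bdver1 + Ofast", "-Ofast -march=bdver1"),
]:
    print(f"{label:15} {cflags:22} -> {effective_opt_level(cflags)}")
```

If the 'bdver1' run really was built with only '-march=bdver1', this is the -O0 case.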
Meanwhile, I'm glad that he is now venturing outside the stock settings when benchmarking. Stock settings (as in the tarball from the author) can be tuned to a particular compiler, or set very low (even -O0) with the author(s) expecting the end user/packager to set appropriate optimization levels according to their needs.
To make the Phoronix tests at least somewhat reflect the capacity of the compilers, at least -O2 and -O3 should be tested for each package. For tests relying on floating-point math I would say -ffast-math is warranted as well.
Michael needs to spend some quality time with Gentoo to learn which CFLAGS actually matter.
On that note, the results would be semi-useful if he benchmarked with -O2 -march=native. Most Gentoo users tend to use -O2 -march=native. That makes GCC pick what it thinks is best and GCC does a good job of it. For benchmarking purposes, it would be better to figure out the flags that -march=native sets and then set those manually. That way people don't need to guess what GCC was thinking on the CPU. He can do that by running "gcc -Q -v -march=native -O2 --help=target", which will usually output the real flags that GCC uses.
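A rough Python sketch of turning that listing into an explicit flag list follows. It assumes the `--help=target` output has one option per line, with either a value (for options ending in "=") or an [enabled]/[disabled] marker after it. The `SAMPLE` text is hand-written in that shape for illustration and is not real output from any machine, and `explicit_flags` is a made-up name.

```python
def explicit_flags(help_text):
    """Collect explicit flags from `gcc -Q --help=target` style text.

    Assumes each relevant line is "<option> <value>" where the value is
    either a setting for an option ending in "=" or "[enabled]"/"[disabled]".
    """
    flags = []
    for line in help_text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].startswith("-m"):
            continue  # skip headings and options printed without a value
        option, value = parts
        if option.endswith("="):
            flags.append(option + value)  # e.g. -march=bdver1
        elif value == "[enabled]":
            flags.append(option)  # e.g. -mavx
    return flags


# Hand-written sample in the assumed format, NOT captured from a real run
SAMPLE = """\
The following options are target specific:
  -m64                        [enabled]
  -mavx                       [enabled]
  -m3dnow                     [disabled]
  -march=                     bdver1
  -mtune=                     bdver1
"""

print(" ".join(explicit_flags(SAMPLE)))
```

On a real system the text would come from running the gcc command above and capturing its standard output, for example with `subprocess.run(..., capture_output=True, text=True)`.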
Last edited by Shining Arcanine; 11-15-2011 at 05:45 AM.
I wonder if the Gcrypt tests failed due to -ffast-math.
Anyone on here with Gentoo and PTS installed who can repeat these tests with proper compiler flags set?
Any idea why Open64 5 wasn't tested?
So am I reading the povray graph right when I say: going from generic binaries + optimization to bdver1 binaries + no optimization makes it 25% faster and then from bdver1 binaries + no optimizations to bdver1 binaries + Ofast optimization is only very little improvement?
Originally Posted by XorEaxEax
Also, afaik "-march=native -mtune=native" is redundant: -mtune only tunes within the range that -march allows, and -march=native already implies -mtune=native.
It would be interesting to compare performance with generic CFLAGS against the CPU-specific ones. Benchmarking -O0 against -Ofast is mostly pointless.
In particular, the following I'd like to see compared:
- -O2: this is what most binary distributions use
- -O2 -mtune=native: this will produce code which is optimized for the current platform, but still runs on others (maybe slower)
- -O2 -march=native: this can in addition use instructions which are specific to the CPU
No, obviously on the Povray test there was an -O flag set, since the difference between 'bdver1' and 'bdver1 + Ofast' was so small. Assuming that the -O level was -O3 for both, the difference would be due to -ffast-math, which is enabled by -Ofast and can improve floating-point math performance.
Originally Posted by ChrisXY
However, looking at the other tests (GraphicsMagick, GCrypt), it's obvious that they are not using the same optimization (-O) level, since that's the only thing that can differ between '-march=bdver1' and '-march=bdver1 -Ofast'. The HUGE difference in performance between those benchmarks shows that totally different optimization levels are at play. GCC defaults to -O0 (NO optimization) when no -O level is set, and Michael omitting any -O setting would explain those incredibly poor results.
So the results of the GraphicsMagick, GCrypt tests indicate that the flags were:
'stock' = -O2 or -O3 I assume
'bdver1 + Ofast' = -Ofast -march=bdver1
'bdver1' = -march=bdver1 (thus GCC defaulting to -O0 which is no optimization at all)
Again, it's so sad that there's no reason to the way Michael organizes these tests. To properly test each compiler you would at least compile each package using -O2 and -O3, and preferably add a test where -march=native is used as well. As it is, Michael states that he uses whatever settings the packages came with, which may be the optimal settings for compiler A but not for compiler B, and he doesn't even declare those settings in the tests so we can judge the results with proper data at our disposal.
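The kind of test matrix I mean could be generated like this. It is a small Python sketch, not how PTS actually works. The compiler names passed in are placeholders, the flag spellings are GCC's (other compilers may need different ones), and `build_matrix` is a hypothetical helper.

```python
from itertools import product

OPT_LEVELS = ["-O2", "-O3"]
ARCH_FLAGS = ["", "-march=native"]  # "" means a generic build


def build_matrix(compilers):
    """Return one (compiler, cflags) pair per combination to benchmark."""
    runs = []
    for cc, opt, arch in product(compilers, OPT_LEVELS, ARCH_FLAGS):
        # Join only the non-empty flags so generic builds get just the -O level
        cflags = " ".join(flag for flag in (opt, arch) if flag)
        runs.append((cc, cflags))
    return runs


# Placeholder compiler names; print the exact flags so every result is declared
for cc, cflags in build_matrix(["gcc-4.6", "gcc-4.7"]):
    print(f"CC={cc} CFLAGS='{cflags}'")
```

Printing the CC/CFLAGS pair next to each result is the important part: anyone reading the graphs can then see exactly what was compared.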
Yes, which is why I only mention '-march'.
Originally Posted by ChrisXY
All performance tests should always be done on optimized code (most used level is -O2).
Option -march=bdver1 alone does not turn on any code optimizations, such as constant propagation or even dead code removal. Optimizations rewrite the code to use fewer instructions, which gives rather dramatic results. The march option tells the compiler which instructions are available and which are faster than others doing the same job; that effect is negligible compared to the time saved when instructions don't have to be executed at all.
I agree the results in GraphicsMagick suggest optimizations were turned off for the plain bdver1 run.
anything new on the "bedevere" optimisations?
a fully optimised bulldozer system would be interesting to see in action...