GCC 4.6 Compiler Performance With AVX
Phoronix: GCC 4.6 Compiler Performance With AVX
While we are still battling issues with the Intel Linux graphics driver in getting that running properly with Intel's new Sandy Bridge CPUs (at least Intel's Jesse Barnes is now able to reproduce the most serious problem we've been facing, but we'll save the new graphics information for another article), the CPU performance continues to be very compelling. Two weeks ago we published the Intel Core i5 2500K Linux benchmarks that showed just how well this quad-core CPU that costs a little more than $200 USD is able to truly outperform previous generations of Intel hardware. That was just with running the standard open-source benchmarks and other Linux software, which has not been optimized for Intel's latest micro-architecture. Version 4.6 of the GNU Compiler Collection (GCC) though is gearing up for release and it will bring support for the AVX extensions. In this article, we are benchmarking GCC 4.6 on a Sandy Bridge system to see what benefits there are to enabling the Core i7 AVX optimizations.
GCC 4.6.x Gcrypt, GraphicsMagick and HMMer results are so abysmal, bugs must be filed immediately.
Many of these results are very unsurprising.
One would not expect any kind of change in an HTTP server from AVX/SSE/MMX/AltiVec/NEON, except perhaps in SSL performance or gzip compression performance (which may not be tested by that benchmark, I suspect).
In many other cases, AVX is basically going to perform identically to SSE2. In a few cases with auto-vectorization of code it's possible for the compiler to cut the number of vector instructions down dramatically (twice as many components per vector in AVX as in SSE).
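To make the auto-vectorization point concrete, here is a minimal sketch of the kind of loop a compiler can vectorize on its own. The function name `saxpy` and the compile lines in the comments are illustrative assumptions, not something taken from the article's test setup.

```c
#include <stddef.h>

/*
 * y[i] = a * x[i] + y[i] over plain float arrays.
 * Each iteration is independent, so the compiler is free to process
 * several elements per instruction: up to 4 floats in a 128-bit SSE
 * register, up to 8 floats in a 256-bit AVX register.
 *
 * Hypothetical compile lines for comparing the two (assumed, not the
 * flags Phoronix used):
 *   gcc -O3 -msse2 -c saxpy.c
 *   gcc -O3 -mavx  -c saxpy.c
 *
 * "restrict" promises the compiler that x and y do not overlap,
 * which removes one common obstacle to vectorization.
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the AVX build is actually faster depends on the CPU, memory bandwidth and array size; the sketch only shows the code shape that gives the compiler a chance.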
In most cases with highly optimized code bases, they are using hand-rolled SSE code. So compiling with AVX turned on will have no effect because the code is explicitly using 128-bit SSE.
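A rough sketch of what "hand-rolled SSE" looks like in practice. The function name `add_arrays_sse` is made up for illustration, and the `__SSE__` guard is there only so the file still builds on targets without SSE.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/*
 * out[i] = a[i] + b[i], written explicitly with 128-bit SSE intrinsics.
 * The vector width (4 floats) is fixed in the source by the __m128 type,
 * so turning on AVX at compile time does not widen these operations to
 * 8 floats; the code would have to be rewritten with 256-bit intrinsics.
 */
void add_arrays_sse(size_t n, const float *a, const float *b, float *out)
{
    size_t i = 0;

#if defined(__SSE__)
    /* Main loop: four floats per iteration, unaligned loads/stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif

    /* Scalar tail (or the whole array when SSE is unavailable). */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```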
For a lot of things like graphics, where the code is very explicitly written around four-component vectors, the most AVX is going to offer is the ability to use double-precision floats instead of single-precision floats. But nobody is actually using double-precision there, because it eats up twice as much memory and twice as much bandwidth to the GPU, and it makes things even slower because the GPUs don't use double-precision floats internally, so the driver has to manually convert those buffers from double to single precision before uploading them to the GPU. A particularly clever programmer could manage to pair up vector operations so that one AVX instruction performs two of them simultaneously, but that code will be complex, and trying to write/maintain it will be pure hell compared to using an SSE-based vector class.
The apps that will benefit the most from AVX extensions with this option are applications that (a) have not been hand-optimized to already use SSE primitives and (b) make use of large arrays processed in loops that can actually be auto-vectorized.
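As a sketch of what "can actually be auto-vectorized" means, compare the two loops below. Both function names are invented for this example, and which loops a given GCC version vectorizes is something to check with the compiler's own reports rather than assume.

```c
#include <stddef.h>

/*
 * Good candidate: every output element depends only on the matching
 * input element, so iterations can be done several at a time.
 */
void scale_array(size_t n, float k, const float *restrict in,
                 float *restrict out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = k * in[i];
    }
}

/*
 * Poor candidate: out[i] needs out[i - 1] from the previous iteration
 * (a loop-carried dependency), so as written the iterations have to
 * run in order and wider vector registers do not help much.
 */
void running_sum(size_t n, const float *in, float *out)
{
    if (n == 0) {
        return;
    }
    out[0] = in[0];
    for (size_t i = 1; i < n; i++) {
        out[i] = out[i - 1] + in[i];
    }
}
```

GCC releases of that era had a `-ftree-vectorizer-verbose=` option for reporting which loops were vectorized; treat the exact option name for your compiler version as something to confirm in its manual.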
I'm fairly sure that will mostly boil down to scientific applications and a handful of unsupported and previously slow as crap codec libraries.
Well, Intel hired CodeSourcery (long-time contributors to GCC) to work specifically on GCC optimizations for the Core iX range, and AFAIK those optimizations were not ready in time to be included in GCC 4.6, so chances are there's a lot more performance coming our way in future GCC releases.
Please can you try these tests again, but without LTO switched on?
From my testing (even on 4.6), it causes huge regressions in the speed and size of all executables. It isn't as bad when using gold as the linker, but glibc can't be compiled with that yet.
Also what flags are being used to compile the software you're running?
That is obviously because you are not using -flto properly. You need to pass the optimization flags (CFLAGS, CXXFLAGS) to the linker as well (LDFLAGS), otherwise the resulting code will not be optimized at all and will therefore be larger and slower. Read the documentation on -flto. AFAIK this won't be necessary in GCC 4.6 when it is released, but it certainly is on 4.5.x. As for performance improvements with -flto, that very much depends on the program as usual, but it pretty much always manages to cut the executable size down by a good margin.
Originally Posted by FireBurn
You've pointed out the issues exactly: AVX only helps in the spots of code that can use its double-width vectors. Also, at least AMD has said that its first-generation AVX will be implemented internally in microcode as two SSE operations. I do not have any info on how Intel did it, but even code that does hit the AVX optimizations will probably not show dramatic gains.
Originally Posted by elanthis
In the end, I just hope the benchmarks will focus more on workloads that can take full advantage of those gains.
For example, FFmpeg can be compiled with its hand-written ASM disabled, and if its C code then hits patterns the compiler can auto-vectorize, it will likely get some speedup. The same goes for a renderer or scientific code.
Since Phoronix tests on Linux, I think a big speedup is unlikely to be noticed there: the whole desktop is built to run with just SSE2, which even Atom CPUs support, and some components are written in Python and so on.
Also, since these results are fairly predictable, it would be better to benchmark things like a kernel picking up a new scheduling strategy (as BFS was). Otherwise most of these results are just noise, and by and large I personally think that does a disservice to the compiler and the hard work of the GCC team.
Lately I have found it much more fun to test Firefox's JS performance myself than to read those benchmarks, and many more people are affected by how a real browser performs.
Mono has LLVM JIT support. How much is the start-up time of a big app (MonoDevelop comes to mind) affected? What about testing its raw number-crunching performance against a GCC/C++ port of the same code, or something along those lines?