I saw all the linked results as well.
Basically, a few percent improvement. About 4-5% over stock.
Free performance is always good, but maybe not at the cost of 3X the build time and 2.5X the RAM use.
Phoronix: Link-Time Optimizations With GCC 4.8
GCC 4.8 will feature a few improvements when it comes to LTO, a.k.a. Link-Time Optimization, but will this reflect in any greater performance for the resulting binaries?..
http://www.phoronix.com/vr.php?view=MTI5ODE
I seriously doubt Michael is using LTO correctly.
When you compile with just a single command, like gcc -march=native -O3 -flto -fwhole-program ..., it works fine. But when you use a makefile with separate C(XX)FLAGS and LDFLAGS, you need to pass the C(XX)FLAGS along to the LDFLAGS, or else the optimization will suffer greatly. So you should do something like this:
CXXFLAGS = -O3 -march=native -flto -fwhole-program
LDFLAGS = $(CXXFLAGS) -Wall
I've done many LTO comparisons, and there isn't always a gain (a lot of the benefits of LTO can be had by just defining functions as static when appropriate), but I've never come across regressions like the ones shown here in Michael's tests. Hence I'm thinking he is not passing the C(XX)FLAGS along to the linker through the LDFLAGS in the tests that use a makefile with separate C(XX)FLAGS/LDFLAGS, which in turn means the C(XX)FLAGS optimizations aren't being used when generating the final binary.
AFAIK you need to pass the optimization flags as well; at least I recall having to do so the last time I benchmarked LTO (which was on 4.7, not 4.8), so:
CXXFLAGS = -O3 -march=native -flto -fwhole-program
LDFLAGS = -O3 -march=native -flto -fwhole-program -Wall (... and whatever other linker options you have)
or just reference the CXXFLAGS variable as I did above:
LDFLAGS = $(CXXFLAGS) -Wall
I believe this is necessary because LTO can be used on object files written in different languages, but I may be wrong. I haven't really dived into LTO, as I haven't gotten any major gains from it for my own code, particularly when compared to PGO, which pretty much always yields gains, often significant ones.
I'd never heard of PGO until now, but would love to see some recent benchmarks. Most of the articles I saw were reporting up to ~10% gains.
Also, from man gcc:
Code: To use the link-time optimizer, -flto needs to be specified at compile time and during the final link.
No, then you get -O0 optimizations. LTO means link-time optimizations, which means the linker does the optimizations, which again means the linker needs the optimization flags, but the compiler does not.
So
CXXFLAGS = -flto
LDFLAGS = -O3 -march=native -flto -fwhole-program
Would work, but your example would not.
Note that you can also speed up the compilation even more by disabling fat object files. By default, GCC produces object files that contain both the code for LTO linking and traditional object code; the latter is not needed if you are going to use LTO on the final link anyway. Edit: Use -fno-fat-lto-objects as a compile-time flag.
Last edited by carewolf; 02-10-2013 at 01:21 PM.
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Additionally, the optimization flags used to compile individual files are not necessarily related to those used at link time. For instance,
gcc -c -O0 -flto foo.c
gcc -c -O0 -flto bar.c
gcc -o myprog -flto -O3 foo.o bar.o
This produces individual object files with unoptimized assembler code, but the resulting binary myprog is optimized at -O3. If, instead, the final binary is generated without -flto, then myprog is not optimized.