Phoronix: Optimizing Mesa Performance With Compiler Flags
Compiler tuning can lead to performance improvements for many computational benchmarks by toying with the CFLAGS/CXXFLAGS, but is there much gain out of optimizing your Mesa build? Here's some benchmark results...
http://www.phoronix.com/vr.php?view=MTI4NTY
I would be very interested in how -O1, -O2 and -O3 compare to -Os (optimize for size). When code is smaller you can get fewer instruction cache misses, which can lead to faster execution. Ruby is known to run faster with -Os.
I guess the bottleneck of most video games is not OpenGL, unless the game is designed for high-end graphics cards. Check this with any profiler: gl... calls are almost unnoticeable among the game physics and logic. Compiling the actual software and its main libraries, instead of the driver, could give a very different result.
So the flags do exactly what the manpage says: -O2 is a good, stable optimization level, while -O3 needs more compile time and may or may not improve the resulting binary, so it is mostly a waste of energy and time (unless you like playing around and consider compiling Linux with all flag permutations a game). I would only enable it for single applications if I am not satisfied with -O2 (it seemed that ffmpeg gained a little performance from -O3, but I did not benchmark this).
In my experience, in most cases -O3 does not improve performance noticeably (as in the article), and additionally the -Os and -O3 flags can break programs with unexpected segfaults.
So the only compile flags I have used for years are -march=..., -O2 and, for gcc, -pipe.
For software it is better anyway to use efficient algorithms to solve a problem. No compiler optimization can turn an exponential algorithm into a linear one; it just produces slightly better exponential code (or not).
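To make the algorithm point concrete, here is a minimal C sketch. The function names `fib_naive` and `fib_linear` are made up for this example. As far as I know, no -O level rewrites the first function into the second; the flags only change the constant factor.

```c
#include <stdint.h>

/* Naive recursion: the number of calls grows exponentially with n. */
uint64_t fib_naive(unsigned n)
{
    return n < 2 ? n : fib_naive(n - 1) + fib_naive(n - 2);
}

/* Iterative version: one loop pass per step, so linear in n. */
uint64_t fib_linear(unsigned n)
{
    uint64_t a = 0, b = 1;
    for (unsigned i = 0; i < n; i++) {
        uint64_t next = a + b;
        a = b;
        b = next;
    }
    return a;
}
```

Both return the same values; only the amount of work differs.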
There are quite big differences between O2 and O3 with some software, especially if it's C++ with templates.
Bullet physics was close to 10x slower with O2, same result with Os, when compared to O3 last I tested.
While you maybe can't optimize for Core2 for compatibility reasons, it is certainly safe to enable use of SSE and SSE2 in 32-bit i965. This optimization could perhaps be done.
There are indeed i965 chipsets that support the Celeron M processor (and some motherboards may unofficially support Pentium 4 CPUs). That processor does not have the SSE3 and SSSE3 support that the Core 2 has. The build could probably be optimized for Pentium M/Celeron M instead; then at least SSE2 would be enabled.
This is an interesting observation. I checked the manpage and searched for differences between -O2 and -O3, then thought about differences between C and C++.
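If you want to check whether a given build really has SSE2 enabled, a small sketch like the following could help. It assumes GCC or Clang, which predefine the `__SSE2__` macro when SSE2 code generation is on (for example via -msse2 or a suitable -march=...). The function name `built_with_sse2` is made up for this example.

```c
/* Returns 1 if this translation unit was compiled with SSE2 code
   generation enabled (GCC and Clang predefine __SSE2__ then), else 0. */
int built_with_sse2(void)
{
#ifdef __SSE2__
    return 1;
#else
    return 0;
#endif
}
```

On x86-64 this should always return 1, since SSE2 is part of the baseline there; on a 32-bit build it depends on the flags.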
Let's see what we can find there.
-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone options.
This affects C as well. It looks like a function call is replaced by the function's code. This should result in less stack usage, but the function has to be so simple that setting up a new stack frame costs more than executing the function itself. Seems to be relatively useless.
-finline-functions
Integrate all simple functions into their callers. The compiler heuristically decides which functions are simple enough to be worth integrating in this way.
If all calls to a given function are integrated, and the function is declared static, then the function is normally not output as assembler code in its own right.
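A rough sketch of what -finline-functions amounts to, written out by hand. The names `square`, `sum_squares` and `sum_squares_inlined` are made up, and the second loop is only an approximation of what the compiler might produce. Besides saving the call, inlining exposes the callee's body to the optimizations that run afterwards, which is often where the real gain comes from.

```c
/* Small helper. It is static, so the compiler may drop the standalone
   copy once every call to it has been inlined. */
static int square(int x)
{
    return x * x;
}

/* As written: one call to square() per element. */
int sum_squares(const int *v, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += square(v[i]);
    return sum;
}

/* Roughly what the loop looks like after inlining: no call overhead,
   and the multiply is visible to later passes. */
int sum_squares_inlined(const int *v, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i] * v[i];
    return sum;
}
```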
Sounds more like a case for a warning that someone should write more efficient code. This is not C++ specific.
-funswitch-loops
Move branches with loop invariant conditions out of the loop, with duplicates of the loop on both branches (modified according to result of the condition).
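Here is a hand-written sketch of what -funswitch-loops describes. The functions `fill` and `fill_unswitched` are made up for illustration; the second one is what I would expect the transformation to look like, not actual compiler output.

```c
/* As written: `scale` never changes inside the loop, yet it is tested
   on every iteration. */
void fill(int *out, const int *in, int n, int scale)
{
    for (int i = 0; i < n; i++) {
        if (scale)
            out[i] = in[i] * 2;
        else
            out[i] = in[i];
    }
}

/* Hand-unswitched: the test is hoisted out and the loop is duplicated,
   once for each outcome of the condition. */
void fill_unswitched(int *out, const int *in, int n, int scale)
{
    if (scale) {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * 2;
    } else {
        for (int i = 0; i < n; i++)
            out[i] = in[i];
    }
}
```

The price is the duplicated loop body, which fits the manpage's remark that -O3 trades code size for speed.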
I guess this also depends on the algorithm; it is pretty nice for Fibonacci numbers or for heavy reuse of the same data in memory, and things like that. Object-oriented code could possibly gain something from it.
-fpredictive-commoning
Perform predictive commoning optimization, i.e., reusing computations (especially memory loads and stores) performed in previous iterations of loops.
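As I understand -fpredictive-commoning, the idea can be sketched by hand like this. The names `pair_sums` and `pair_sums_commoned` are made up, and the second function is my own rewrite under the assumption that `out` and `a` do not overlap, not real compiler output.

```c
/* As written: each iteration loads a[i] and a[i + 1], although a[i]
   was already loaded as a[i + 1] in the previous iteration. */
void pair_sums(int *out, const int *a, int n)
{
    for (int i = 0; i + 1 < n; i++)
        out[i] = a[i] + a[i + 1];
}

/* Hand-transformed: the value loaded in the previous iteration is
   carried in a local, so each element is loaded only once.
   Assumes out and a do not overlap. */
void pair_sums_commoned(int *out, const int *a, int n)
{
    if (n < 2)
        return;
    int prev = a[0];
    for (int i = 0; i + 1 < n; i++) {
        int next = a[i + 1];
        out[i] = prev + next;
        prev = next;
    }
}
```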
I have no idea what a load elimination pass or spilling is.
-fgcse-after-reload
When -fgcse-after-reload is enabled, a redundant load elimination pass is performed after reload. The purpose of this pass is to cleanup redundant spilling.
This basically modifies code for parallelization and is not C++ specific.
-ftree-vectorize
Perform loop vectorization on trees.
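To be a bit more precise, the parallelism that -ftree-vectorize is after is SIMD (one instruction working on several array elements at once, e.g. via SSE), not threads. A minimal sketch of a loop that is a typical candidate follows; the function name `add_arrays` is made up, and whether GCC actually vectorizes it depends on the version and target flags.

```c
/* restrict promises the compiler that the three arrays do not overlap,
   which makes it easier to prove that vectorizing is safe. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, int n)
{
    /* Independent iterations over contiguous data: with SSE, four
       floats could be added per instruction instead of one. */
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

On reasonably recent GCC versions, compiling with -O3 -fopt-info-vec should report whether the loop was vectorized.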
This sounds interesting with regard to C++, but I don't know if I understand it correctly: let A and B be some classes; then A could call some ("externally visible", aka public?) methods b() of B, so the compiler clones these methods b() from B (into A? Or what?). It sounds like the instantiation of A includes B's code in this case. If B is a static class, then we would gain A.b() and would not need to call B.b(), if my interpretation is right.
-fipa-cp-clone
Perform function cloning to make interprocedural constant propagation stronger. When enabled, interprocedural constant propagation will perform function cloning when externally visible function can be called with constant arguments. Because this optimization can create multiple copies of functions, it may significantly increase code size (see --param ipcp-unit-growth=value)
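My reading of the -fipa-cp-clone text is that it is about plain functions rather than classes, and that "externally visible" refers to linkage (a non-static function that other translation units may call), not to C++ `public`. Because such a function cannot simply be rewritten in place, the compiler keeps the general version and adds a specialized copy for calls with a constant argument. A hand-written sketch with made-up names (`sum_stride`, `sum_stride_step1`, `total`):

```c
/* General, externally visible function: `step` is a run-time parameter
   and is assumed to be greater than zero. */
int sum_stride(const int *v, int n, int step)
{
    int sum = 0;
    for (int i = 0; i < n; i += step)
        sum += v[i];
    return sum;
}

/* What a clone specialized for step == 1 might look like: the constant
   is propagated into the body, giving a simpler loop. */
int sum_stride_step1(const int *v, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* A caller that passes a constant: a call like this is the kind the
   compiler could redirect to the specialized clone. */
int total(const int *v, int n)
{
    return sum_stride(v, n, 1);
}
```

Both copies stay in the binary, which is why the manpage warns about code size.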
I fail to see how a factor of 10 could be reached with this...? Maybe these fipa and commoning thingies work better than they sound. The performance gain seems to come from heavier memory usage.