Hmmm, fairly interesting benchmarks - but a bit predictable in the outcomes... Although I had heard that -Os only took a 10% hit...
I was hoping that you would have tested more esoteric stuff like the so-called "Graphite" optimisations ( -floop-interchange -ftree-loop-distribution -floop-strip-mine -floop-block ). I have always been too scared to try these on any applications on my Gentoo install (they are commented out in make.conf).
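For the record, this is roughly what those lines look like when uncommented in /etc/portage/make.conf (the -O2 -march=native part is just a typical baseline, not necessarily what the article used, and your gcc needs to be built with the graphite USE flag):

Code:
# sketch of Graphite flags in make.conf - assumes a Graphite-enabled gcc
CFLAGS="-O2 -march=native -floop-interchange -ftree-loop-distribution -floop-strip-mine -floop-block"
CXXFLAGS="${CFLAGS}"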
I have used LTO (link-time optimisation) with gcc 4.7.1/2 - while I keep a list of packages that fall back to no-lto, it's not unmanageable. Naturally it doesn't appear to make much difference in day-to-day usage.
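Roughly, enabling it globally looks something like the following; the package.env hint is just one way to handle the no-lto fallback list, not necessarily how I do it for every package:

Code:
# sketch: global LTO in make.conf with gcc 4.7.x
CFLAGS="-O2 -march=native -flto"
CXXFLAGS="${CFLAGS}"
LDFLAGS="${LDFLAGS} -flto"
# packages that break can be given non-LTO flags per-package via /etc/portage/package.env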
GCC vs. LLVM at -Os, and especially in compilation time, could be very, very interesting.
Could you please include binary size too, if you do more like this? Also, I'm guessing -Ofast doesn't work everywhere — hence no result in the PHP benchmark?
However, there is a solution to this problem: profile-guided optimization (PGO). Of all the tests I've done over the past two years, I can't recall one situation where -O3 with PGO did not outperform, or in the worst case match, any of the lower optimization levels.
Obviously this is because the profile data gives the compiler runtime information (hot/cold code paths, cache usage, loop iteration counts, etc.) from which to determine when and where to apply optimizations, which is a huge benefit compared to making 'educated guesses' at compile time.
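For anyone who hasn't tried it, the basic gcc workflow is just two compiles with a training run in between (the program and workload names here are only placeholders):

Code:
gcc -O3 -fprofile-generate -o app app.c
./app representative-workload        # training run; writes .gcda profile data next to the objects
gcc -O3 -fprofile-use -o app app.c   # recompile using the collected profile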
We are seeing work being done on using LTO (link-time optimization) when compiling the kernel, which could potentially yield slightly better performance. The main reason is that code tends to become quite a bit smaller with this optimization, which could reduce cache thrashing, but it also allows the compiler to view the entire source code as 'one entity', which likely opens up possibilities for optimizations like code reorganization/reuse and, of course, dead-code removal.
There are also kernel-specific LTO things that could be done, as the guy who posted his paper in the kernel LTO topic said. His thesis was on applying LTO to a 2.4 kernel, but it did more than just remove dead code: it also moved code that is executed only once into the .init section, saving runtime RAM, for example.
Note that there might be some rare instances in which the kernel does floating-point arithmetic, but the kernel developers are quite adamant about avoiding it, since using it in kernel context carries performance penalties. Furthermore, in those rare instances where floating-point arithmetic is used, -ffast-math could be a great way to break that code, possibly causing kernel panics.
By the way, if you want a faster computer, I suggest using ZFS. I am running Gentoo Linux on a ZFS rootfs on my desktop and it is virtually lag-free. ZFS has its own IO elevator, so there is no need for BFQ. Furthermore, I am using CFS with autogroups, and I have found no need for BFS.
If you use -march=native then the compiler will know things like the cache sizes. This means -O3 can make better decisions about the speed/size trade-off.
I remember (but can't find) an article saying that the large and clever caches in modern CPUs make -Os less useful. You can see exactly what -march=native expands to with:

Code:
gcc -march=native -E -v - </dev/null 2>&1 | grep cc1
Also, with -Ofast, did you check the correctness of the programs? It will do things like turn 'x/100' into 'x*0.01'; sometimes this is harmless, but some algorithms are very sensitive to it. ( http://gcc.godbolt.org/ is quite good for seeing what an optimisation will actually do )
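You can also check it locally; a quick sketch of the kind of transformation I mean (file names are just illustrative):

Code:
# compare the code gcc generates with and without -ffast-math for a constant division
cat > div.c <<'EOF'
double f(double x) { return x / 100.0; }
EOF
gcc -O2 -S -o plain.s div.c
gcc -O2 -ffast-math -S -o fast.s div.c
diff plain.s fast.s   # with -ffast-math the divide typically becomes a multiply by ~0.01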