Which processor benefits the most from tuned compilations?
Most distros compile their code for the generic 386 or x86-64 architecture. I've been wondering for some time to what extent a given processor benefits from a tuned compilation. For example: compile a program with the flag -mtune=i386, benchmark it, and then do the same with -mtune=athlon-xp and compare the results.
If processor A gains 30% performance on average with a tuned compilation while processor B gains 15% that would skew the price/performance ratio in favor of processor A, that is, if you're willing to recompile packages for your processor.
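A minimal sketch of the experiment described above, assuming gcc is installed; the file name bench.c and the choice of workload are just examples, and -mtune=generic/-mtune=native stand in for the two tuning targets (modern gcc no longer accepts the oldest -mtune values):

```shell
# Compile the same compute-heavy source twice, differing only in -mtune,
# then time both binaries.  -mtune changes scheduling, not the instruction
# set, so both binaries should produce identical output.
cat > bench.c <<'EOF'
#include <stdio.h>
int main(void) {
    double s = 0.0;
    for (long i = 1; i < 10000000; i++)
        s += 1.0 / (double)i;          /* partial harmonic sum */
    printf("%f\n", s);
    return 0;
}
EOF
gcc -O2 -mtune=generic bench.c -o bench_generic
gcc -O2 -mtune=native  bench.c -o bench_native
time ./bench_generic
time ./bench_native
```

Whether the tuned build actually wins depends on the CPU and the workload, which is exactly the question.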
Forget it; you'll never get back the time you spend compiling those apps if you could have gotten them precompiled. Maybe it's worth playing with for small things like mplayer/ffmpeg.
Yup, for the vast majority of applications recompiling yields little performance enhancement. A few can show noticeable increases, but those are few and far between (GMP and OpenSSL being among the very few). To see gains across the board you're better off tuning services, filesystems, etc. instead. It's not just a matter of a simple recompile either: to gain performance, the code itself has to accommodate the added enhancements as well.
I read somewhere that the i586 Pentiums benefited the most from compiler optimization targeting them. Can't remember why that was; something to do with the architecture.
Some famous programmer (Michael Abrash, I think) said that the original Pentium was the last mainstream CPU for which it was really worthwhile to hand-optimize assembly language code, because the new architectures are extremely complex and their low-level behavior is not publicly documented in the level of detail needed for clever optimization. IIRC, Pentium had a really barebones kind of superscalar pipelining, such that certain pairs of instructions could effectively execute in parallel, and so benefited a lot (compared to earlier and later architectures) from static reordering of instructions to maximize the proportion of appropriately paired instructions.
I remember last year or so when I had a P4 in my box, I benchmarked some of my own code (math and crypto stuff) and found that the tuned code was actually slower than generic 386 code ;-) I have an athlon xp now and will try a few benchmarks soon.
As another anecdote: I switched from Ubuntu to Arch Linux when I found that the Ubuntu package of Audacity, a sound editor, was horrendously slow. I couldn't work with it at all, so I wondered what on earth was going on and compiled it myself. Now it was fast and snappy, although I didn't modify any compiler flags for the compilation. Then I looked at how the package from the Ubuntu repositories was compiled and couldn't find any difference, which still confuses me. So I thought I'd try Arch, which provides packages compiled for the 686 architecture, to see if I could get another boost in overall performance, and things certainly did improve a little; most importantly, Firefox was a tad snappier. For an aging processor such as the Athlon XP such boosts are probably more noticeable than for newer CPUs.
I don't think the comparison of time spent compiling vs. the speedup gained is valid (remember Gentoo?). It may matter for servers, but home-user PCs idle most of the time anyway. Recompilation is certainly always an option for me whenever I feel my system's response time is too slow.
There are two main things that are different between various x86 processors and that will affect speed:
- Instruction set: you want to use SSE2 and subsequent incarnations of SSE for some types of code (multimedia for instance); compiling for i686 won't help you here; FFmpeg is the typical example.
- Micro-architecture: for instance, as already mentioned, the i586 was an in-order, statically scheduled processor with complex pairing rules; current CPUs (except the Atom) are out of order and so are less dependent on good instruction scheduling.
The first point is probably the one that will make the biggest difference.
I've read a few things (one here) saying that the Athlon XP can use its 3DNow! and SSE units at the same time, and IIRC this trick is used in FFmpeg.
I tried the -funroll-loops thing once just out of curiosity, and the only visible difference was more segfaults. "-O2 -march=native" is good enough for self-compiled things.
I remember from a student project that the P4's performance varied a lot more depending on which target you compiled for. The Athlon-class cores, the P6 family (PPro, P2, P3) and the Core 2s all showed much less variance in performance.
But the point is that I would stay away from CPUs whose performance is very compiler-dependent.
Apparently this was the big failure of the Itanium: it was extremely sensitive to compilers...
Also, as a Gentoo user, I know that the best generic compiler flags are simply -O2 -march=native (and disabling debug if you don't need it). Aggressive optimisations often end up giving worse performance, and benchmarks don't tell the story of latency, only throughput.
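For reference, those conservative defaults might look like this in a Gentoo make.conf; the file path is the standard one, but the MAKEOPTS value is just an example to adjust for your core count:

```shell
# /etc/portage/make.conf (fragment)
CFLAGS="-O2 -march=native -pipe"
CXXFLAGS="${CFLAGS}"
MAKEOPTS="-j4"   # roughly the number of CPU cores
```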
I also use -march=my-native-arch -O2 (and -pipe); on VIA CPU-based systems I use -Os. Earlier I told the compiler explicitly to use -mmmx -msse -m3dnow etc., since older compilers didn't enable those automatically. I think comparing binaries built for generic x86 against -march-tuned ones is not easy. It depends a lot on the compiler used; e.g. Intel's own compilers tend to generate the fastest code on their own CPUs (oh, what news), but that code also tends to be huge.
If we all go by gcc, which should be most people's default, it probably depends on whether the binary distribution is compiled for i586, i686 or just plain i386-compatible. I'd guess a works-everywhere i386 build will be noticeably slower than -march=your-cpu-type, but compared with an i686 build... well.
There was an article in a recent issue of the German LinuxUser comparing Intel's i7 (is it called i7 or Core i7 something?) and AMD's Phenom II. There were interesting results concerning 32-bit vs. 64-bit machine code: AMD gained a lot more in certain applications. Of course, averaged over all your system's packages the gain won't be that much (AMD also said as much at the German Chemnitzer Linux Days 2 or 3 years ago), but certain things will get a nice speedup, while a very few might even drop below x86_32 performance (I think it was the rar packer).
(FYI, the AMD system won that test. Bwahaha, look at my signature. Actually the Intel was, as expected, the winner on brute performance, but it sucked up more power and... tada! it cost about $1000 while the Phenom II cost about $180, so make up your own mind about the two. Besides, there's also a good deal of chipset and RAM performance difference to factor in, which will influence the measurements.)
Since I go with Gentoo I always use -march=something and thus won't have comparison but on Gentoo you also have the nice USE flags and so on.
But I was told by a guest at the Chemnitzer Linux weekend that he was really surprised how fast my KDE (3.5.8 back then) started up, and that was on a lousy VIA C3-2 at 1200 MHz with 512 MB RAM (DDR1, IIRC). So it had probably helped.
Besides all that, I want to warn you about too-aggressive optimizations; there are some packages in Gentoo that have custom flags disabled by default for good reason. Furthermore, wine for example didn't like being compiled with -Os, so I switch to -O2 for that package on my VIA systems.