Optimized Binaries Provide Great Benefits For Intel Haswell
Phoronix: Optimized Binaries Provide Great Benefits For Intel Haswell
Utilizing the core-avx2 CPU optimizations offered by the GCC 4.8 compiler can provide real benefits for the Intel Core i7 4770K processor and other new "Haswell" CPUs. For some computational workloads, the new Haswell instruction set extensions can offer tremendous speed-ups compared to what's offered by the previous-generation Ivy Bridge CPUs.
I would have found it much more useful to see a comparison between the settings commonly used in binary packages (typically nothing beyond SSE2 on 64-bit binaries) and a smaller set of -march targets. Perhaps nocona, corei7-avx and core-avx2, plus some -O2 vs -O3. The current benchmarks don't reflect the real world; they only show what the compiler can do with the new instructions, and you won't really find -march=nocona binaries out in the wild. Adding the default -march setting used in Fedora would have been nice.
Optimizations for non-full fat chips?
I've got a SB era laptop with a Pentium B940--so no AVX--as well as an IVB era Pentium G2020 desktop--also no AVX nor AVX2.
So, my question is: are any of the GCC optimizations relevant to my machines?
The flags from /proc/cpuinfo for the B940 are: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer xsave lahf_lm arat epb xsaveopt pln pts dtherm
And for the G2020: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
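A flag dump like the ones above can be checked mechanically rather than by eye. The following is a minimal sketch, not a definitive tool: the names `SIMD_TOKENS`, `simd_support`, `read_cpuinfo_flags` and `B940_EXCERPT` are made up for illustration, and the excerpt string is an abbreviated copy of the B940 flags quoted above.

```python
# Map readable extension names to the tokens Linux prints in /proc/cpuinfo.
SIMD_TOKENS = {
    "SSE2": "sse2",
    "SSE3": "pni",      # /proc/cpuinfo reports SSE3 as "pni"
    "SSSE3": "ssse3",
    "SSE4.1": "sse4_1",
    "SSE4.2": "sse4_2",
    "AVX": "avx",
    "AVX2": "avx2",
}


def simd_support(flags_line):
    """Return {extension name: bool} for a whitespace-separated flags string."""
    present = set(flags_line.split())
    return {name: token in present for name, token in SIMD_TOKENS.items()}


def read_cpuinfo_flags(path="/proc/cpuinfo"):
    """Return the flags of the first CPU listed in /proc/cpuinfo (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""


# Abbreviated excerpt of the Pentium B940 flags quoted in the post (assumption:
# only the SIMD-related tokens and a few neighbours are kept).
B940_EXCERPT = "mmx fxsr sse sse2 ss ht pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt xsave"

if __name__ == "__main__":
    for name, ok in simd_support(B940_EXCERPT).items():
        print(f"{name:7s} {'yes' if ok else 'no'}")
```

On a live Linux machine, `simd_support(read_cpuinfo_flags())` would give the same report for the local CPU. For the B940 excerpt it reports everything up to SSE4.2 as present and AVX/AVX2 as absent, which matches the poster's description.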
Originally Posted by willmore
SSSE3 and SSE4 might be helpful, although software that uses them will usually detect their presence at runtime. "-O3", "-flto" and profile-guided optimisation usually yield the biggest gains, but they all have stability issues (and PGO needs user intervention in addition). LTO is probably the most stable of these, in that you can compile an entire system and should only need to disable it for 10-20 packages out of hundreds. If you're going to compile something I'd start with:
-march=native -O2 -pipe
Then add heavier options until something breaks, benchmarking performance each time.
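The "start conservative, then add heavier options and benchmark each step" approach described here could be scripted roughly as follows. This is a sketch under assumptions: `flag_ladder`, `build_and_time` and `climb` are hypothetical helper names, the benchmark is assumed to be a single C source file that exits with status 0 on success, and PGO is left out because it needs a manual training run.

```python
import os
import subprocess
import tempfile
import time

# Conservative starting point suggested in the post.
BASE_FLAGS = ["-march=native", "-O2", "-pipe"]
# Heavier options to add one step at a time. GCC honours the last -O level
# on the command line, so appending -O3 overrides the earlier -O2.
EXTRA_STEPS = [["-O3"], ["-flto"]]


def flag_ladder(base=BASE_FLAGS, steps=EXTRA_STEPS):
    """Return cumulative flag sets: base, base + step 1, base + steps 1..2, ..."""
    ladder = [list(base)]
    for step in steps:
        ladder.append(ladder[-1] + step)
    return ladder


def build_and_time(source, flags, compiler="gcc"):
    """Compile `source` with `flags`, run it once, return seconds or None on failure."""
    with tempfile.TemporaryDirectory() as tmp:
        exe = os.path.join(tmp, "bench")
        build = subprocess.run([compiler, *flags, source, "-o", exe])
        if build.returncode != 0:
            return None
        start = time.perf_counter()
        run = subprocess.run([exe])
        elapsed = time.perf_counter() - start
        return elapsed if run.returncode == 0 else None


def climb(source):
    """Walk up the ladder, stopping at the first flag set that fails to build or run."""
    results = []
    for flags in flag_ladder():
        seconds = build_and_time(source, flags)
        if seconds is None:
            break
        results.append((flags, seconds))
    return results
```

Calling `climb("bench.c")` (with `bench.c` standing in for whatever workload you care about) returns a list of `(flags, seconds)` pairs for every step that built and ran cleanly. A single timed run is noisy, so in practice you would repeat each measurement several times before trusting the difference.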
If you're not using 64-bit then you should, as GCC defaults to -mfpmath=sse there, which should yield some gains for floating-point math on modern hardware (and 64-bit might bring some additional gains as well). Modern CPUs treat x87 as a legacy path and focus their floating-point resources on SSE, so code built without this option will be hampered even more.
Interesting. Doesn't this really point toward source-based distros? I never thought they would make a big difference, but it feels like if you could recompile select bits of your Ubuntu machine, particularly on Haswell, you'd get much better performance. Still, there is a lot of value in using pre-compiled packages.
I never understood why Ubuntu, with its focus on simplicity, hasn't offered an option in the package manager to right-click on a package and recompile it for your processor.
I wonder if a kernel built with custom CFLAGS would also effectively change performance in other tests.
Originally Posted by mendieta
It does, but most packages won't be that affected. Things like scientific benchmarks, image processing, matrix multiplication, etc. see huge speedups. Your average app probably won't see any at all.