Thread: Optimized Binaries Provide Great Benefits For Intel Haswell

  1. #1
    Join Date
    Jan 2007
    Posts
    13,394

    Optimized Binaries Provide Great Benefits For Intel Haswell

    Phoronix: Optimized Binaries Provide Great Benefits For Intel Haswell

    Utilizing the core-avx2 CPU optimizations offered by the GCC 4.8 compiler can provide real benefits for the Intel Core i7 4770K processor and other new "Haswell" CPUs. For some computational workloads, the new Haswell instruction set extensions can offer tremendous speed-ups compared to what's offered by the previous-generation Ivy Bridge CPUs.

    http://www.phoronix.com/vr.php?view=18788

  2. #2
    Join Date
    Aug 2010
    Posts
    6

    I would have found it much more useful to compare the settings commonly used in binary packages (typically just up to SSE2 enabled on 64-bit binaries) against a smaller set of targets. Perhaps nocona, corei7-avx and core-avx2, and some -O2 vs -O3. The current benchmarks don't reflect the real world, only what the compiler can do with the new instructions; you won't really find -march=nocona binaries out in the wild. Adding the default -march setting used in Fedora would have been nice.

  3. #3
    Join Date
    Jan 2012
    Posts
    39

    Optimizations for non-full fat chips?

    I've got a SB era laptop with a Pentium B940--so no AVX--as well as an IVB era Pentium G2020 desktop--also no AVX nor AVX2.

    So, my question is, are any of the gcc optimizations relevant to my machines?

    The flags from /proc/cpuinfo for the B940 are: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer xsave lahf_lm arat epb xsaveopt pln pts dtherm

    And for the G2020: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms

  4. #4

    Quote: Originally posted by willmore
    I've got a SB era laptop with a Pentium B940--so no AVX--as well as an IVB era Pentium G2020 desktop--also no AVX nor AVX2.

    So, my question is, are any of the gcc optimizations relevant to my machines?

    The flags from /proc/cpuinfo for the B940 are: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer xsave lahf_lm arat epb xsaveopt pln pts dtherm

    And for the G2020: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
    SSSE3 and SSE4 might be helpful, although software that uses them will usually detect their presence at run time. "-O3", "-flto" and profile-guided optimisation usually yield the biggest gains, but they all have stability issues (and PGO also needs user intervention). LTO is probably the most stable of these, in that you can compile an entire system and should only need to disable it for 10-20 packages out of hundreds. If you're going to compile something I'd start with:

    Code:
    -march=native -O2 -pipe
    Then add heavier options until something breaks whilst benchmarking its performance each time.
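
    The "add heavier options one at a time" approach can be sketched as a small script. This is only an illustration under stated assumptions: bench.c and ./bench are hypothetical placeholders for whatever you are actually building, and the order of the extra steps (-O3, then -flto) is just one reasonable choice. PGO is left out because it needs a separate profile-collection run.

```python
import shlex
import subprocess
import time

# Starting point taken from the post above.
BASE_FLAGS = ["-march=native", "-O2", "-pipe"]
# Heavier options mentioned in the thread; the order is an assumption.
# When several -O levels are given, GCC uses the last one, so "-O3" wins.
EXTRA_STEPS = [["-O3"], ["-flto"]]


def flag_ladder(base, steps):
    """Return cumulative flag lists: base, base+step1, base+step1+step2, ..."""
    ladder = [list(base)]
    for step in steps:
        ladder.append(ladder[-1] + list(step))
    return ladder


def build_and_time(flags, source="bench.c", binary="./bench"):
    """Compile a (hypothetical) benchmark with gcc, run it, return seconds.

    Raises CalledProcessError if the build or the run fails.
    """
    subprocess.run(["gcc", *flags, source, "-o", binary], check=True)
    start = time.perf_counter()
    subprocess.run([binary], check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    for flags in flag_ladder(BASE_FLAGS, EXTRA_STEPS):
        try:
            print(shlex.join(flags), f"{build_and_time(flags):.3f}s")
        except (subprocess.CalledProcessError, FileNotFoundError) as exc:
            # Stop at the first flag set that breaks the build or the run.
            print(shlex.join(flags), "failed:", exc)
            break
```

    A single timed run is noisy, so in practice you would repeat each measurement several times before deciding whether a flag is worth keeping.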

    If you're not using 64-bit then you should, as GCC defaults to -mfpmath=sse on x86-64, which should yield some gains on modern hardware for floating-point math (and 64-bit might bring some additional gains as well). All x86 CPUs still support x87 in hardware, but it is a legacy path, so 32-bit builds that rely on it can be hampered without this option.
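
    To see at a glance which of the extensions discussed in this thread a CPU reports, the flags line can be parsed directly. A minimal sketch, assuming x86 Linux where /proc/cpuinfo has a "flags" line; the function names and the list of extensions checked are my own choices for illustration.

```python
# Extensions discussed in the thread, by their /proc/cpuinfo flag names.
INTERESTING = ["sse2", "ssse3", "sse4_1", "sse4_2", "avx", "avx2"]


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def report(flags):
    """Map each interesting extension name to True/False."""
    return {name: name in flags for name in INTERESTING}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Not on Linux, or /proc is unavailable: report everything as absent.
        text = ""
    for name, present in report(cpu_flags(text)).items():
        print(f"{name:8s} {'yes' if present else 'no'}")
```

    For the two Pentiums quoted above this would show SSE2 through SSE4.2 as present and AVX/AVX2 as absent, which matches what -march=native would pick up on those machines.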

  5. #5
    Join Date
    Apr 2009
    Posts
    483

    Interesting. Doesn't this really point toward source-based distros? I never thought they would make a big difference, but it feels like if you could recompile select bits of your Ubuntu machine, particularly on Haswell, you'd get much better performance. But there is a lot of value in using pre-compiled packages.

    I never understood why Ubuntu, with its focus on simplicity, hasn't offered an option in the package manager to right-click on a package and recompile it for your processor.

    Cheers!

  6. #6
    Join Date
    Apr 2010
    Posts
    19

    I wonder if a kernel built with custom CFLAGS would also effectively change performance in other tests.

  7. #7
    Join Date
    Oct 2008
    Posts
    2,904

    Quote: Originally posted by mendieta
    Interesting. Doesn't this really point toward source-based distros? I never thought they would make a big difference, but it feels like if you could recompile select bits of your Ubuntu machine, particularly on Haswell, you'd get much better performance. But there is a lot of value in using pre-compiled packages.

    I never understood why Ubuntu, with its focus on simplicity, hasn't offered an option in the package manager to right-click on a package and recompile it for your processor.

    Cheers!
    It does, but most packages won't be affected much. Things like scientific benchmarks, image processing, matrix multiplication, etc. see huge speedups. Your average app probably won't see any at all.

  8. #8

    Quote: Originally posted by tkmorris
    I wonder if a kernel built with custom CFLAGS would also effectively change performance in other tests.
    You can squeeze a little bit of extra performance in some games by compiling Mesa with more aggressive CFLAGS (at least with R600g). Only a few % but still significant. The kernel is a bit more risky, though, and may not be stable if you go too aggressive.

  9. #9
    Join Date
    Nov 2007
    Posts
    1,353

    I think the kernel's Makefile overrides most CFLAGS anyway, so even if you set them they won't actually be used.

  10. #10
    Join Date
    Mar 2013
    Posts
    63

    Is it just me, or aren't there "great benefits" overall? In fact, I see several performance regressions.
