
Thread: Compiler Benchmarks Of GCC, LLVM-GCC, DragonEgg, Clang

  1. #31
    Join Date
    Oct 2009
    Posts
    845

    Default

    Quote Originally Posted by yotambien View Post
    Right. As I said, those benchmarks simply disagree with the idea that -O3 optimisations will always be at least as fast as -O2. To me, those numbers show that a) differences between -O2 and -O3 are minor; b) -O3 does not consistently produce the fastest binary. Of course, your experience is your experience, which is as valid as those tests.

    What sort of differences do you get with Mame and Handbrake (I guess you mean x264 in this case)?
I don't know if I've kept any benchmark numbers for -O2 vs -O3 (I'll have to check), since I'm more interested in the differences between -O3 with and without explicit (non -Ox) optimizations like LTO and PGO etc. But since I was going to do some benchmarking on Mame soon anyway, I'll make some -O2/-O3 comparisons on it later this evening and post the results here. Just making dinner as we speak

  2. #32
    Join Date
    Oct 2008
    Posts
    2,904

    Default

    I suspect those tests where -O2 outperformed -O3 aren't very realistic. They probably have very small code bases that happen to fit into the L1 cache with -O2 and get enlarged just enough to only fit in the L2 cache with -O3 optimizations, or something like that. I imagine that's mostly true only for microbenchmarks rather than real applications.

    Anyway, I think Michael isn't actually setting anything at all. If I remember correctly from the last compiler benchmarks he did, he's just running make without changing any of the default compiler settings from upstream.

  3. #33
    Join Date
    Oct 2008
    Posts
    2,904

    Default

    I think the compilers should be bootstrapped for the compile-time benchmarks. It's not very realistic to compile everything with the GCC 4.4 system compiler; on a real system it would be a self-built version that might (or might not) be able to compile programs faster.

  4. #34
    Join Date
    Oct 2009
    Posts
    845

    Default

    Ok here are the results:

    Test system: GCC 4.5.1, Arch Linux 2.6.35 64bit, Core i5
    Program: Mame 1.40
    mame commandline options: -noautoframeskip -frameskip 0 -skip_gameinfo -effect none -nowaitvsync -nothrottle -nosleep -window -mt -str 60

    -O2 -march=native -mfpmath=sse -msse4.2 -ffast-math
    cyber commando 209.14%
    cyber sled 123.52%
    radikal bikers 169.88%
    star gladiator 396.43%
    virtua fighter kids 185.24%

    -O3 -march=native -mfpmath=sse -msse4.2 -ffast-math
    cyber commando 213.44%
    cyber sled 124.71%
    radikal bikers 172.49%
    star gladiator 384.40%
    virtua fighter kids 187.20%

    Same as above (-O3 etc.) but with PGO, which automatically enables -fbranch-probabilities, -fvpt, -funroll-loops, -fpeel-loops, -ftracer.
    cyber commando 218.23%
    cyber sled 151.83%
    radikal bikers 186.45%
    star gladiator 406.21%
    virtua fighter kids 221.93%

    As much as I hate to admit it, your (yotambien's) comment does have some credibility given these results: even though -O2 only won in one test (so it could be an anomaly), that was the test with the biggest difference between -O2 and -O3.

    Other than that, PGO (profile-guided optimization) shows that it can increase performance very nicely; I hope LLVM gets this optimization soon as well. Next time I do a Mame benchmark I will do a PGO test with -O2 as well to see what the results are (particularly for Star Gladiator). I will also use a larger test case, which may show other instances where -O2 beats -O3.
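    For anyone wanting to try this themselves, GCC's two-pass PGO workflow can be sketched on a toy program like this (the file name and loop are made-up examples, not from the Mame build; the -fprofile-generate/-fprofile-use flags are standard GCC):

    ```shell
    # Toy workload with a branchy loop for the profiler to observe.
    cat > hot.c <<'EOF'
    #include <stdio.h>
    int main(void) {
        long sum = 0;
        for (long i = 0; i < 1000000; i++)
            sum += i % 7;
        printf("%ld\n", sum);
        return 0;
    }
    EOF
    gcc -O3 -fprofile-generate hot.c -o hot   # pass 1: instrumented build
    ./hot                                     # training run writes hot.gcda
    gcc -O3 -fprofile-use hot.c -o hot        # pass 2: rebuild using the profile
    ./hot
    ```

    The key point is that the training run should exercise the code the way real usage would, since the second compile optimizes for exactly the branches and loop counts the profile recorded.
    
    
    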

  5. #35
    Join Date
    Jan 2008
    Location
    Have a good day.
    Posts
    678

    Default

    That's interesting. What are the percentages? I mean, I suppose higher is better, but what are they? : D

    On the other hand, the PGO thingy looks like it actually makes a nice difference...

  6. #36
    Join Date
    Oct 2009
    Posts
    845

    Default

    Quote Originally Posted by yotambien View Post
    That's interesting. What are the percentages? I mean, I suppose higher is better, but what are they? : D

    On the other hand, the PGO thingy looks like it actually makes a nice difference...
    Thanks for not rubbing it in ;D The percentages are relative to the game running at full speed (100%), so in all these tests the emulated games run faster than they should (-nothrottle makes them run as fast as they can). And yes, PGO does make a difference in CPU-intensive programs. The standout here is Virtua Fighter Kids, which differs from the other games in that its CPU emulation is done through a dynamic recompiler, so it obviously benefits a lot from the things PGO improves, like better branch prediction, loop unrolling, less cache thrashing, etc.

  7. #37
    Join Date
    Jan 2008
    Posts
    772

    Default

    The Cyber Sled results are impressive; System 21 is a beast. Which Core i5 model is that, and how are you clocking it?

  8. #38
    Join Date
    Oct 2009
    Posts
    845

    Default

    Quote Originally Posted by Ex-Cyber View Post
    The Cyber Sled results are impressive; System 21 is a beast. Which Core i5 model is that, and how are you clocking it?
    Err... how do I check the model? cat /proc/cpuinfo only returns Core i5, no particular model as far as I can see. It's overclocked to 3.2 GHz (stock 2.67 GHz).
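    On Linux the full CPU model string is usually in the "model name" field of /proc/cpuinfo; taking only the first match avoids one copy per core (the sample output in the comment is just an illustration):

    ```shell
    # Print the CPU model string once, not once per logical core.
    grep -m1 'model name' /proc/cpuinfo
    # e.g. "model name : Intel(R) Core(TM) i5 CPU ..." (exact string varies)
    ```
    
    
    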

  9. #39
    Join Date
    Dec 2008
    Posts
    980

    Default

    Great article. IMHO more important than the benchmark results are the rather frequent cases where Clang/LLVM failed to compile something. There's a lot of talk out there about how Clang/LLVM is supposedly better than GCC. Rather than theoretical talk, this article brings some hard facts to the table: Clang/LLVM still fails miserably at what it's supposed to do, and where it does succeed, the resulting binaries are often slower than GCC-produced ones.

  10. #40
    Join Date
    Aug 2008
    Location
    Finland
    Posts
    1,567

    Default

    Quote Originally Posted by smitty3268 View Post
    I suspect those tests where O2 outperformed O3 aren't very realistic. They probably have very small code bases that happen to fit into L1 with O2 and get enlarged a bit to only fit in the L2 cache with O3 optimizations, or something like that. Something that i imagine is mostly only true for microbenchmarks rather than a real application.
    Depends. It seems certain optimizations in the Mesa drivers consisted of making structures smaller so they fit in caches. Caches are really significant in modern computing, which is why -Os is sometimes wicked fast even though it has even fewer speed-oriented optimizations than -O2.
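    The code-size side of that argument is easy to check: compare the .text segment that -O2, -O3 and -Os emit for the same source (the toy program below is a made-up example; real differences only show up on larger code bases):

    ```shell
    # Toy source to compile at several optimization levels.
    cat > demo.c <<'EOF'
    #include <stdio.h>
    static int square(int x) { return x * x; }
    int main(void) {
        int total = 0;
        for (int i = 0; i < 100; i++)
            total += square(i);
        printf("%d\n", total);
        return 0;
    }
    EOF
    for opt in -O2 -O3 -Os; do
        gcc $opt demo.c -o demo$opt
        printf '%-4s ' "$opt"
        # size prints "text data bss ..." on its second line; column 1 is .text
        size demo$opt | awk 'NR==2 {print $1 " bytes of text"}'
    done
    ```

    -Os typically reports the smallest .text, which is exactly what makes it attractive when the working set is cache-bound.
    
    
    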
