Thread: Statistical Significance In Benchmark Results

  1. #1
    Join Date
    Jan 2007
    Posts
    15,644

    Statistical Significance In Benchmark Results

    Phoronix: Statistical Significance In Benchmark Results

    For those of you following the developments of Phoronix Test Suite 2.2 (codenamed "Bardu"), some new benchmarking features were pushed into its Git tree this week. The latest Phoronix Test Suite 2.2 code now has better FreeBSD 8.0 compatibility and support for network proxies with network communication, but larger than that is new support for ensuring test results are statistically significant. When any test profile is set to run multiple times, the Phoronix Test Suite is now capable of computing the standard deviation between each of the test runs...

    http://www.phoronix.com/vr.php?view=NzU2MA
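
To make the standard-deviation idea concrete, here is a minimal Python sketch of summarizing repeated runs of one test. It is not the Phoronix Test Suite's own code: the function names, the sample timings, and the 5% threshold are all assumptions chosen for illustration.

```python
import statistics


def summarize_runs(results):
    """Return (mean, sample standard deviation, relative deviation in %)."""
    mean = statistics.mean(results)
    stdev = statistics.stdev(results)  # sample standard deviation (n - 1 divisor)
    relative = 100.0 * stdev / mean
    return mean, stdev, relative


def needs_more_runs(results, threshold_percent=5.0):
    """Hypothetical rule: ask for more runs when the spread is too large.

    The 5% default is an assumption, not a value taken from the test suite.
    """
    _, _, relative = summarize_runs(results)
    return relative > threshold_percent


if __name__ == "__main__":
    runs = [41.2, 40.8, 41.5]  # hypothetical timings in seconds
    mean, stdev, relative = summarize_runs(runs)
    print(f"mean={mean:.2f}s stdev={stdev:.2f}s ({relative:.1f}% of mean)")
    print("run again?", needs_more_runs(runs))
```

Expressing the deviation as a percentage of the mean lets the same threshold apply to tests with very different units and magnitudes.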

  2. #2
    Join Date
    Jun 2008
    Posts
    86

    Excellent addition. Will a feature also be added to put error bars on the graphs, so the final standard deviation is visible on the charts?

  3. #3

    Quote: Originally posted by chaos386
        Excellent addition. Will a feature also be added to put error bars on the graphs, so the final standard deviation is visible on the charts?

    You can view the spread right now (and have been able to for months) using "phoronix-test-suite analyze-all-runs <result>". If I build this into the Adobe SWF/Flash renderer, though, I may end up writing support so that the additional information is built into the graph itself and can be displayed on mouse-over or when clicking a button, such as when results are displayed on Phoronix.com.

  4. #4
    Join Date
    Sep 2009
    Posts
    1

    Quote:
        ...but larger than that is new support for ensuring test results are statistically significant. When any test profile is set to run multiple times, the Phoronix Test Suite is now capable of computing the standard deviation between each of the test runs...

    I just registered for these forums so I could say: "Thank you!". This can add some real meaning to the Phoronix test results, rather than only giving a feel of what might be going on.

    One thing to be careful of when increasing the number of runs is the difference between statistical significance and practical significance. Given enough runs, almost any real difference, however small, will become statistically significant - but a statistically significant difference of 0.5% is of no practical significance (there's usually not much point in scoring a "win" for an application or device by such a small amount, even if it is a real difference).
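
A rough Python sketch of that point: hold a 0.5% difference in means fixed and watch the test statistic grow with the number of runs. Welch's t statistic is used here as one common choice, not something the post or the test suite specifies. The means, standard deviations, and run counts are made-up numbers.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic for the difference between two sample means."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


def relative_difference_percent(mean_a, mean_b):
    """Size of the difference as a percentage of the second mean."""
    return 100.0 * abs(mean_a - mean_b) / mean_b


if __name__ == "__main__":
    # Hypothetical scores: B = 100.0, A = 100.5, both with standard deviation 2.0.
    for runs in (5, 50, 1000):
        t = welch_t(100.5, 2.0, runs, 100.0, 2.0, runs)
        effect = relative_difference_percent(100.5, 100.0)
        print(f"runs={runs:5d}  t={t:.2f}  difference={effect:.1f}%")
```

As a rule of thumb, |t| above roughly 2 is significant at the 95% level for large samples. The statistic crosses that line as the run count grows, while the difference itself stays at 0.5%, so the effect size needs to be reported alongside the significance.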

    Anyway, I'll say thanks again. Winner of best feature award for sure.

  5. #5
    Join Date
    Sep 2008
    Posts
    201

    Hi Michael.

    Have you considered adding some kind of ANOVA function to the PTS? Having a confidence interval (95% or something) on each graph would be very useful I think.

    For example in the BFS article, while you imply that BFS is faster for PHP compilation, I suspect that the difference is statistically insignificant, and BFS cannot really be said to be faster with any reasonable confidence.

    For an example of the sort of analysis I mean, see here.
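
As one possible reading of the confidence-interval suggestion, here is a small Python sketch that computes a 95% interval for the mean of a handful of runs and checks whether two intervals overlap. It is not part of PTS. The function names are hypothetical, and the overlap check is only a conservative heuristic, not a substitute for ANOVA or a proper two-sample test.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for 1..10 degrees of freedom.
T_CRIT_95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228]


def confidence_interval_95(samples):
    """Return (low, high) bounds of a 95% confidence interval for the mean."""
    n = len(samples)
    df = n - 1
    if df < 1:
        raise ValueError("need at least two runs")
    # Beyond the table, fall back to the normal value 1.96
    # (an approximation that is slightly too narrow for moderate n).
    t_crit = T_CRIT_95[df - 1] if df <= len(T_CRIT_95) else 1.96
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(n)
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    scheduler_a = [61.8, 62.4, 61.5]  # hypothetical compile times in seconds
    scheduler_b = [61.1, 62.0, 61.9]
    ci_a = confidence_interval_95(scheduler_a)
    ci_b = confidence_interval_95(scheduler_b)
    print(ci_a, ci_b, "overlap:", intervals_overlap(ci_a, ci_b))
```

With only three runs per configuration the critical value is 4.303, so the intervals are wide. That is why a small gap between two bars on a graph often cannot be called a win.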

  6. #6

    Thanks. Great addition.

  7. #7
    Join Date
    Sep 2009
    Posts
    415

    Thanks for the info!

    As you probably know I'm new to this forum, so please excuse me if the following has been covered or is out of context.

    What I want to know is what safeguards are in place to make sure that the latest Turbo Boost based processors are loaded to the point that thermal throttling is discovered. It is my position that tricking out a system for maximum performance is all well and good, but if those benchmark numbers don't translate into valid figures for normal implementations of a chip then you haven't done your readers much of a favor.

    So let's say your benchmark runs a series of video encoding tasks, which ought to load the processor across all cores. Initially, for a small file, this may not load the chip to the point that thermal throttling is noticeable. But what happens if we have something less than a high performance cooler and suboptimal thermal conditions, something that reflects most home based systems?

    I ask this because of the Intel based rebuttal to your earlier Lynnfield tests. I'm certain that the BIOS issue was real, after all this is a brand new product, but I have to wonder about the differences in the results, which really don't make sense. It makes me wonder if the processors might have been sitting under a huge air conditioner, as this would likely keep the cores running at higher clock rates.

    I bring this up because we really haven't had a processor quite like this on the market in the past. Thus it is hard to offer up a clear picture of what one can expect out of Lynnfield given non-optimal conditions. The sad thing is we are talking about a big difference in performance based on how well the chip can cool itself over time. So it would make sense to test a given processor with a variety of heat removal capabilities to see just how much of a regression we will see with those different heat sinks. A simple question might be: how long does Intel's stock cooler allow a Lynnfield to benefit from Turbo Boost over the course of a long video encoding, in a room free of air conditioning?

    As you can see I'm puzzled by what sort of performance a person making an average investment in Lynnfield would get. Even with Intel's tests, which in some cases I find bogus, it looks like an AMD chip works just as well.


    Dave
