[MPlayer-dev-eng] Analyzing benchmark results (was Re: Compile options)

Uoti Urpala uoti.urpala at pp1.inet.fi
Sat Sep 30 15:06:36 CEST 2006


On Mon, 2006-09-25 at 07:07 -0700, Trent Piepho wrote:
> On Sat, 23 Sep 2006, Diego Biurrun wrote:
> > On Sat, Sep 23, 2006 at 05:50:01PM +0300, Uoti Urpala wrote:
> > > On Sat, 2006-09-23 at 07:17 -0700, Trent Piepho wrote:
> > > > Mean and sample standard deviation for each optimization level, 10 samples per test.
> > >
> > > I think minimum would be a more appropriate value than mean, as the
> > > decoding should be mostly deterministic and larger times represent
> > > interference from other processes.
> >
> > I was about to say something similar.
> >
> > What's wrong with taking - say - the best of x runs?  The process is
> > supposed to be deterministic ...
> 
> If that were a good idea, there would be popular statistical tests based on
> it.  However, if there is any merit to the concept of comparing extrema, it
> is not something I was taught.  Quite the opposite, really: I was taught how
> to keep extrema from producing a result that isn't justified.

In this case there cannot be far outliers on the "too small" side unless
your testing method is broken. On the other hand, there most likely are
errors that are not independent between runs and that affect a different
fraction of runs in each case. Since the measured value itself is
supposed to be (almost) deterministic, using the minimum is more robust:
a single "good enough" sample is enough for the test to give the right
result, whereas with the mean you would need to ensure that most samples
carry no significant systematic bias.
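
As a rough illustration with made-up numbers: suppose each run
independently has a 50% chance of being completely free of interference.
Out of 10 runs, the probability that at least one run, and therefore the
minimum, hits the true decoding time is 1 - 0.5^10, about 99.9%. The
mean, by contrast, stays biased upward by the expected interference per
run no matter how many runs you add.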

> It looks like noif_dsp is the best, with a minimum of 12.703 vs 12.764.  Of
> course, we have no idea what the confidence level is for that value.
> That's one of the reasons why "best of x" isn't a valid method.  It just

"Not a valid method" in what sense? If you claim that for any testing
method it always gives worse results you're wrong. It's easy to set up
an example test case (distribution of right values and possible errors)
where you can mathematically prove that minimum gives better results
than mean; if you understand this stuff at all you should be able to do
that yourself.
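
For concreteness, here is a minimal sketch (Python, with made-up numbers:
a fixed 12.7 s decode time plus a one-sided delay hitting roughly half
the runs) comparing how far the minimum and the mean of 10 runs land
from the true value:

    # Toy model: the true decode time is fixed, interference only adds time.
    import random

    TRUE_TIME = 12.7   # hypothetical "real" decode time in seconds
    RUNS = 10          # samples per benchmark
    TRIALS = 10000     # how many benchmarks to simulate

    def one_benchmark():
        """Return RUNS timings; each run may pick up a one-sided delay."""
        samples = []
        for _ in range(RUNS):
            delay = 0.0 if random.random() < 0.5 else random.uniform(0.0, 0.3)
            samples.append(TRUE_TIME + delay)
        return samples

    min_err = mean_err = 0.0
    for _ in range(TRIALS):
        s = one_benchmark()
        min_err += abs(min(s) - TRUE_TIME)
        mean_err += abs(sum(s) / RUNS - TRUE_TIME)

    print("average error of minimum: %.5f s" % (min_err / TRIALS))
    print("average error of mean:    %.5f s" % (mean_err / TRIALS))

In this toy model the minimum's average error comes out orders of
magnitude smaller than the mean's. The particular numbers don't matter;
the point is that timing noise from other processes has exactly this
one-sided shape.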

> doesn't have the mathematical basis that real methods have.  The p-value and
> confidence interval from Student's t test aren't just something William
> Gosset made up; they are mathematical truths like the value of pi.

They're mathematical truths about something, but not about something
that is applicable to, or the best method for, everything that happens
to involve measurements with error.

> Now noif_dsp is best again with 12.628.  First we get one answer, then the
> other, and then the original again!  If I did another 10 runs, what would
> be the best then?  How does one qualify the result, that after X runs this
> is best, but after X+1 maybe it will be totally different?  It seems that
> the conclusion from Student's t test, that there is no significant
> difference, is the right one.  Looking at the minimum is trying to find an
> answer that isn't there.

This argument is nonsense. If the values are close, then which one has
the smaller mean can keep changing just as easily. Whether the
difference is significant is a separate question. You seem to be setting
up an idiotic strawman, as if any difference in the minimums had to be
interpreted as a significant difference.



