[MPlayer-dev-eng] Re: Compile options
Trent Piepho
xyzzy at speakeasy.org
Mon Sep 25 16:07:14 CEST 2006
On Sat, 23 Sep 2006, Diego Biurrun wrote:
> On Sat, Sep 23, 2006 at 05:50:01PM +0300, Uoti Urpala wrote:
> > On Sat, 2006-09-23 at 07:17 -0700, Trent Piepho wrote:
> > > Mean and sample standard deviation for each optimization level, 10 samples per test.
> >
> > I think minimum would be a more appropriate value than mean, as the
> > decoding should be mostly deterministic and larger times represent
> > interference from other processes.
>
> I was about to say something similar.
>
> What's wrong with taking - say - the best of x runs? The process is
> supposed to be deterministic ...
If that were a good idea, there would be popular statistical tests based on
it. If there is any merit to the concept of comparing extrema, it is not
something I was taught. Quite the opposite, really: I was taught how to keep
extrema from producing a result that isn't justified by the data.
As an example of why best of x runs is bad, consider the actual data from
the mpeg4 test of -O4 with -fno-inline-functions from dsputil_mmx.c vs
-fno-inline-functions for everything.
Using established statistical techniques, I concluded that there is no
significant difference that can be detected with the available data.
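As a sanity check on that conclusion, here is a minimal sketch of the same kind of test (Welch's t statistic, computed by hand in Python) over the ten timings quoted in this message. The list names and the script are mine, not the original benchmark harness. With roughly 18 degrees of freedom, |t| would need to exceed about 2.10 to be significant at the 5% level; it doesn't come close.

```python
# Welch's t statistic for the two sets of ten timings quoted in this thread.
# Pure-Python illustration; not the script used to produce the original data.
from statistics import mean, variance
from math import sqrt

noif_dsp = [12.814, 12.750, 12.703, 12.755, 12.661, 12.764,
            12.631, 12.628, 12.685, 12.796]
noif_all = [12.821, 12.786, 12.764, 12.768, 12.633, 12.635,
            12.782, 12.633, 12.785, 12.798]

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

t = welch_t(noif_dsp, noif_all)
print("t = %.3f" % t)  # |t| well below ~2.10: no significant difference
```

Note that this only computes the test statistic; turning it into a p-value needs the t distribution's CDF (e.g. from a statistics library), but comparing |t| against the tabulated critical value already settles the question here.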
If we look at the first 3 data points:
O4.noif_dsp 12.814 12.750 12.703
O4.noif_all 12.821 12.786 12.764
It looks like noif_dsp is the best, with a minimum of 12.703 vs 12.764. Of
course, we have no idea what the confidence level is for that value.
That's one of the reasons why "best of x" isn't a valid method: it doesn't
have the mathematical basis that real methods have. The p-value and
confidence interval from Student's t test aren't just something William
Gosset made up; they are mathematical truths, like the value of pi.
Anyway, we've already decided that noif_dsp is the best, but why not do
some more runs?
O4.noif_dsp 12.755 12.661 12.764
O4.noif_all 12.768 12.633 12.635
Look at that, now noif_all has the best run with 12.633! Before we
concluded noif_dsp was the best, now we conclude the opposite. How about
looking at the rest of the 10 runs?
O4.noif_dsp 12.631 12.628 12.685 12.796
O4.noif_all 12.782 12.633 12.785 12.798
Now noif_dsp is best again with 12.628. First we get one answer, then the
other, and then the original again! If I did another 10 runs, which would
be best then? How does one qualify a result that is "best" after X runs but
may be completely different after X+1? It seems that the conclusion from
Student's t test, that there is no significant difference, is the right one.
Looking at the minimum is trying to find an answer that isn't there.
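The flip-flopping above is easy to reproduce: track which configuration holds the overall minimum as each new run arrives. This uses the ten timings quoted in this message; the variable names are mine.

```python
# Illustration of why "best of x runs" is unstable: the configuration that
# holds the overall minimum changes as more runs accumulate.
noif_dsp = [12.814, 12.750, 12.703, 12.755, 12.661, 12.764,
            12.631, 12.628, 12.685, 12.796]
noif_all = [12.821, 12.786, 12.764, 12.768, 12.633, 12.635,
            12.782, 12.633, 12.785, 12.798]

winners = []
for n in range(1, 11):
    # Which configuration has the lower minimum over the first n runs?
    winners.append("dsp" if min(noif_dsp[:n]) < min(noif_all[:n]) else "all")

print(winners)
# "dsp" leads after 3 runs, "all" after 6, "dsp" again after all 10 --
# exactly the sequence of contradictory conclusions described above.
```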