[MPlayer-dev-eng] Analyzing benchmark results (was Re: Compile options)

Sun Oct 1 01:06:39 CEST 2006

On Sat, 30 Sep 2006, Uoti Urpala wrote:
> On Mon, 2006-09-25 at 07:07 -0700, Trent Piepho wrote:
> > On Sat, 23 Sep 2006, Diego Biurrun wrote:
> > > On Sat, Sep 23, 2006 at 05:50:01PM +0300, Uoti Urpala wrote:
> > > > On Sat, 2006-09-23 at 07:17 -0700, Trent Piepho wrote:
> > > > > Mean and sample standard deviation for each optimization level, 10 samples per test.
> > > >
> > > > I think minimum would be a more appropriate value than mean, as the
> > > > decoding should be mostly deterministic and larger times represent
> > > > interference from other processes.
> > >
> > > I was about to say something similar.
> > >
> > > What's wrong with taking - say - the best of x runs?  The process is
> > > supposed to be deterministic ...
> >
> > If that was a good idea, there would be popular statistical tests based on
> > it.  However, if there is any merit to the concept of comparing extrema, it
> > is not something I was taught.  Quite the opposite really; how to keep
> > extrema from producing a result that isn't justified.
>
> In this case there cannot be far outlier values on the "too small" side
> unless your testing method is broken. On the other hand there most

Think of the measured values as being the true value plus some random error
term.  This random error may be always positive and not have a mean of
zero.  That doesn't matter.  Some tests require that the resulting data
have a normal distribution (Student's t), some don't (Wilcoxon rank-sum).
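
As a rough sketch of the distribution-free option, the rank-sum comparison can be worked out by hand. Everything here is illustrative: the numbers are the vc column for the O2 and O4.noif_all runs from the attached data, the function name mann_whitney_u is made up for this example, and the p-value uses the large-sample normal approximation without a continuity correction, which is only approximate with 10 samples per group.

```python
import math

def mann_whitney_u(a, b):
    """Count pairs (x, y), x from a and y from b, with x > y; ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# vc times copied from the attached table
o2 = [133.901, 131.908, 131.415, 131.547, 131.258,
      133.090, 133.221, 133.839, 131.066, 133.672]
o4_noif_all = [131.068, 131.153, 131.349, 131.407, 130.997,
               130.176, 130.092, 129.928, 130.951, 131.104]

u = mann_whitney_u(o2, o4_noif_all)
n1, n2 = len(o2), len(o4_noif_all)

# Under the null hypothesis U has mean n1*n2/2 and the variance below
# (no tie correction; assumption: the normal approximation is good enough).
mu = n1 * n2 / 2
sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u - mu) / sigma
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

print(f"U = {u}, z = {z:.2f}, approximate two-sided p = {p:.4f}")
```

Nothing in this calculation assumes the error term is symmetric or has zero mean; it only uses the ordering of the samples.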

> likely are errors which are not independent between tests and affect a
> different portion of tests for each case. Since the tested value itself

Why would the errors not be independent between tests?  And if they weren't,
there would be some auto-correlation between tests, would there not?  I'll
attach the H.264 data; tell me where the auto-correlation is.  I didn't
see any.
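
One simple way to look for such a dependence is the lag-1 sample autocorrelation of the run times taken in run order. The sketch below assumes the row order in the attached table is the order in which the runs were made, and the helper name autocorr is made up for this example. The 2/sqrt(n) band is only a rough rule of thumb for "indistinguishable from zero", not a formal test.

```python
import math
from statistics import mean

def autocorr(xs, lag=1):
    """Sample autocorrelation of the sequence xs at the given lag."""
    m = mean(xs)
    denom = sum((x - m) ** 2 for x in xs)
    num = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(len(xs) - lag))
    return num / denom

# vc times for the ten O4.noif_dsp runs, in table order
noif_dsp = [130.287, 133.132, 133.144, 128.599, 132.835,
            128.641, 128.922, 130.248, 133.366, 130.704]

r1 = autocorr(noif_dsp)
band = 2 / math.sqrt(len(noif_dsp))  # rough 95% band for white noise

print(f"lag-1 autocorrelation = {r1:.3f}, rough band = +/-{band:.3f}")
```

The same function can be applied to each of the other groups in the attachment.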

> is supposed to be (almost) deterministic, using minimum is more robust:

There are plenty of other testing applications where the true value of the
variable being measured doesn't change between samples, and any errors are
the result of measurement error.  Benchmarking a computer program isn't some
kind of unique scenario that requires one to throw out the previous century
of statistical thought.  Can you find any literature that suggests that
using minimum or maximum is a better test?

> getting one single "good enough" sample means the test gives the right

How do you decide when you have the magic sample that you will use while
throwing all other data out?

> > It looks like noif_dsp is the best, with a minimum of 12.703 vs 12.764.  Of
> > course, we have no idea what the confidence level is for that value.
> > That's one of the reasons why "best of x" isn't a valid method.  It just
>
> "Not a valid method" in what sense? If you claim that for any testing
> method it always gives worse results you're wrong. It's easy to set up
> an example test case (distribution of right values and possible errors)
> where you can mathematically prove that minimum gives better results
> than mean; if you understand this stuff at all you should be able to do
> that yourself.

Why don't you?  What are the instances when "best of x", where x is some
number you came up with beforehand, is mathematically better than
established tests such as Student's t test or Wilcoxon rank-sum?  It seems
that if "best of x" was a good idea, at least for some real scenario, there
would be papers published on it and references to the test.  I didn't find
any.  Do you know of any?

> > Now noif_dsp is best again with 12.628.  First we get one answer, then the
> > other, and then the original again!  If I did another 10 runs, what would
> > be the best then?  How does one qualify the result, that after X runs this
> > is best, but after X+1 maybe it will be totally different?  It seems that
> > the conclusion from Student's t test, that there is no significant
> > difference, is the right one.  Looking at the minimum is trying to find an
> > answer that isn't there.
>
> This argument is nonsense. If the values are close then which one has
> the smaller mean can also keep changing. Whether the difference is
> significant is another question. You seem to set up an idiotic strawman

It isn't another question, it is _the_ question.  Does the data support a
conclusion that there is a difference between two populations?  It's the
classic question of hypothesis testing.
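
For reference, the two-sample Student's t statistic is easy to compute directly. This is a minimal sketch, not the exact procedure used for the numbers quoted above: it uses the vc column for O4.noif_all and O4.noif_dsp from the attached H.264 data, the function name pooled_t is made up, and it assumes equal variances in both groups (the classic pooled form). The critical value 2.101 is the usual table entry for a two-sided 5% test with 18 degrees of freedom.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample Student's t statistic with pooled variance, plus its df."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled estimate of the common variance (equal-variance assumption).
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df

# vc times copied from the attached table
noif_all = [131.068, 131.153, 131.349, 131.407, 130.997,
            130.176, 130.092, 129.928, 130.951, 131.104]
noif_dsp = [130.287, 133.132, 133.144, 128.599, 132.835,
            128.641, 128.922, 130.248, 133.366, 130.704]

t, df = pooled_t(noif_all, noif_dsp)
T_CRIT_18 = 2.101  # two-sided 5% critical value for 18 df, from a t table

print(f"t = {t:.3f}, df = {df}")
if abs(t) > T_CRIT_18:
    print("difference is significant at the 5% level")
else:
    print("no significant difference at the 5% level")
```

If the equal-variance assumption looks doubtful, Welch's variant is the usual alternative; with SciPy installed that would be scipy.stats.ttest_ind(a, b, equal_var=False).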

> argument saying that any difference in the minimums would have to be
> interpreted as a significant difference.

So, how does one decide if the difference in minimums is significant?  You
take the minimum of the two samples and then what?
-------------- next part --------------
                    vc    vo   sys   user elapsed        type dspin restin Olevel
O2.1           133.901 0.028 0.786 134.51  135.50          O2 FALSE  FALSE      2
O2.2           131.908 0.030 0.780 132.61  133.19          O2 FALSE  FALSE      2
O2.3           131.415 0.029 0.782 132.15  132.30          O2 FALSE  FALSE      2
O2.4           131.547 0.029 0.780 132.29  132.43          O2 FALSE  FALSE      2
O2.5           131.258 0.029 0.777 131.98  132.53          O2 FALSE  FALSE      2
O2.6           133.090 0.028 0.784 133.81  133.97          O2 FALSE  FALSE      2
O2.7           133.221 0.029 0.779 133.94  134.10          O2 FALSE  FALSE      2
O2.8           133.839 0.029 0.787 134.57  134.73          O2 FALSE  FALSE      2
O2.9           131.066 0.028 0.779 131.78  132.34          O2 FALSE  FALSE      2
O2.10          133.672 0.029 0.782 134.42  134.95          O2 FALSE  FALSE      2
O2.if_dsp.1    138.413 0.027 0.783 139.03  139.56   O2.if_dsp  TRUE  FALSE      2
O2.if_dsp.2    138.481 0.028 0.778 139.20  139.36   O2.if_dsp  TRUE  FALSE      2
O2.if_dsp.3    138.633 0.030 0.784 139.35  139.52   O2.if_dsp  TRUE  FALSE      2
O2.if_dsp.4    139.899 0.029 0.791 140.62  140.79   O2.if_dsp  TRUE  FALSE      2
O2.if_dsp.5    140.553 0.030 0.793 141.27  141.44   O2.if_dsp  TRUE  FALSE      2
O2.if_dsp.6    137.300 0.028 0.772 138.00  138.57   O2.if_dsp  TRUE  FALSE      2
O2.if_dsp.7    139.757 0.029 0.788 140.49  140.65   O2.if_dsp  TRUE  FALSE      2
O2.if_dsp.8    137.870 0.029 0.780 138.60  139.15   O2.if_dsp  TRUE  FALSE      2
O2.if_dsp.9    138.707 0.029 0.779 139.42  139.59   O2.if_dsp  TRUE  FALSE      2
O2.if_dsp.10   138.163 0.029 0.779 138.86  139.04   O2.if_dsp  TRUE  FALSE      2
O2.if_all.1    139.398 0.030 0.780 140.00  140.57   O2.if_all  TRUE   TRUE      2
O2.if_all.2    139.740 0.028 0.771 140.42  140.61   O2.if_all  TRUE   TRUE      2
O2.if_all.3    140.140 0.028 0.775 140.84  141.01   O2.if_all  TRUE   TRUE      2
O2.if_all.4    138.198 0.027 0.770 138.89  139.06   O2.if_all  TRUE   TRUE      2
O2.if_all.5    137.892 0.028 0.770 138.61  139.16   O2.if_all  TRUE   TRUE      2
O2.if_all.6    142.364 0.030 0.789 143.04  143.26   O2.if_all  TRUE   TRUE      2
O2.if_all.7    137.828 0.026 0.760 138.54  139.08   O2.if_all  TRUE   TRUE      2
O2.if_all.8    138.519 0.026 0.763 139.21  139.78   O2.if_all  TRUE   TRUE      2
O2.if_all.9    137.292 0.026 0.764 137.98  138.55   O2.if_all  TRUE   TRUE      2
O2.if_all.10   137.285 0.026 0.761 138.00  138.54   O2.if_all  TRUE   TRUE      2
O4.noif_all.1  131.068 0.030 0.788 131.67  132.62 O4.noif_all FALSE  FALSE      4
O4.noif_all.2  131.153 0.028 0.776 131.87  132.03 O4.noif_all FALSE  FALSE      4
O4.noif_all.3  131.349 0.029 0.778 132.06  132.23 O4.noif_all FALSE  FALSE      4
O4.noif_all.4  131.407 0.029 0.776 132.14  132.28 O4.noif_all FALSE  FALSE      4
O4.noif_all.5  130.997 0.028 0.777 131.71  131.87 O4.noif_all FALSE  FALSE      4
O4.noif_all.6  130.176 0.029 0.781 130.89  131.06 O4.noif_all FALSE  FALSE      4
O4.noif_all.7  130.092 0.029 0.780 130.81  130.97 O4.noif_all FALSE  FALSE      4
O4.noif_all.8  129.928 0.029 0.781 130.60  131.20 O4.noif_all FALSE  FALSE      4
O4.noif_all.9  130.951 0.029 0.772 131.65  131.83 O4.noif_all FALSE  FALSE      4
O4.noif_all.10 131.104 0.030 0.783 131.80  131.99 O4.noif_all FALSE  FALSE      4
O4.noif_dsp.1  130.287 0.029 0.775 130.99  131.16 O4.noif_dsp FALSE   TRUE      4
O4.noif_dsp.2  133.132 0.030 0.789 133.86  134.02 O4.noif_dsp FALSE   TRUE      4
O4.noif_dsp.3  133.144 0.030 0.785 133.86  134.43 O4.noif_dsp FALSE   TRUE      4
O4.noif_dsp.4  128.599 0.027 0.753 129.29  129.85 O4.noif_dsp FALSE   TRUE      4
O4.noif_dsp.5  132.835 0.030 0.788 133.54  133.73 O4.noif_dsp FALSE   TRUE      4
O4.noif_dsp.6  128.641 0.027 0.748 129.32  129.49 O4.noif_dsp FALSE   TRUE      4
O4.noif_dsp.7  128.922 0.028 0.752 129.61  130.17 O4.noif_dsp FALSE   TRUE      4
O4.noif_dsp.8  130.248 0.029 0.777 130.96  131.52 O4.noif_dsp FALSE   TRUE      4
O4.noif_dsp.9  133.366 0.030 0.784 134.09  134.25 O4.noif_dsp FALSE   TRUE      4
O4.noif_dsp.10 130.704 0.030 0.778 131.42  131.98 O4.noif_dsp FALSE   TRUE      4
O4.1           138.681 0.028 0.784 139.31  139.86          O4  TRUE   TRUE      4
O4.2           136.098 0.028 0.779 136.80  136.98          O4  TRUE   TRUE      4
O4.3           136.768 0.029 0.777 137.47  137.64          O4  TRUE   TRUE      4
O4.4           136.055 0.027 0.777 136.77  136.93          O4  TRUE   TRUE      4
O4.5           136.328 0.028 0.777 137.03  137.60          O4  TRUE   TRUE      4
O4.6           138.466 0.028 0.785 139.15  139.75          O4  TRUE   TRUE      4
O4.7           135.978 0.028 0.777 136.69  137.25          O4  TRUE   TRUE      4
O4.8           138.732 0.028 0.781 139.44  139.61          O4  TRUE   TRUE      4
O4.9           136.461 0.027 0.777 137.16  137.73          O4  TRUE   TRUE      4
O4.10          137.947 0.028 0.783 138.65  138.83          O4  TRUE   TRUE      4