[MPlayer-dev-eng] Analyzing benchmark results (was Re: Compile options)

Uoti Urpala uoti.urpala at pp1.inet.fi
Sun Oct 1 05:23:52 CEST 2006


On Sat, 2006-09-23 at 07:17 -0700, Trent Piepho wrote:
> > > > > > Mean and sample standard deviation for each optimization level, 10 samples per test.

On Sat, Sep 23, 2006 at 05:50:01PM +0300, Uoti Urpala wrote:
> > > > > I think minimum would be a more appropriate value than mean, as the
> > > > > decoding should be mostly deterministic and larger times represent
> > > > > interference from other processes.

Note that my original remark was about the values shown as the speeds of
the optimization levels, rather than about estimating the accuracy of
the values.

On Sat, 30 Sep 2006, Uoti Urpala wrote:
> > In this case there cannot be far outlier values on the "too small" side
> > unless your testing method is broken. On the other hand there most

On Sat, 2006-09-30 at 16:06 -0700, Trent Piepho wrote:
> Think of the measured values as being the true value plus some random error
> term.  This random error may be always positive and not have a mean of
> zero.  That doesn't matter.  Some tests require that the resulting data
> have a normal distribution (Student's t), some don't (Wilcoxon rank-sum).

Just knowing whether one speed is likely higher isn't really enough; you
want to know what the speed value actually (approximately) is. So the
goal is normally not to run a test to determine whether you can say that
A is faster than B, but to produce estimates for the speeds of A and B.
And even if you are only trying to determine whether one speed is higher,
that still doesn't mean something like the rank-sum test would be best:
it may not require any particular distribution to avoid showing a
difference where none exists, but the converse doesn't hold - you can
get more accurate results by using the extra information you have about
the distribution.

> > likely are errors which are not independent between tests and affect a
> > different portion of tests for each case. Since the tested value itself
> 
> Why would the errors not be independent between tests?  And if they
> weren't, there would be some auto-correlation between tests, would there
> not?  I'll attach the H.264 data, tell me where the auto-correlation is.
> I didn't see any.

The activity of other processes on the computer is quite likely to have
trends which are not independent between tests.

> > is supposed to be (almost) deterministic, using minimum is more robust:
> 
> There are plenty of other testing applications where the true value of the
> variable being measured doesn't change between samples, and any errors are
> the result of measurement error.  Benchmarking a computer program isn't some
> kind of unique scenario that requires one to throw out the previous century
> of statistical thought.  Can you find any literature that suggests that
> using minimum or maximum is a better test?

Benchmarking a computer program might not be a unique scenario; however,
it IS different from the area of applicability of most statistical
tests. You could say it's not so much a case of "throwing out the
previous century of statistical thought" as a scenario where people
haven't really tried to apply those statistical methods during the last
century either. There are at least 2 significant factors:

1) Systematic errors and unexpected distribution of random errors in
testing are likely.

2) Even if you're extremely careful in testing, the best statistics can
tell you is something like "option A is likely faster if you use the
code from this day and hour in svn, compiled with this exact compiler
version and linked with this exact linker version, running on this
particular CPU, under this version of this OS with these other programs
running". Once you extrapolate from that to something which actually
matters, the uncertainties in the extrapolation are more important than
any fine statistical details.

1) means that robustness is a major factor, and 2) means that small
differences don't matter even if you could detect them more accurately.
The minimum is quite robust, even against extreme cases like someone
hostilely setting 2/3 of the samples to arbitrary values meant to
confuse your statistics as badly as possible (still with the constraint
that they're not lower than the true value, of course), and it gives a
better estimate of the "true" value than other simple methods such as
the mean.
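
To illustrate, here's a quick simulation sketch of my own (the timing
numbers and the interference model are made up, not taken from your data):

import random

# Model from the discussion above: measured time = true time plus a
# non-negative interference term, so samples can only err upwards.
TRUE_TIME = 12.5          # hypothetical "true" decode time in seconds
N_SAMPLES = 30

random.seed(0)
samples = [TRUE_TIME + random.expovariate(1.0 / 0.2)
           for _ in range(N_SAMPLES)]

# "Hostile" case: 2/3 of the samples inflated by arbitrary amounts.
# The minimum moves very little; the mean is ruined.
corrupted = [t if i % 3 == 0 else t + random.uniform(5, 50)
             for i, t in enumerate(samples)]

print("true value:       %.3f" % TRUE_TIME)
print("mean (clean):     %.3f" % (sum(samples) / N_SAMPLES))
print("min  (clean):     %.3f" % min(samples))
print("mean (corrupted): %.3f" % (sum(corrupted) / N_SAMPLES))
print("min  (corrupted): %.3f" % min(corrupted))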

> > getting one single "good enough" sample means the test gives the right
> 
> How do you decide when you have the magic sample that you will use while
> throwing all other data out?

After a few dozen runs, maybe checking whether the lowest value is
isolated or whether there are several close ones. Seriously, trying to
be more exact than that is misguided: errors from reasons other than bad
luck with the samples are much more likely.
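
Something like this rough check is all I have in mind (a hypothetical
helper of mine; the times are made up for illustration):

def minimum_looks_solid(samples, rel_tol=0.005, want=3):
    # Is the best time an isolated fluke, or do several samples land
    # within rel_tol (relative) of it?
    lo = min(samples)
    close = sum(1 for t in samples if t <= lo * (1 + rel_tol))
    return close >= want

times = [12.631, 12.628, 12.640, 12.629, 12.702, 12.633]
print(min(times), minimum_looks_solid(times))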

> Why don't you?  What are the instances when "best of x", where x is some
> number you came up with beforehand, is mathematically better than
> established tests such as Student's t test or Wilcoxon Rank-Sum?  It seems

Your exact question here isn't too clear, since "best of x" produces an
estimate of the true value, whereas those tests do not produce an
estimate of the value but a probability of one distribution having a
higher mean than another. Anyway, a distribution where the measurement
is the true value plus one of [0, 1, 2, 3, 4] with equal probability is
an example where taking the minimum first is better for most purposes,
as it will very likely give you the exact true value once you have a
reasonable number of samples.
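
With that error model the arithmetic is easy to check (just restating
the [0..4] example above):

# P(minimum of n samples == exact true value) = 1 - (4/5)**n
for n in (5, 10, 20, 30):
    print(n, 1 - (4.0 / 5.0) ** n)
# With 20-30 samples the minimum is almost certainly the exact true
# value, while the sample mean still sits roughly 2 units above it.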

> that if "best of x" was a good idea, at least for some real scenario, there
> would be papers published on it and references to the test.  I didn't find
> any.  Do you know of any?

I don't know of papers, and wouldn't expect any to be published (at
least about the case of using it with computer programs). There isn't
that much to say: there is no hidden feature of the data that could be
detected with a clever test; rather, the point is that using any
sophisticated analysis tools here is likely a mistake.

> > > Now noif_dsp is best again with 12.628.  First we get one answer, then the
> > > other, and then the original again!  If I did another 10 runs, what would
> > > be the best then?  How does one qualify the result, that after X runs this
> > > is best, but after X+1 maybe it will be totally different?  It seems that
> > > the conclusion from Student's t test, that there is no significant
> > > difference, is the right one.  Looking at the minimum is trying to find an
> > > answer that isn't there.
> >
> > This argument is nonsense. If the values are close then which one has
> > the smaller mean can also keep changing. Whether the difference is
> > significant is another question. You seem to set up an idiotic strawman
> 
> It isn't another question, it is _the_ question.  Does the data support a
> conclusion that there is a difference between two populations?  It's the
> classic question of hypothesis testing.

As mentioned above, that is not the question you're normally interested
in when doing benchmarks. You want to know what the speeds actually are
(at least relative to each other), not merely whether one is higher than
the other (by some unknown amount).

> > argument saying that any difference in the minimums would have to be
> > interpreted as a significant difference.
> 
> So, how does one decide if the difference in minimums is significant?  You
> take the minimum of the two samples and then what?

If the difference in the values is greater than what you consider a
significant performance difference. Worrying about random errors in
testing would be a mistake; peculiarities in the "true" values that are
specific to your current platform or to the exact svn version from this
day are a bigger source of error.
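
In code terms the "test" amounts to nothing more than this (the 1%
threshold is an arbitrary example; pick whatever difference you
actually care about):

def practically_different(times_a, times_b, threshold=0.01):
    # Compare the best times of two builds against a practical
    # significance threshold, not against sampling noise.
    a, b = min(times_a), min(times_b)
    return abs(a - b) / min(a, b) > threshold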



