[Ffmpeg-devel] VP3/Theora Perfection

Thu May 19 14:24:24 CEST 2005

Hi

On Thursday 19 May 2005 13:04, The Wanderer wrote:
> Michael Niedermayer wrote:
> > Hi
> >
> > On Thursday 19 May 2005 04:47, Mike Melanson wrote:
> >> Hi, I have replaced unpack_token() with a series of lookup tables
> >> in vp3.c. Now vp3data.h has more lines than vp3.c. Again, please
> >> test as I do not have great testing facilities right now. However,
> >> I did run a series of tests that validated a bunch of decoded
> >> tokens against the old function.
> >>
> >> Numbers for the speed freaks:
> >>
> >> [original]
> >> 1223 dezicycles in unpack_token, 32757 runs, 11 skips
> >> 1202 dezicycles in unpack_token, 65512 runs, 24 skips
> >> [new]
> >> 845 dezicycles in unpack_token, 32735 runs, 33 skips
> >> 841 dezicycles in unpack_token, 65466 runs, 70 skips
> >
> > Well, not here. After a cvs up, unpack_dct_coeffs (which includes
> > unpack_token()) speed dropped by 20%. To exclude possible effects of
> > local changes, I tried on a clean tree:
> >
> > [original]
> > 47208165 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
> > 46909636 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
> > 47450793 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
> >
> > [new]
> > 43178650 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
> > 42991589 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
> > 43081780 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
>
> Am I reading these wrong? It looks to me like the original spends about
> 9.3% more time in unpack_dct_coeffs than the new version does (that's
> (47208165/43178650) - 1 ~= .0933). I'm assuming that the "new" version
> is the one which was just committed, i.e. the one which you are saying
> is slower; if it takes fewer dezicycles, I'm not sure how that doesn't
> mean it's faster instead. (If this assumption is invalid, I'd be
> interested to know how it makes sense to label the different versions
> that way...)

The new (r1.58) version is faster in this test with a clean tree. I didn't post
the scores of my dev tree because I didn't save them.


>
> Similarly, with your different-cflags version:
> > [original]
> > 41514189 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
> > 41710143 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
> > 41758835 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
> >
> > [new]
> > 43992551 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
> > 44276594 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
> > 43972657 dezicycles in unpack_dct_coeffs, 64 runs, 0 skips
>
> Here it looks to me like the "original" version spends about 5.5% *less*
> time in unpack_dct_coeffs than the "new" one does (that's 1 - (41514189 /
> 43992551) ~= .056), i.e., the "new" one is slower. Like the above, this
> is exactly the reverse of what you're saying; is my brain just totally
> screwed up here, or is something else going on?

Hmm, I'll try to explain it again; no doubt my first try was a little strangely
written.

My dev tree got slower after a cvs up. I don't have the benchmark scores any
more, and my dev tree has changed since then, so I can't rerun it easily.

Testing the r1.57 -> r1.58 change on a clean tree shows that the new version
is faster (see my first benchmark), but if -finline-limit=2000 is added then
r1.57 is faster (see my second benchmark). r1.57 with -finline-limit=2000 is
also faster than r1.58 without -finline-limit=2000.
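
To make the comparison concrete, here is a small sketch that averages the
three runs of each benchmark quoted above and computes how much extra time one
version spends relative to the other. The function and array names are made up
for this illustration; only the cycle counts come from the benchmarks in this
thread.

```c
#include <stddef.h>

/* Cycle counts for unpack_dct_coeffs, copied from the benchmarks above. */
const double clean_r157[3]  = { 47208165, 46909636, 47450793 }; /* [original] */
const double clean_r158[3]  = { 43178650, 42991589, 43081780 }; /* [new]      */
const double inline_r157[3] = { 41514189, 41710143, 41758835 }; /* [original], -finline-limit=2000 */
const double inline_r158[3] = { 43992551, 44276594, 43972657 }; /* [new],      -finline-limit=2000 */

/* Arithmetic mean of n measurements. */
double mean(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Fraction of extra time a takes compared to b; positive means a is slower. */
double extra_time(double a, double b)
{
    return a / b - 1.0;
}
```

With these numbers, extra_time(mean(clean_r157, 3), mean(clean_r158, 3)) comes
out at roughly 0.095 (r1.57 is slower on a clean tree), while
extra_time(mean(inline_r158, 3), mean(inline_r157, 3)) is roughly 0.058 (r1.58
is slower once -finline-limit=2000 is added).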

The new code is also significantly smaller than the old, as it replaces a
"large" switch with lookup tables.
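
As a rough illustration of that kind of change (this is not the actual vp3.c
or vp3data.h code, and the 8-token alphabet below is invented), a switch that
maps a token to a number of extra bits can be replaced by a constant table
indexed by the token:

```c
/* Hypothetical toy alphabet; NOT the real VP3 token set. */
enum { TOY_TOKENS = 8 };

/* Switch version: one case per token, compiled to branches or a jump table. */
int toy_extra_bits_switch(int token)
{
    switch (token) {
    case 0: case 1: case 2: return 0;
    case 3: case 4:         return 1;
    case 5:                 return 2;
    case 6:                 return 3;
    case 7:                 return 6;
    default:                return -1; /* invalid token */
    }
}

/* Table version: the same mapping stored as data. */
const signed char toy_extra_bits_table[TOY_TOKENS] = { 0, 0, 0, 1, 1, 2, 3, 6 };

int toy_extra_bits_lookup(int token)
{
    if (token < 0 || token >= TOY_TOKENS)
        return -1; /* invalid token */
    return toy_extra_bits_table[token];
}
```

The table version moves the mapping out of the code and into data, which is
why the function body shrinks while the data header grows.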

From all that evidence I conclude that gcc didn't inline unpack_token() or
unpack_vlcs() originally, and that the speed increase seen on a clean tree is
not because the function is really faster but because it is inlined.
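
One way to check such a hypothesis without relying on -finline-limit would be
to pin the inlining decision per function with GCC's always_inline and
noinline attributes and benchmark both builds. The sketch below only shows the
mechanism; the toy_* functions are hypothetical stand-ins, not anything from
vp3.c.

```c
/* GCC-specific attributes; all toy_* names are hypothetical. */

/* Forced inline: the body is expanded at every call site. */
static inline __attribute__((always_inline))
unsigned toy_step_inlined(unsigned acc, unsigned x)
{
    return acc * 31u + x;
}

/* Forced out of line: every use is a real function call. */
static __attribute__((noinline))
unsigned toy_step_call(unsigned acc, unsigned x)
{
    return acc * 31u + x;
}

/* Same loop twice; only the inlining of the per-element helper differs. */
unsigned toy_sum_inlined(const unsigned *v, int n)
{
    unsigned acc = 0;
    for (int i = 0; i < n; i++)
        acc = toy_step_inlined(acc, v[i]);
    return acc;
}

unsigned toy_sum_call(const unsigned *v, int n)
{
    unsigned acc = 0;
    for (int i = 0; i < n; i++)
        acc = toy_step_call(acc, v[i]);
    return acc;
}
```

Both loops compute the same result, so any timing difference between them
comes from call overhead and the optimizations that inlining enables, not from
the helper's logic.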

-- 
Michael