[FFmpeg-devel] IDCT permutation (was: pre discussion around Blackfin dct_quantize_bfin routine)

Thu Jun 14 11:51:31 CEST 2007

On Wednesday 13 June 2007 21:14, Michael Niedermayer wrote:
> Hi
>
> On Wed, Jun 13, 2007 at 10:57:26AM +0300, Siarhei Siamashka wrote:
> [...]
>
> > > also decoding involves a mandatory permutation
> > > so no matter what idct_permutation is set to it will be the same speed
> > > and wisely setting the idct permutation can simplify the idct and thus
> > > speed it up, this is a high level optimization and wont make code
> > > slower no matter how expensive the permutation is as there arent more
> > > permutations done
> >
> > By looking at ffmpeg code, this does not seem to be absolutely true...
> >
> > > the extra cost is just on the encoder side, where its just a single
> > > if() if its the no permutation case ...
> >
> > Please check the patch which is attached. It was generated by a ruby
> > script which is also attached:
>
> [...]
>
> > Before patch:
> >
> > $ ./mplayer.orig -nosound -quiet -benchmark -vo null -loop 3 /media/mmc1/Video/MissionImpossible3_Trailer4.divx | grep BENCHMARKs
> > BENCHMARKs: VC:  89.976s VO:   0.034s A:   0.000s Sys:   1.089s =   91.098s
> > BENCHMARKs: VC:  93.419s VO:   0.033s A:   0.000s Sys:   1.069s =   94.521s
> > BENCHMARKs: VC:  93.307s VO:   0.032s A:   0.000s Sys:   1.078s =   94.418s
> >
> > After patch:
> >
> > ~ $ ./mplayer.patched -nosound -quiet -benchmark -vo null -loop 3 /media/mmc1/Video/MissionImpossible3_Trailer4.divx | grep BENCHMARKs
> > BENCHMARKs: VC:  87.998s VO:   0.036s A:   0.000s Sys:   1.086s =   89.120s
> > BENCHMARKs: VC:  91.074s VO:   0.035s A:   0.000s Sys:   1.069s =   92.177s
> > BENCHMARKs: VC:  91.377s VO:   0.036s A:   0.000s Sys:   1.069s =   92.482s
>
> i do not believe that the changes in the patch (95% of them in very rarely
> executed init code) directly caused this difference
> the only part which might have caused it is the ac prediction, 

This makes it all very interesting. The difference on x86 is barely
noticeable (tested on an Athlon XP). I wonder what causes such an effect
on ARM? Maybe the data cache (only 16K) gets thrashed heavily during video
decoding, so that the permutation table lookups cause cache misses. Or
perhaps removing the table lookup makes the code simpler, and the optimizer
is suddenly able to allocate registers better throughout the function,
resulting in better performance...
All of this can, and probably should, be verified. At the very least,
simulations with callgrind can provide some insight into cache usage, plus
statistics on the overall number of permutation table lookups and the places
where they occur most (the permutation lookup macro can be replaced
with a function for testing purposes).
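
A minimal sketch of the kind of instrumentation meant here, in case it helps
anybody who wants to repeat the test. All names (idct_permutation as a global,
permutation_lookups, permute_counted, PERMUTE) are hypothetical stand-ins and
not the identifiers used in the patch or in ffmpeg; the table is simply filled
with the identity mapping, i.e. the "no permutation" case:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the decoder's 64-entry IDCT permutation table. */
static uint8_t idct_permutation[64];

/* Instrumentation counter: total number of permutation lookups performed. */
static unsigned long permutation_lookups;

/* Fill the table with the identity mapping (the "no permutation" case). */
static void init_identity_permutation(void)
{
    for (int i = 0; i < 64; i++)
        idct_permutation[i] = (uint8_t)i;
}

/* Function used in place of a lookup macro, so every lookup gets counted. */
static uint8_t permute_counted(unsigned idx)
{
    assert(idx < 64);          /* the table only has 64 entries */
    permutation_lookups++;
    return idct_permutation[idx];
}

/* Test builds only: route the (hypothetical) lookup macro through the counter. */
#define PERMUTE(i) permute_counted(i)
```

For the per-call-site statistics in callgrind, the function probably has to be
kept out of line (for example by building the test binary without inlining),
otherwise the lookups are attributed to the callers again.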

On a related note, optimizing the IDCT on ARM seems to have a much smaller
effect than on x86. Even removing the IDCT completely (just as a test) does
not affect performance much. It looks like a lot of time is spent in AC
prediction and other decoder parts. Investigating where the time actually
goes may show what can be improved.

> if this
> really is that critical i am sure there are more sensible ways to optimize
> it, keep in mind we are permuting a lot of zeros around and we know where
> the last non zero element is

> also your patch breaks half of the IDCTs if the permutation is ignored

In practice it might make some sense to sacrifice these IDCTs in some
configurations to gain more performance (if the improvement is worth it).
But I agree that a more generic solution would be better.

Performing a table lookup only to get the same value back, as happens with
non-permuting IDCTs, causes some overhead (whether it is big enough to worry
about is another matter). To keep the other IDCTs working, it should be
possible to do the permutation just before calling the IDCT (yes, it would be
less efficient and slower). This way the code would favour non-permuting
IDCTs and slow down all the others. Anyway, supporting all this is hardly
interesting for mainstream architectures such as x86. I just wonder how
Blackfin or other processors with simple pipelines behave in this respect,
which is why I posted this patch for testing in the Blackfin discussion thread.
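
To make the "permute just before calling the IDCT" idea more concrete, here is
a rough sketch. The helper names (permute_block, init_transpose_permutation)
are made up for illustration and are not part of the patch; the convention
assumed is that the coefficient with natural index i ends up at block[perm[i]],
and an 8x8 transpose is used purely as an example of a permutation an IDCT
might want:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: take a block stored in natural order and reorder it
 * into the layout a permuting IDCT expects. perm[i] is the destination
 * index of the coefficient whose natural index is i.
 */
static void permute_block(int16_t block[64], const uint8_t perm[64])
{
    int16_t tmp[64] = {0};

    for (int i = 0; i < 64; i++) {
        assert(perm[i] < 64);      /* perm must map into the 8x8 block */
        tmp[perm[i]] = block[i];
    }
    memcpy(block, tmp, sizeof(tmp));
}

/* Example permutation: swap rows and columns of the 8x8 block. */
static void init_transpose_permutation(uint8_t perm[64])
{
    for (int i = 0; i < 64; i++)
        perm[i] = (uint8_t)(((i & 7) << 3) | (i >> 3));
}
```

The decoder would then always write coefficients in natural order, and only
the IDCTs that need a different layout would pay for the extra pass over the
64 coefficients (which is exactly the slowdown mentioned above).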