[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations

Sun Aug 31 12:18:14 CEST 2008

On Sunday 31 August 2008, Michael Niedermayer wrote:
> On Sat, Aug 30, 2008 at 11:42:31PM +0300, Siarhei Siamashka wrote:
> > On Saturday 30 August 2008, Loren Merritt wrote:
> > > On Sat, 30 Aug 2008, Siarhei Siamashka wrote:
> > > > This trivial patch improves overall vorbis decoding performance by
> > > > ~3% on Pentium-M with gcc 4.2.3
> > >
> > > vorbis_residue_decode_type# are superfluous. Just inline
> > > vorbis_residue_decode_internal into vorbis_residue_decode.
> >
> > Theoretically they are superfluous (inlining
> > vorbis_residue_decode_internal into vorbis_residue_decode was the first
> > thing that I tried). But in practice code is consistently faster this
> > way. Probably it is easier for gcc to optimize 3 independent functions
> > than everything bundled into a huge one. Let me know if you get different
> > results.
>
> well, I do
>
> [...]
>
> > --------------------
> > callgrind simulation for './ffmpeg_g.1huge' (L1 data cache is 32K):
> > I   refs:      85,817,091
> > D   refs:      43,457,905  (28,888,575 rd + 14,569,330 wr)
> > D1  misses:       785,564  (   583,645 rd +    201,919 wr)
> > D1  miss rate:        1.8% (       2.0%   +        1.3%  )
> > callgrind simulation for './ffmpeg_g.3func' (L1 data cache is 32K):
> > I   refs:      85,085,997
> > D   refs:      42,653,212  (28,454,961 rd + 14,198,251 wr)
> > D1  misses:       782,978  (   581,685 rd +    201,293 wr)
> > D1  miss rate:        1.8% (       2.0%   +        1.4%  )
> >
> > The difference is visible both for the total number of instructions and
> > for the number of memory accesses.
>
> loren:
> I   refs:      5,663,789,738
> I1  misses:        3,515,218
> I1  miss rate:          0.06%
> D   refs:      1,889,318,408  (1,365,757,445 rd   + 523,560,963 wr)
> D1  misses:       32,073,499  (   22,443,938 rd   +   9,629,561 wr)
> D1  miss rate:           1.6% (          1.6%     +         1.8%  )
>
> siar:
> I   refs:      5,670,795,747
> I1  misses:        3,488,120
> I1  miss rate:          0.06%
> D   refs:      1,896,279,210  (1,372,731,243 rd   + 523,547,967 wr)
> D1  misses:       32,096,476  (   22,464,805 rd   +   9,631,671 wr)
> D1  miss rate:           1.6% (          1.6%     +         1.8%  )

I took the time to compile and install gcc 4.3.2 and got similar results.
More importantly, the fastest build generated by gcc 4.3.2 (all inlined) was
better than the fastest build generated by 4.2.3 (dummy functions). This
really makes the choice quite obvious :)

> I'll commit the clean version without the dummy functions in a day or 2
> unless someone objects / has some idea of how to improve it.

I also tried to benchmark variants where 'vlen' is also inlined as the
constants 128 and 1024, which are quite typical (in the hope that it could
save one extra register for gcc in the inner loop), but the effect on
performance was minimal.
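
For illustration only, here is a minimal sketch of what such a 'vlen'
specialization could look like. It is not the code that was benchmarked:
the helper names (residue_add_dim2, residue_add_dim2_dispatch) and the
'idx' array standing in for already decoded codebook indices are
hypothetical, and whether the compiler really turns the literal into an
immediate depends on it inlining the helper.

```c
#include <assert.h>

/* Hypothetical stand-in for the dim==2 residue loop; 'idx' replaces the
 * get_vlc2() calls with already decoded codebook indices. */
static inline void residue_add_dim2(float *vec, const float *codevectors,
                                    const int *idx, int step,
                                    int voffs, int vlen)
{
    assert(step >= 0 && vlen > 0);
    for (int k = 0; k < step; ++k) {
        int coffs = idx[k] * 2;
        vec[voffs + k       ] += codevectors[coffs    ];
        vec[voffs + k + vlen] += codevectors[coffs + 1];
    }
}

/* Dispatch on the typical sizes. In the first two cases vlen is a literal
 * at the call site, so an inlining compiler may fold it into the address
 * computation instead of keeping it in a register. */
void residue_add_dim2_dispatch(float *vec, const float *codevectors,
                               const int *idx, int step,
                               int voffs, int vlen)
{
    switch (vlen) {
    case 128:
        residue_add_dim2(vec, codevectors, idx, step, voffs, 128);
        break;
    case 1024:
        residue_add_dim2(vec, codevectors, idx, step, voffs, 1024);
        break;
    default:
        residue_add_dim2(vec, codevectors, idx, step, voffs, vlen);
        break;
    }
}
```

All three paths compute the same result; only the generated code may
differ, which is consistent with the gain being small when register
pressure is not the bottleneck.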

Regarding the 'vorbis_residue_decode' function, it probably makes sense to
optimize these loops:

if(dim==2) {
    for(k=0;k<step;++k) {
        coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 2;
        vec[voffs+k     ]+=codebook.codevectors[coffs  ];  // FPMATH
        vec[voffs+k+vlen]+=codebook.codevectors[coffs+1];  // FPMATH
    }
} else if(dim==4) {
    for(k=0;k<step;++k, voffs+=2) {
        coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 4;
        vec[voffs       ]+=codebook.codevectors[coffs  ];  // FPMATH
        vec[voffs+1     ]+=codebook.codevectors[coffs+2];  // FPMATH
        vec[voffs+vlen  ]+=codebook.codevectors[coffs+1];  // FPMATH
        vec[voffs+vlen+1]+=codebook.codevectors[coffs+3];  // FPMATH
    }
} ...

The 'get_vlc2' call could be replaced with some GET_VLC/GET_RL_VLC variant
so that the number of redundant intermediate UPDATE_CACHE operations is
minimized.
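
The general idea can be sketched with a toy bit reader. This is not
FFmpeg's get_bits.h: BitReader, br_refill and br_read are hypothetical
names, and fixed-width codes stand in for real VLC lookups. The sketch
only shows that when one refill guarantees enough cached bits for several
codes, the refill can be hoisted out of the per-code path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy MSB-first bit reader (hypothetical, for illustration only). */
typedef struct {
    const uint8_t *buf;
    size_t size, pos;   /* buffer length and next byte to load */
    uint64_t cache;     /* the low 'bits' bits are still unread */
    int bits;
    unsigned refills;   /* counts refill operations */
} BitReader;

static void br_init(BitReader *br, const uint8_t *buf, size_t size)
{
    br->buf = buf; br->size = size; br->pos = 0;
    br->cache = 0; br->bits = 0; br->refills = 0;
}

/* Plays the role of a cache update: top up the cache from the buffer. */
static void br_refill(BitReader *br)
{
    br->refills++;
    while (br->bits <= 56 && br->pos < br->size) {
        br->cache = (br->cache << 8) | br->buf[br->pos++];
        br->bits += 8;
    }
}

/* Read n bits that are already in the cache; never refills by itself. */
static unsigned br_read(BitReader *br, int n)
{
    assert(n > 0 && n <= 24 && n <= br->bits);
    br->bits -= n;
    return (unsigned)(br->cache >> br->bits) & ((1u << n) - 1u);
}

/* One refill per code, analogous to calling a self-contained reader
 * function for every symbol. */
void decode_refill_each(BitReader *br, unsigned *out, int count, int n)
{
    for (int i = 0; i < count; ++i) {
        br_refill(br);
        out[i] = br_read(br, n);
    }
}

/* One refill per pair of codes; valid because 2*n bits fit in the cache
 * after a single refill. */
void decode_refill_pairs(BitReader *br, unsigned *out, int count, int n)
{
    for (int i = 0; i < count; i += 2) {
        br_refill(br);
        out[i] = br_read(br, n);
        if (i + 1 < count)
            out[i + 1] = br_read(br, n);
    }
}
```

Both decoders return the same symbols; the second simply performs about
half as many refills. How much of this applies to the residue loops depends
on the maximum code length of the codebooks involved, which the sketch does
not model.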

-- 
Best regards,
Siarhei Siamashka