[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations

Wed Sep 3 05:23:02 CEST 2008

On Tue, Sep 02, 2008 at 06:03:15AM +0300, Siarhei Siamashka wrote:
> On Sunday 31 August 2008, Michael Niedermayer wrote:
> > On Sun, Aug 31, 2008 at 01:18:14PM +0300, Siarhei Siamashka wrote:
> [...]
> > > Regarding 'vorbis_residue_decode' function, it probably makes sense to
> > > optimize these loops:
> > >
> > > if(dim==2) {
> > >     for(k=0;k<step;++k) {
> > >         coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 2;
> > >         vec[voffs+k     ]+=codebook.codevectors[coffs  ];  // FPMATH
> > >         vec[voffs+k+vlen]+=codebook.codevectors[coffs+1];  // FPMATH
> > >     }
> > > } else if(dim==4) {
> > >     for(k=0;k<step;++k, voffs+=2) {
> > >         coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 4;
> > >         vec[voffs       ]+=codebook.codevectors[coffs  ];  // FPMATH
> > >         vec[voffs+1     ]+=codebook.codevectors[coffs+2];  // FPMATH
> > >         vec[voffs+vlen  ]+=codebook.codevectors[coffs+1];  // FPMATH
> > >         vec[voffs+vlen+1]+=codebook.codevectors[coffs+3];  // FPMATH
> > >     }
> > > } ...
> > >
> > > 'get_vlc2' call could be replaced with some GET_VLC/GET_RL_VLC variant
> > > so that the number of intermediate excessive UPDATE_CACHE operations is
> > > minimized.
> >
> > These are all nice ideas, but they aren't really related to the change here,
> > so a patch is welcome.
> >
> > [...]
> 
> Thank you for applying the previous patch.
> 
> Now here is some preliminary version (not intended for applying yet) of a new
> patch which tries to implement the ideas which were quoted above.
> 

> The first thing that is important for performance is that a vast majority
> of 'get_vlc2' calls perform only a single table lookup. The probability of a single

This is probably true for nearly all get_vlc* calls, not only the ones in vorbis, so
I guess the (un)likely hints can be added to the existing code.
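
To make the (un)likely idea concrete, here is a minimal sketch of a two-level
table lookup with the rare second lookup marked as unlikely. The table layout
and every toy_* name are made up for this example and are not the real
get_vlc2/GET_VLC code; only __builtin_expect is the actual GCC builtin.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__GNUC__)
#define toy_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define toy_unlikely(x) (x)
#endif

/* Hypothetical table entry (not FFmpeg's layout):
 *   len > 0: 'sym' is the decoded symbol and 'len' bits were consumed
 *   len < 0: 'sym' is the offset of a subtable indexed by -len more bits */
typedef struct { int16_t sym; int8_t len; } ToyVLC;

/* Decode one symbol from the top bits of a 32-bit, MSB-first cache word.
 * 'bits' is the width of the first-level index. The number of consumed
 * bits is stored in *used. */
static int toy_get_vlc(const ToyVLC *table, uint32_t cache, int bits, int *used)
{
    unsigned idx;
    int sym, len;

    assert(bits > 0 && bits < 32);
    idx = cache >> (32 - bits);
    sym = table[idx].sym;
    len = table[idx].len;

    if (toy_unlikely(len < 0)) {      /* rare: the code is longer than 'bits' */
        int sub_bits = -len;
        cache <<= bits;               /* drop the bits already used as index */
        idx = (cache >> (32 - sub_bits)) + (unsigned)sym;
        sym = table[idx].sym;
        len = table[idx].len + bits;
    }
    *used = len;
    return sym;
}
```

The hint only tells the compiler which path to lay out as the fall-through
case; whether it helps in practice would have to be benchmarked.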

> table lookup is usually higher than 90% when V_NB_BITS is equal to 8 and more
> than 95% if V_NB_BITS is increased to 9. That not only lowers the number of
> instructions executed, but is also much easier on the branch predictor,
> reducing conditional jump overhead. I used test code from
> 'vorbis_vlcfreq.diff' (an ugly hack) to measure these probabilities and
> get the statistics.
> 
> As 'codebook_setup->maxdepth' is sometimes lower than V_NB_BITS, there is no
> point in creating larger VLC tables; smaller tables save memory and reduce the
> number of cache misses. That's why the line 'if(codebook_setup->maxdepth < V_NB_BITS)
> codebook_setup->nb_bits=codebook_setup->maxdepth' was added to the code.
> 
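
As a rough illustration of the saving, a sketch of the clamp and the resulting
first-level table size. The value of V_NB_BITS and the toy_* helpers are
assumptions for this example only; the patch itself sets
codebook_setup->nb_bits directly.

```c
#include <assert.h>

#define V_NB_BITS 8   /* value assumed for this example */

/* Index width to use for the first-level table: never wider than the
 * longest code in the codebook. */
static unsigned toy_pick_nb_bits(unsigned maxdepth)
{
    assert(maxdepth > 0);
    return maxdepth < V_NB_BITS ? maxdepth : V_NB_BITS;
}

/* Number of entries in a first-level table indexed by that many bits. */
static unsigned toy_first_level_entries(unsigned maxdepth)
{
    return 1u << toy_pick_nb_bits(maxdepth);
}
```

With these assumptions, a codebook whose longest code is 5 bits needs 32
first-level entries instead of 256.
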
> There is no point doing more than one UPDATE_CACHE operation per GET_VLC.
> Moreover, as GET_VLC reads at most 11 bits (and typically just V_NB_BITS 
> bits) per table lookup, it is possible to do UPDATE_CACHE once per two or
> even three calls to GET_VLC. 
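
The arithmetic behind this can be sketched with a toy bit reader: if one
refill guarantees at least 25 valid bits and each read takes at most 11, two
reads (22 bits) fit into a single refill. Everything below (the ToyBits struct,
the toy_* functions, the MSB-first bit order) is a simplified stand-in chosen
for readability, not the real UPDATE_CACHE/GET_VLC macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *buf;
    size_t size;     /* buffer size in bytes */
    size_t pos;      /* next byte to load */
    uint32_t cache;  /* valid bits are left-aligned (MSB first) */
    int bits;        /* number of valid bits in 'cache' */
} ToyBits;

/* Refill byte by byte until at least 25 bits are valid.
 * Reads past the end of the buffer are treated as zero bits. */
static void toy_update_cache(ToyBits *b)
{
    while (b->bits <= 24) {
        uint32_t byte = b->pos < b->size ? b->buf[b->pos] : 0;
        b->pos++;
        b->cache |= byte << (24 - b->bits);
        b->bits += 8;
    }
}

/* Take n bits from the cache without refilling. */
static unsigned toy_take(ToyBits *b, int n)
{
    unsigned v;

    assert(n >= 1 && n <= b->bits);
    v = b->cache >> (32 - n);
    b->cache <<= n;
    b->bits -= n;
    return v;
}

/* Read two fields of at most 11 bits each with a single refill. */
static void toy_read_pair(ToyBits *b, int n0, int n1,
                          unsigned *v0, unsigned *v1)
{
    assert(n0 <= 11 && n1 <= 11);
    toy_update_cache(b);   /* >= 25 bits valid, at most 22 are needed */
    *v0 = toy_take(b, n0);
    *v1 = toy_take(b, n1);
}
```

Three reads per refill would only be safe when the combined worst case still
fits into the guaranteed bits, for example when the codes are known to be
short.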

> The generic SHOW_UBITS macro is also quite expensive.
> As the first table lookup in GET_VLC always uses a constant number of bits
> as index, it is possible to pre-calculate the bitmask and use it to speed up the code.

This could be added as a SHOW_CONST_UBITS macro.
Also, gcc should be able to build the mask itself at compile time as long as
no asm shift tricks are used.
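
Roughly what a constant-width variant could look like next to a generic one,
assuming a cache word where the next bits to show sit in the low bits. The
toy_/TOY_ names and the width of 8 are made up for this sketch; this is not the
existing SHOW_UBITS definition.

```c
#include <assert.h>
#include <stdint.h>

/* Generic variant: the mask depends on a run-time bit count, so it has to
 * be built with a shift on every call. */
static inline uint32_t toy_show_ubits(uint32_t cache, int num)
{
    assert(num >= 1 && num <= 32);
    return cache & (0xFFFFFFFFu >> (32 - num));
}

/* Constant-width variant: with a compile-time width the mask folds into
 * an immediate and only the AND remains. */
#define TOY_NB_BITS 8   /* width assumed for this example */
#define TOY_SHOW_CONST_UBITS(cache) \
    ((uint32_t)(cache) & ((1u << TOY_NB_BITS) - 1u))
```

If the index width differs per codebook, the same effect could be had by
storing the mask next to nb_bits when the table is built.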

> 
> Most performance critical parts of 'vorbis_residue_decode' function were
> extracted into separate 'vorbis_residue_decode_inner_loop_dimN' functions
> for benchmarking purposes. The oprofile log typically looks like this (for a
> 256 kbit vorbis file):
> 
> CPU: PIII, speed 1862.24 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit 
> mask of 0x00 (No unit mask) count 99991
> samples %        image name              symbol name
> 1062    12.9449  ffmpeg_g                ff_imdct_half_sse
> 976     11.8966  ffmpeg_g                vorbis_residue_decode_inner_loop_dim2
> 953     11.6163  ffmpeg_g                ff_vorbis_floor1_render_list
> 745      9.0809  ffmpeg_g                vorbis_residue_decode
> 671      8.1789  ffmpeg_g                pass_sse.loop
> 577      7.0332  ffmpeg_g                vorbis_floor1_decode
> 389      4.7416  ffmpeg_g                vorbis_parse_audio_packet
> 384      4.6806  ffmpeg_g                fft16_sse
> 378      4.6075  ffmpeg_g                vorbis_residue_decode_inner_loop_dim4
> 290      3.5349  libc-2.7.so             (no symbols)
> 266      3.2423  ffmpeg_g                vorbis_inverse_coupling_sse
> 224      2.7304  ffmpeg_g                vector_fmul_window_sse
> 161      1.9625  ffmpeg_g                fft8_sse
> 153      1.8649  ffmpeg_g                vector_fmul_sse
> 
> Or sometimes like this (for a 64 kbit vorbis file):
> 
> CPU: PIII, speed 1862.24 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit 
> mask of 0x00 (No unit mask) count 99991
> samples %        image name              symbol name
> 1112    16.8536  ffmpeg_g                ff_imdct_half_sse
> 891     13.5041  ffmpeg_g                ff_vorbis_floor1_render_list
> 656      9.9424  ffmpeg_g                pass_sse.loop
> 408      6.1837  ffmpeg_g                vorbis_residue_decode
> 384      5.8199  ffmpeg_g                fft16_sse
> 372      5.6381  ffmpeg_g                vorbis_floor1_decode
> 349      5.2895  ffmpeg_g                vorbis_parse_audio_packet
> 285      4.3195  ffmpeg_g                vorbis_residue_decode_inner_loop_dim8
> 252      3.8193  ffmpeg_g                vector_fmul_window_sse
> 240      3.6375  ffmpeg_g                vorbis_inverse_coupling_sse
> 237      3.5920  libc-2.7.so             (no symbols)
> 219      3.3192  ffmpeg_g                vorbis_residue_decode_inner_loop_dim2
> 161      2.4401  ffmpeg_g                fft8_sse
> 142      2.1522  ffmpeg_g                float_to_int16_interleave_sse2
> 118      1.7884  ffmpeg_g                vector_fmul_sse
> 110      1.6672  ffmpeg_g                build_table
> 83       1.2580  ffmpeg_g                av_interleave_packet_per_dts
> 79       1.1973  ffmpeg_g                pcm_encode_frame
> 72       1.0912  ffmpeg_g                output_packet
> 51       0.7730  ffmpeg_g                vorbis_residue_decode_inner_loop_dim4
> 35       0.5305  ffmpeg_g                av_encode
> 33       0.5002  ffmpeg_g                compute_pkt_fields2
> 
> In any case, 'vorbis_residue_decode_inner_loop_dimN' functions all together
> usually take most of the time of 'vorbis_residue_decode' execution.
> 

> The attached patch already provides a visible performance improvement (up to
> 5% for a 256 kbit vorbis file, lower bitrate files gain less). But these
> functions can probably be optimized using assembly (as gcc really does a poor
> job allocating registers here) and SIMD instructions. I wonder if it still
> makes sense to inline them, or whether moving them to dsputil would also be ok?

I lean slightly toward splitting the vlc reading from the vector adding and
then optimizing them separately, and moving to dsputil for the cases where we
have asm that is faster than C. This would reduce the register pressure a
little and also make optimization easier.
Though it would require an intermediate array, and it's not clear which way
would be faster.
Of course, if it turns out to be faster to do it all in one, I am fine with that
as well.
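
A minimal sketch of what the split could look like for the dim==2 case: one
pass that only fills an intermediate array of codevector offsets, and one pass
that only does the float adds. The toy_* names, the function-pointer decoder
and TOY_MAX_STEP are hypothetical; in the decoder the first pass would call
get_vlc2() (or a GET_VLC variant) directly.

```c
#include <assert.h>

#define TOY_MAX_STEP 64   /* assumed upper bound for the intermediate array */

/* Stand-in for the bitstream: hands out pre-decoded symbols one by one. */
typedef struct { const int *syms; int pos; } ToySymbols;

static int toy_next_symbol(void *ctx)
{
    ToySymbols *s = ctx;
    return s->syms[s->pos++];
}

/* Pass 1: VLC reading only. Stores codevector offsets for dim == 2. */
static void toy_read_indices(int (*decode)(void *), void *ctx,
                             int *coffs, int step)
{
    int k;

    assert(step >= 0 && step <= TOY_MAX_STEP);
    for (k = 0; k < step; k++)
        coffs[k] = decode(ctx) * 2;
}

/* Pass 2: vector adding only, same arithmetic as the quoted dim==2 loop.
 * No bit reader state is live here, which is where the lower register
 * pressure would come from. */
static void toy_add_dim2(float *vec, int voffs, int vlen,
                         const float *codevectors, const int *coffs, int step)
{
    int k;

    for (k = 0; k < step; k++) {
        vec[voffs + k]        += codevectors[coffs[k]];
        vec[voffs + k + vlen] += codevectors[coffs[k] + 1];
    }
}
```

The second function is the kind of loop that could go to dsputil with an asm
version; whether the extra array traffic eats the gain needs measuring.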

Either way, I'd be very happy if I had a clean patch I could apply/approve
that provides a 5% or more :) speedup. Sadly I don't have the time to really
do much work on this beyond reviewing. My ffmpeg-related todo list is growing
far too long already.

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I am the wisest man alive, for I know one thing, and that is that I know
nothing. -- Socrates