[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations

Tue Sep 2 05:03:15 CEST 2008

On Sunday 31 August 2008, Michael Niedermayer wrote:
> On Sun, Aug 31, 2008 at 01:18:14PM +0300, Siarhei Siamashka wrote:
[...]
> > Regarding 'vorbis_residue_decode' function, it probably makes sense to
> > optimize these loops:
> >
> > if(dim==2) {
> >     for(k=0;k<step;++k) {
> >         coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 2;
> >         vec[voffs+k     ]+=codebook.codevectors[coffs  ];  // FPMATH
> >         vec[voffs+k+vlen]+=codebook.codevectors[coffs+1];  // FPMATH
> >     }
> > } else if(dim==4) {
> >     for(k=0;k<step;++k, voffs+=2) {
> >         coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 4;
> >         vec[voffs       ]+=codebook.codevectors[coffs  ];  // FPMATH
> >         vec[voffs+1     ]+=codebook.codevectors[coffs+2];  // FPMATH
> >         vec[voffs+vlen  ]+=codebook.codevectors[coffs+1];  // FPMATH
> >         vec[voffs+vlen+1]+=codebook.codevectors[coffs+3];  // FPMATH
> >     }
> > } ...
> >
> > 'get_vlc2' call could be replaced with some GET_VLC/GET_RL_VLC variant
> > so that the number of intermediate excessive UPDATE_CACHE operations is
> > minimized.
>
> These are all nice ideas but they arent really related to the change here
> so patch welcome
>
> [...]

Thank you for applying the previous patch.

Now here is some preliminary version (not intended for applying yet) of a new
patch which tries to implement the ideas which were quoted above.

The first thing that is important for performance is that a vast majority
of 'get_vlc2' calls perform only a singe table lookup. Probability of single
table lookup is usually higher than 90% when V_NB_BITS is equal to 8 and more
than 95% if V_NB_BITS is increased to 9. And that results not only in a lower
number of instructions executed, but also is much better for branch predictor
reducing conditional jump overhead. I used a test code from 
'vorbis_vlcfreq.diff' (ugly hack) to measure these probabilities and 
get the statistics.

As sometimes 'codebook_setup->maxdepth' is lower than V_NB_BITS, there is no
point in creating larger VLC tables, it saves memory and reduces the number of
cache misses. That's why 'if(codebook_setup->maxdepth < V_NB_BITS)
codebook_setup->nb_bits=codebook_setup->maxdepth' line was added to code.

There is no point doing more than one UPDATE_CACHE operation per GET_VLC.
Moreover, as GET_VLC reads at most 11 bits (and typically just V_NB_BITS 
bits) per table lookup, it is possible to do UPDATE_CACHE once per two or
even three calls to GET_VLC. Generic SHOW_UBITS macro is also quite expensive.
As the first table lookup in GET_VLC always uses a constant number of bits 
as index, it is possible to pre-calculate bitmask and use it to speed up code.

Most performance critical parts of 'vorbis_residue_decode' function were
extracted into separate 'vorbis_residue_decode_inner_loop_dimN' functions
for benchmarking purposes. Log from oprofile typically looks like this (for
256 kbit vorbis file):

CPU: PIII, speed 1862.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit 
mask of 0x00 (No unit mask) count 99991
samples %        image name              symbol name
1062    12.9449  ffmpeg_g                ff_imdct_half_sse
976     11.8966  ffmpeg_g                vorbis_residue_decode_inner_loop_dim2
953     11.6163  ffmpeg_g                ff_vorbis_floor1_render_list
745      9.0809  ffmpeg_g                vorbis_residue_decode
671      8.1789  ffmpeg_g                pass_sse.loop
577      7.0332  ffmpeg_g                vorbis_floor1_decode
389      4.7416  ffmpeg_g                vorbis_parse_audio_packet
384      4.6806  ffmpeg_g                fft16_sse
378      4.6075  ffmpeg_g                vorbis_residue_decode_inner_loop_dim4
290      3.5349  libc-2.7.so             (no symbols)
266      3.2423  ffmpeg_g                vorbis_inverse_coupling_sse
224      2.7304  ffmpeg_g                vector_fmul_window_sse
161      1.9625  ffmpeg_g                fft8_sse
153      1.8649  ffmpeg_g                vector_fmul_sse

Or sometimes like this (for 64 kbit vorbis file):

CPU: PIII, speed 1862.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit 
mask of 0x00 (No unit mask) count 99991
samples %        image name              symbol name
1112    16.8536  ffmpeg_g                ff_imdct_half_sse
891     13.5041  ffmpeg_g                ff_vorbis_floor1_render_list
656      9.9424  ffmpeg_g                pass_sse.loop
408      6.1837  ffmpeg_g                vorbis_residue_decode
384      5.8199  ffmpeg_g                fft16_sse
372      5.6381  ffmpeg_g                vorbis_floor1_decode
349      5.2895  ffmpeg_g                vorbis_parse_audio_packet
285      4.3195  ffmpeg_g                vorbis_residue_decode_inner_loop_dim8
252      3.8193  ffmpeg_g                vector_fmul_window_sse
240      3.6375  ffmpeg_g                vorbis_inverse_coupling_sse
237      3.5920  libc-2.7.so             (no symbols)
219      3.3192  ffmpeg_g                vorbis_residue_decode_inner_loop_dim2
161      2.4401  ffmpeg_g                fft8_sse
142      2.1522  ffmpeg_g                float_to_int16_interleave_sse2
118      1.7884  ffmpeg_g                vector_fmul_sse
110      1.6672  ffmpeg_g                build_table
83       1.2580  ffmpeg_g                av_interleave_packet_per_dts
79       1.1973  ffmpeg_g                pcm_encode_frame
72       1.0912  ffmpeg_g                output_packet
>51      0.7730  ffmpeg_g                vorbis_residue_decode_inner_loop_dim4
35       0.5305  ffmpeg_g                av_encode
33       0.5002  ffmpeg_g                compute_pkt_fields2

In any case, 'vorbis_residue_decode_inner_loop_dimN' functions all together
usually take most of the time of 'vorbis_residue_decode' execution.

The attached patch already provides a visible performance improvement (up to 
5% for 256 kbit vorbis file, lower bitrate files gain less). But these
functions can be probably optimized using assembly (as gcc really does a poor 
job allocating registers here) and SIMD instructions. I wonder of it makes 
sense still inlining them, or moving them to dsputil would be also ok? 
Calling them via a pointer also makes it possible to create several versions 
with 'codebook_nb_bits' and 'codebook_nb_bits_mask' hardcoded as immediate 
operands in order to optimize registers allocation (values 8 and 11 
for 'codebook_nb_bits' might be used) and also in order to do UPDATE_CACHE
operations in an optimal way (once per 3 GET_VLC or once per 2 GET_VLC
operations). A pointer to the appropriate function could be set in each
instance of 'vorbis_codebook'

If assembly optimizations are going to be used for dealing with VLC tables,
this should be covered well by regressions tests in order to avoid potential
sudden breakages if changes may get introduced to the implementations of VLC
tables format in the future.

Other things to optimize later are 'vorbis_floor0_decode' as it is *very* slow 
when used (you have provided a link to an interesting file in one of the 
previous posts: http://www.gnu.org/fun/jokes/eternal-flame.ogg). And also
'render_line', which is a major contributor to 'ff_vorbis_floor1_render_list' 
time from the oprofile logs above. Both can be optimized quite trivially.
After implementation of a faster split-radix FFT was added, all these
performance bottlenecks are now much more visible ;)

-- 
Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vorbis_vlcfreq.diff
Type: text/x-diff
Size: 1741 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080902/fd379821/attachment.diff>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vorbis_vlc_opt_preview.diff
Type: text/x-diff
Size: 7937 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080902/fd379821/attachment-0001.diff>