[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations
Michael Niedermayer
michaelni
Wed Sep 3 05:23:02 CEST 2008
On Tue, Sep 02, 2008 at 06:03:15AM +0300, Siarhei Siamashka wrote:
> On Sunday 31 August 2008, Michael Niedermayer wrote:
> > On Sun, Aug 31, 2008 at 01:18:14PM +0300, Siarhei Siamashka wrote:
> [...]
> > > Regarding 'vorbis_residue_decode' function, it probably makes sense to
> > > optimize these loops:
> > >
> > > if(dim==2) {
> > >     for(k=0;k<step;++k) {
> > >         coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 2;
> > >         vec[voffs+k     ]+=codebook.codevectors[coffs  ]; // FPMATH
> > >         vec[voffs+k+vlen]+=codebook.codevectors[coffs+1]; // FPMATH
> > >     }
> > > } else if(dim==4) {
> > >     for(k=0;k<step;++k, voffs+=2) {
> > >         coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 4;
> > >         vec[voffs       ]+=codebook.codevectors[coffs  ]; // FPMATH
> > >         vec[voffs+1     ]+=codebook.codevectors[coffs+2]; // FPMATH
> > >         vec[voffs+vlen  ]+=codebook.codevectors[coffs+1]; // FPMATH
> > >         vec[voffs+vlen+1]+=codebook.codevectors[coffs+3]; // FPMATH
> > >     }
> > > } ...
> > >
> > > 'get_vlc2' call could be replaced with some GET_VLC/GET_RL_VLC variant
> > > so that the number of intermediate excessive UPDATE_CACHE operations is
> > > minimized.
> >
> > These are all nice ideas, but they aren't really related to the change here,
> > so patch welcome
> >
> > [...]
>
> Thank you for applying the previous patch.
>
> Now here is a preliminary version (not intended for applying yet) of a new
> patch which tries to implement the ideas quoted above.
>
> The first thing that is important for performance is that a vast majority
> of 'get_vlc2' calls perform only a single table lookup. Probability of single
This is probably true for nearly all get_vlc* calls, not only the ones in vorbis, so
I guess the (un)likely hints can be added to the existing code.
> table lookup is usually higher than 90% when V_NB_BITS is equal to 8 and more
> than 95% if V_NB_BITS is increased to 9. That results not only in fewer
> instructions executed, but is also much better for the branch predictor,
> reducing conditional jump overhead. I used test code from
> 'vorbis_vlcfreq.diff' (ugly hack) to measure these probabilities and
> get the statistics.
>
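The (un)likely idea can be sketched roughly as below. This is only a toy
two-level table decoder, not FFmpeg's get_vlc2: the names toy_vlc_entry,
toy_get_vlc, toy_likely and toy_unlikely are made up for illustration, the
"negative length means subtable" convention is an assumption of the sketch,
and the hint relies on GCC's __builtin_expect (other compilers fall back to
a plain condition).

```c
#include <assert.h>
#include <stdint.h>

/* Branch hints: GCC-specific builtin, plain expression elsewhere. */
#if defined(__GNUC__)
#define toy_likely(x)   __builtin_expect(!!(x), 1)
#define toy_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define toy_likely(x)   (x)
#define toy_unlikely(x) (x)
#endif

/* Hypothetical table entry: len > 0 is a code length and sym the symbol;
 * len < 0 means sym is the offset of a subtable indexed by -len bits. */
typedef struct {
    int16_t sym;
    int8_t  len;
} toy_vlc_entry;

/* Decode one symbol from the top bits of an MSB-first 32-bit cache and
 * consume the bits that were used. The second lookup is marked unlikely
 * because most codes are expected to resolve in the first table. */
static int toy_get_vlc(uint32_t *cache, const toy_vlc_entry *table, int bits)
{
    assert(bits >= 1 && bits <= 16);

    unsigned idx = *cache >> (32 - bits);
    int sym = table[idx].sym;
    int len = table[idx].len;

    if (toy_unlikely(len < 0)) {
        *cache <<= bits;                              /* drop first-level bits */
        idx = (*cache >> (32 + len)) + (unsigned)sym; /* -len subtable bits */
        sym = table[idx].sym;
        len = table[idx].len;
    }
    *cache <<= len;
    return sym;
}
```

Whether the hint actually helps depends on the compiler and the CPU, so it
would need to be benchmarked like everything else in this thread.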
> As 'codebook_setup->maxdepth' is sometimes lower than V_NB_BITS, there is no
> point in creating larger VLC tables; smaller tables save memory and reduce the
> number of cache misses. That's why the 'if(codebook_setup->maxdepth < V_NB_BITS)
> codebook_setup->nb_bits=codebook_setup->maxdepth' line was added to the code.
>
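To make the table-size argument concrete, here is a small sketch of the
clamping. TOY_V_NB_BITS, toy_pick_nb_bits and toy_first_level_entries are
hypothetical names, and the value 8 is just one of the two settings
mentioned above, not necessarily what the decoder ends up using.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_V_NB_BITS 8 /* assumed default; 8 and 9 are both discussed */

/* Never index the first-level table with more bits than the longest
 * codeword in the codebook needs. */
static int toy_pick_nb_bits(int maxdepth)
{
    assert(maxdepth >= 1);
    return maxdepth < TOY_V_NB_BITS ? maxdepth : TOY_V_NB_BITS;
}

/* Number of entries in a first-level table indexed by nb_bits bits. */
static size_t toy_first_level_entries(int nb_bits)
{
    return (size_t)1 << nb_bits;
}
```

With these assumptions a codebook whose longest code is 5 bits gets a
32-entry table instead of a 256-entry one.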
> There is no point doing more than one UPDATE_CACHE operation per GET_VLC.
> Moreover, as GET_VLC reads at most 11 bits (and typically just V_NB_BITS
> bits) per table lookup, it is possible to do UPDATE_CACHE once per two or
> even three calls to GET_VLC.
> The generic SHOW_UBITS macro is also quite expensive.
> As the first table lookup in GET_VLC always uses a constant number of bits
> as index, it is possible to pre-calculate a bitmask and use it to speed up the code.
This could be added as a SHOW_CONST_UBITS.
Also, gcc should be able to build the mask itself at compile time as long as
no asm shift tricks are used.
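A rough sketch of both points (constant-width peek, and one cache refill
shared by two lookups) follows. Everything here is hypothetical:
TOY_SHOW_CONST_UBITS, toy_bitreader, toy_update_cache and toy_get_two are
not FFmpeg names, the reader is a simplified MSB-first one that always
holds 32 valid bits after a refill, and codes are assumed to be no longer
than TOY_NB_BITS so that two of them always fit in one refill.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NB_BITS 8

/* Peek the top n bits of a 32-bit MSB-first cache. With a literal n the
 * shift amount is a compile-time constant. */
#define TOY_SHOW_CONST_UBITS(cache, n) ((uint32_t)(cache) >> (32 - (n)))

typedef struct {
    const uint8_t *buf;
    size_t size;   /* buffer size in bytes */
    size_t bitpos; /* number of bits already consumed */
} toy_bitreader;

/* Load the 32 bits starting at bitpos, zero-padded past the buffer end. */
static uint32_t toy_update_cache(const toy_bitreader *br)
{
    assert(br->bitpos <= br->size * 8);

    uint64_t v = 0;
    size_t byte = br->bitpos >> 3;
    for (size_t i = 0; i < 5; i++)
        v = (v << 8) | (byte + i < br->size ? br->buf[byte + i] : 0);
    /* v holds 40 bits; drop the already consumed bits of the first byte
     * and keep the following 32. */
    return (uint32_t)(v >> (8 - (br->bitpos & 7)));
}

/* Fetch two first-level table indices with a single refill. code_len[]
 * maps an index to the number of bits its code really uses. */
static void toy_get_two(toy_bitreader *br, const uint8_t *code_len,
                        unsigned *a, unsigned *b)
{
    uint32_t cache = toy_update_cache(br); /* the only refill */

    *a = TOY_SHOW_CONST_UBITS(cache, TOY_NB_BITS);
    cache <<= code_len[*a];
    br->bitpos += code_len[*a];

    *b = TOY_SHOW_CONST_UBITS(cache, TOY_NB_BITS);
    br->bitpos += code_len[*b];
}
```

A real reader guarantees fewer valid bits after a refill than this toy
does, so how many lookups can share one refill has to be checked against
the actual bitstream reader in use.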
>
> Most performance critical parts of 'vorbis_residue_decode' function were
> extracted into separate 'vorbis_residue_decode_inner_loop_dimN' functions
> for benchmarking purposes. Log from oprofile typically looks like this (for
> 256 kbit vorbis file):
>
> CPU: PIII, speed 1862.24 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit
> mask of 0x00 (No unit mask) count 99991
> samples  %        image name   symbol name
> 1062     12.9449  ffmpeg_g     ff_imdct_half_sse
> 976      11.8966  ffmpeg_g     vorbis_residue_decode_inner_loop_dim2
> 953      11.6163  ffmpeg_g     ff_vorbis_floor1_render_list
> 745      9.0809   ffmpeg_g     vorbis_residue_decode
> 671      8.1789   ffmpeg_g     pass_sse.loop
> 577      7.0332   ffmpeg_g     vorbis_floor1_decode
> 389      4.7416   ffmpeg_g     vorbis_parse_audio_packet
> 384      4.6806   ffmpeg_g     fft16_sse
> 378      4.6075   ffmpeg_g     vorbis_residue_decode_inner_loop_dim4
> 290      3.5349   libc-2.7.so  (no symbols)
> 266      3.2423   ffmpeg_g     vorbis_inverse_coupling_sse
> 224      2.7304   ffmpeg_g     vector_fmul_window_sse
> 161      1.9625   ffmpeg_g     fft8_sse
> 153      1.8649   ffmpeg_g     vector_fmul_sse
>
> Or sometimes like this (for 64 kbit vorbis file):
>
> CPU: PIII, speed 1862.24 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit
> mask of 0x00 (No unit mask) count 99991
> samples  %        image name   symbol name
> 1112     16.8536  ffmpeg_g     ff_imdct_half_sse
> 891      13.5041  ffmpeg_g     ff_vorbis_floor1_render_list
> 656      9.9424   ffmpeg_g     pass_sse.loop
> 408      6.1837   ffmpeg_g     vorbis_residue_decode
> 384      5.8199   ffmpeg_g     fft16_sse
> 372      5.6381   ffmpeg_g     vorbis_floor1_decode
> 349      5.2895   ffmpeg_g     vorbis_parse_audio_packet
> 285      4.3195   ffmpeg_g     vorbis_residue_decode_inner_loop_dim8
> 252      3.8193   ffmpeg_g     vector_fmul_window_sse
> 240      3.6375   ffmpeg_g     vorbis_inverse_coupling_sse
> 237      3.5920   libc-2.7.so  (no symbols)
> 219      3.3192   ffmpeg_g     vorbis_residue_decode_inner_loop_dim2
> 161      2.4401   ffmpeg_g     fft8_sse
> 142      2.1522   ffmpeg_g     float_to_int16_interleave_sse2
> 118      1.7884   ffmpeg_g     vector_fmul_sse
> 110      1.6672   ffmpeg_g     build_table
> 83       1.2580   ffmpeg_g     av_interleave_packet_per_dts
> 79       1.1973   ffmpeg_g     pcm_encode_frame
> 72       1.0912   ffmpeg_g     output_packet
> 51       0.7730   ffmpeg_g     vorbis_residue_decode_inner_loop_dim4
> 35       0.5305   ffmpeg_g     av_encode
> 33       0.5002   ffmpeg_g     compute_pkt_fields2
>
> In any case, 'vorbis_residue_decode_inner_loop_dimN' functions all together
> usually take most of the time of 'vorbis_residue_decode' execution.
>
> The attached patch already provides a visible performance improvement (up to
> 5% for 256 kbit vorbis file, lower bitrate files gain less). But these
> functions can probably be optimized using assembly (as gcc really does a poor
> job allocating registers here) and SIMD instructions. I wonder if it still
> makes sense to inline them, or whether moving them to dsputil would also be OK?
I lean slightly toward splitting the VLC reading from the vector adding and
then optimizing them separately, moving to dsputil the cases where we
have asm that is faster than C. This would reduce the register pressure a
little and also make optimization easier.
It would require an intermediate array, though, and it's not clear which way
would be faster.
Of course, if it turns out to be faster to do it all in one, I am fine with that
as well.
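For illustration only, the split could look something like this for the
dim==2 case. The names toy_index_stream, toy_next_index, toy_read_offsets
and toy_add_codevectors_dim2 are hypothetical, and the "bitstream" is just
an array of already decoded codebook indices standing in for the real VLC
reader; the second stage follows the dim==2 loop quoted at the top.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the VLC reader: hands out codebook indices one by one. */
typedef struct {
    const unsigned *idx;
    size_t pos;
} toy_index_stream;

static unsigned toy_next_index(toy_index_stream *s)
{
    return s->idx[s->pos++];
}

/* Stage 1: only bitstream work, results go to the intermediate array. */
static void toy_read_offsets(toy_index_stream *s, unsigned *coffs,
                             size_t step, unsigned dim)
{
    assert(dim > 0);
    for (size_t k = 0; k < step; k++)
        coffs[k] = toy_next_index(s) * dim;
}

/* Stage 2: only float adds, a candidate for a dsputil-style function. */
static void toy_add_codevectors_dim2(float *vec, size_t voffs, size_t vlen,
                                     const float *codevectors,
                                     const unsigned *coffs, size_t step)
{
    for (size_t k = 0; k < step; k++) {
        vec[voffs + k]        += codevectors[coffs[k]];
        vec[voffs + k + vlen] += codevectors[coffs[k] + 1];
    }
}
```

The cost is the extra store and load of coffs[], which is exactly the part
that would have to be measured.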
Either way, I'd be very happy if I had a clean patch I could apply/approve
that provides a 5% or greater :) speedup. Sadly, I don't have the time to really
do much work on this beyond reviewing. My ffmpeg-related todo list is growing
far too long already.
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
I am the wisest man alive, for I know one thing, and that is that I know
nothing. -- Socrates