[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations
Tue Sep 9 07:46:14 CEST 2008
On Wednesday 03 September 2008, Michael Niedermayer wrote:
> > Now here is some preliminary version (not intended for applying yet) of a
> > new patch which tries to implement the ideas which were quoted above.
OK, now is the next iteration.
> > The first thing that is important for performance is that a vast majority
> > of 'get_vlc2' calls perform only a singe table lookup. Probability of
> > single
> This is probably true for nearly all get_vlc* not only the ones in vorbis,
> so i guess the (un)likely stuff can be added to the existing code.
Yes, there's lots of common stuff in different decoders, which can be
optimized in similar ways after vorbis is done. For the time being I
removed __builtin_expect code from my patch as the performance improvement
is small and it is probably better to be added in a separate patch.
> > table lookup is usually higher than 90% when V_NB_BITS is equal to 8 and
> > more than 95% if V_NB_BITS is increased to 9. And that results not only
> > in a lower number of instructions executed, but also is much better for
> > branch predictor reducing conditional jump overhead. I used a test code
> > from
> > 'vorbis_vlcfreq.diff' (ugly hack) to measure these probabilities and
> > get the statistics.
> > As sometimes 'codebook_setup->maxdepth' is lower than V_NB_BITS, there is
> > no point in creating larger VLC tables, it saves memory and reduces the
> > number of cache misses. That's why 'if(codebook_setup->maxdepth <
> > V_NB_BITS) codebook_setup->nb_bits=codebook_setup->maxdepth' line was
> > added to code.
> > There is no point doing more than one UPDATE_CACHE operation per GET_VLC.
> > Moreover, as GET_VLC reads at most 11 bits (and typically just V_NB_BITS
> > bits) per table lookup, it is possible to do UPDATE_CACHE once per two or
> > even three calls to GET_VLC.
> > Generic SHOW_UBITS macro is also quite expensive.
> > As the first table lookup in GET_VLC always uses a constant number of
> > bits as index, it is possible to pre-calculate bitmask and use it to
> > speed up code.
> This could be added as a SHOW_CONST_UBITS
> also gcc should be able to build the mask itself at compile time as long as
> no asm shift tricks re used.
Sure. The only problem is that it would be nice to use the same macro for both
constant and non-constant expressions. Adding one more macro does not add much
convenience because the compiler can't either insert a constant or use asm
shift trick automatically. Or can it?
> > Most performance critical parts of 'vorbis_residue_decode' function were
> > extracted into separate 'vorbis_residue_decode_inner_loop_dimN' functions
> > for benchmarking purposes. Log from oprofile typically looks like this
> > (for 256 kbit vorbis file):
> > The attached patch already provides a visible performance improvement (up
> > to 5% for 256 kbit vorbis file, lower bitrate files gain less). But these
> > functions can be probably optimized using assembly (as gcc really does a
> > poor job allocating registers here) and SIMD instructions. I wonder of it
> > makes sense still inlining them, or moving them to dsputil would be also
> > ok?
> I tend slightly toward spliting the vlc reading from the vector adding and
> then optimize them seperately, and moving to dsputil for the cases where we
> have asm that is faster than C. This would reduce te register pressure a
> little and also make optimization easier.
> Though it would require an intermediate array and its not clear which way
> would be faster.
> Of course if it turns out to be faster to do all in one iam fine with that
> as well.
I tried to split loop into two passes (vlc decoding into a temporary buffer
and vector adding separated), but it was noticeably slower in my tests (with
the code for these two loops generated by gcc). Maybe hand optimized
code could perform better.
> Either way, id be very happy if i had a clean patch i could apply/approve
> that provides 5%, or more :) speedup. Sadly i dont have the time to really
> do much work on this beyond reviewing. My ffmpeg related todo list is
> growing far too long already.
Feedback and review from vorbis decoder maintainer and also from Loren Merritt
is welcome too.
A new patch now calls functions via a pointer. A loss here is that it is an
indirect jump. On the other hand, it also eliminates a set of if/else
constructions. Small tweaks also include elimination of FASTDIV in some
Some basic SSE optimizations are added, most likely they still can be
improved. As for implementing the functions completely in assembly, there
are two problems. If yasm is used, it needs to have knowledge about vlc
table binary format and also work with GetBitContext structure, and that
can be a bit fragile. If gcc inline assembly is used, there are some
troubles with the number of available registers, and optimizing register
allocation is what can give the most benefit here, so it could be also
Well, right now I get vorbis decoding performance improvement close
to ~8% on pentium-m. A lot more is gained on x86-64 core2. It was
not a clean test as I tried it in ssh session in virtual hosting
but I got something like ~16% improvement. Probably x86 with the
tiny number of registers is the platform which gains the least here.
Benchmark on ARM shows a very good speedup, not as spectacular as
x86-64, but it still also does not have assembly optimizations
Benchmarks from other platforms would be interesting to see, big endian
platforms can benefit from lower number of UPDATE_CACHE, which need
endian conversion there.
Regarding cleaning the patch, I wonder how it would be better to split
code between different files (if it is needed of course)?
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 27382 bytes
Desc: not available
More information about the ffmpeg-devel