[FFmpeg-devel] [RFC/PATCH] More flexible variant of float_to_int16 , WMA optimization, Vorbis

Mon Jul 14 07:20:19 CEST 2008

On Tuesday 08 July 2008, Michael Niedermayer wrote:
> On Mon, Jul 07, 2008 at 10:39:32AM +0300, Siarhei Siamashka wrote:
> > Here is a patch which adds a bit more flexible variant of
> > 'float_to_int16' function
> > ('more_flexible_variant_of_float_to_int16.diff').
> >
> > It can be used for quite a noticeable WMA decoding performance
> > improvement ('float_to_int16_wma.diff'), which is at least ~15% in my
> > tests. Using current 'float_to_int16' is hard for WMA without introducing
> > unnecessary intermediate operations involving interleaving samples in
> > temporary buffer.
>
> Maybe, but this doesnt mean we should not benchmark that intermediate
> operation + current float_to_int16().

Do you mean something like a benchmark against this patch?
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2008-May/046620.html

Without the intermediate temporary buffer, performance is ~5% better. That
was to be expected even without benchmarks.

Or did you want to suggest trying interleaved output of
imdct/vector_fmul_add_add for WMA, just as was done for vorbis?
Theoretically that could also be benchmarked, but it would be a waste of time
in my opinion, considering the results we see in the vorbis case.

> I really do not care how much faster your new asm code is compared to ISO
> C. What i am interrested in is how it does compare to the current asm code.

That was actually not the main goal of my post. I just wanted to point out:
1. Interleaving channels at the decoding stage vs. interleaving them during
conversion to the int16 format.
2. The inflexibility of the float_to_int16 API, which makes it hard to use for
WMA, DCA, AC3 and probably some other decoders.
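
To make point 1 concrete, here is a minimal sketch of interleaving during
conversion. It is not the code from the attached patches: the function names
and the signature (an array of per-channel pointers) are hypothetical, and it
assumes the float samples are already scaled to the int16 range.

```c
#include <stdint.h>

/* Hypothetical helper: round to nearest and clip to the int16 range.
 * Assumes the input is already scaled to roughly [-32768, 32767]. */
static int16_t clip_to_int16(float v)
{
    if (v >= 32767.0f)
        return 32767;
    if (v <= -32768.0f)
        return -32768;
    return (int16_t)(v < 0 ? v - 0.5f : v + 0.5f);
}

/* Hypothetical signature: src[c] points to the planar samples of channel c.
 * Conversion and interleaving happen in one pass, so no temporary
 * interleaved float buffer is needed. */
static void float_to_int16_planar(int16_t *dst, const float *const *src,
                                  int len, int channels)
{
    for (int i = 0; i < len; i++)
        for (int c = 0; c < channels; c++)
            dst[i * channels + c] = clip_to_int16(src[c][i]);
}
```

With such an interface the channel buffers can live anywhere in memory, which
is the flexibility the current API lacks.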

> > Currently 'dca.c', 'ac3dec.c' use extra code for interleaving samples and
> > can be optimized.
> >
> > Also 'float_to_int16_vorbis.diff' contains a patch which moves channel
> > interleaving logic from 'vector_*' function to 'float_to_int16_*'. It
> > simplifies the logic in 'vorbis_parse_audio_packet' and creates
> > opportunities for further optimizations. Also it makes vorbis decoding
> > a bit faster (something like ~1.5% in my tests on Pentium-M) because of
>
> Ive just optimized float_to_int16 a little (it matches your proposed code
> in terms of optimizations now)
> So i think you should redo the benchmarks

My benchmarks showed that doing interleaving at conversion to int16 format is
still faster.

Anyway, it looks like the vorbis maintainer has read our discussion and
applied the above-mentioned optimizations in commits 14205 and 14207. That's
good progress, and some optimization opportunities still exist.

For example, it is possible to get rid of "memcpy(saved, buf+blocksize/4,
blocksize/4*sizeof(float))" and probably "vc->buf" by writing the output of
"fft.imdct_half" directly to "vc->ret" and "vc->saved".
That should further improve both performance and L1 cache use, making the
vorbis decoder even better than it is now.

Another possible cache-related optimization is to get rid of the temporary
buffer in imdct and try performing the bit-reverse reordering in place. But it
may actually turn out to be slower (requiring more instructions) and less
flexible, as the split-radix fft from djbfft uses a different, less
symmetrical reordering.
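
For reference, a plain in-place bit-reverse reordering could look like the
sketch below. It shows the textbook radix-2 permutation on real floats only,
not the imdct code in libavcodec and not the djbfft ordering, and the function
names are made up for illustration.

```c
/* Reverse the low `bits` bits of index i. */
static unsigned bit_reverse(unsigned i, int bits)
{
    unsigned r = 0;
    for (int b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r;
}

/* Reorder n = 1 << bits floats in place, without a temporary buffer.
 * Each pair (i, j) is swapped once, when i < j. */
static void bit_reverse_inplace(float *data, int bits)
{
    unsigned n = 1u << bits;
    for (unsigned i = 0; i < n; i++) {
        unsigned j = bit_reverse(i, bits);
        if (i < j) {
            float tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

The index computation and the conditional swap are the extra instructions
mentioned above; whether they cost more than the temporary buffer would have
to be measured.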

Still, the newly added 'float_to_int16_interleave' function uses a fixed
stride between planes, which may make its use not so straightforward for WMA
and other codecs. But I may give it a try.
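
To illustrate the fixed-stride limitation, here is a rough sketch. The name
and signature are hypothetical and are not the function that was committed;
the samples are assumed to be in the int16 range already, so clipping is
omitted.

```c
#include <stdint.h>

/* Hypothetical fixed-stride variant: channel c starts at src + c * stride.
 * This only works when all planes sit in one buffer at equal distances. */
static void interleave_fixed_stride(int16_t *dst, const float *src,
                                    int len, int channels, int stride)
{
    for (int i = 0; i < len; i++)
        for (int c = 0; c < channels; c++)
            /* plain cast: assumes in-range input, no clipping or rounding */
            dst[i * channels + c] = (int16_t)src[c * stride + i];
}
```

A decoder that keeps its channel buffers in separate allocations, or at
unequal offsets, cannot describe them with a single stride and would have to
copy the data first.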

> > Regarding the subject, does it make sense to completely replace current
> > 'float_to_int16' function and use a new one instead? Using new function
> > instead of old one is simple (though a bit cumbersome because it would
> > require creating a temporary array with a single entry, holding a pointer
> > to samples). And using old function is problematic for at least WMA, DCA,
> > AC3.
> >
> > Problems to solve are efficient handling of non 1 or 2 channels case. It
> > needs to be investigated if a generic variant can be optimized well (at
> > least it should be faster than manual interleaving of samples) and what
> > other special number of channels cases should be handled. Also I can do
> > ARM VFP optimizations, but 3DNow! and AltiVec versions would be needed.
>
> It has been discussed already a few times ....
> our decoders should output their native format, that may be planar floats
> for several of these codecs
> And a converter (containing your proposed float_to_int16) could
> then convert and interleave this when the user app / audio out hardware
> can not handle planar audio.
> Are you interrested to work on such converter?

I'm not sure if I understand you completely. Do you suggest making it some
kind of audio format conversion filter used outside libavcodec? I'm actually
more interested in just improving performance (primarily for ARM).

-- 
Best regards,
Siarhei Siamashka