[FFmpeg-devel] [RFC/PATCH] More flexible variant of float_to_int16, WMA optimization, Vorbis
Michael Niedermayer
michaelni
Mon Jul 7 23:44:58 CEST 2008
On Mon, Jul 07, 2008 at 10:39:32AM +0300, Siarhei Siamashka wrote:
> Hi,
>
> Here is a patch which adds a bit more flexible variant of 'float_to_int16'
> function ('more_flexible_variant_of_float_to_int16.diff').
>
> It can be used for quite a noticeable WMA decoding performance improvement
> ('float_to_int16_wma.diff'), which is at least ~15% in my tests. Using current
> 'float_to_int16' is hard for WMA without introducing unnecessary intermediate
> operations involving interleaving samples in temporary buffer.
Maybe, but this doesnt mean we should not benchmark that intermediate
operation + current float_to_int16().
I really do not care how much faster your new asm code is compared to ISO C.
What i am interrested in is how it does compare to the current asm code.
>
> Currently 'dca.c', 'ac3dec.c' use extra code for interleaving samples and can
> be optimized.
>
> Also 'float_to_int16_vorbis.diff' contains a patch which moves channel
> interleaving logic from 'vector_*' function to 'float_to_int16_*'. It
> simplifies the logic in 'vorbis_parse_audio_packet' and creates
> opportunities for further optimizations. Also it makes vorbis decoding
> a bit faster (something like ~1.5% in my tests on Pentium-M) because of
Ive just optimized float_to_int16 a little (it matches your proposed code
in terms of optimizations now)
So i think you should redo the benchmarks
[...]
> Regarding the subject, does it make sense to completely replace current
> 'float_to_int16' function and use a new one instead? Using new function
> instead of old one is simple (though a bit cumbersome because it would
> require creating a temporary array with a single entry, holding a pointer to
> samples). And using old function is problematic for at least WMA, DCA, AC3.
>
> Problems to solve are efficient handling of non 1 or 2 channels case. It needs
> to be investigated if a generic variant can be optimized well (at least it
> should be faster than manual interleaving of samples) and what other special
> number of channels cases should be handled. Also I can do ARM VFP
> optimizations, but 3NOW and Altivec versions would be needed.
It has been discussed already a few times ....
our decoders should output their native format, that may be planar floats
for several of these codecs
And a converter (containing your proposed float_to_int16) could
then convert and interleave this when the user app / audio out hardware
can not handle planar audio.
Are you interrested to work on such converter?
[...]
> Index: libavcodec/i386/dsputil_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.c (revision 14080)
> +++ libavcodec/i386/dsputil_mmx.c (working copy)
> @@ -2052,6 +2052,43 @@
> asm volatile("emms");
> }
>
> +static void float_to_int16_join_channels_sse(int16_t *dst, const float *src[], int len, int ch){
> + int i, k;
> + if (ch == 1) {
> + float_to_int16_sse(dst, src[0], len);
> + } else if (ch == 2 && !(len & 3)) {
> + x86_reg idx = -(len*4);
> + asm volatile(
> + "1: \n\t"
> + "cvtps2pi 0(%2, %0), %%mm0 \n\t"
> + "cvtps2pi 0(%3, %0), %%mm1 \n\t"
> + "cvtps2pi 8(%2, %0), %%mm2 \n\t"
> + "cvtps2pi 8(%3, %0), %%mm3 \n\t"
> + "add $16, %0 \n\t"
> + "packssdw %%mm1, %%mm0 \n\t"
> + "packssdw %%mm3, %%mm2 \n\t"
> + "pshufw $0xD8, %%mm0, %%mm0 \n\t"
> + "pshufw $0xD8, %%mm2, %%mm2 \n\t"
> + "movq %%mm0, -16(%1, %0) \n\t"
> + "movq %%mm2, -8(%1, %0) \n\t"
> + "js 1b \n\t"
> + "emms \n\t"
> + : "+r" (idx)
> + : "r" (dst + len*2), "r"(src[0] + len), "r"(src[1] + len)
> + : "cc", "memory"
> + );
> + } else {
> + for(k = 0; k < ch; k++) {
> + int16_t *ptr = dst + k;
> + const float *iptr = src[k];
> + for(i = 0; i < len; i++) {
> + *ptr = av_clip_int16(lrintf(*iptr++));
> + ptr += ch;
> + }
> + }
> + }
> +}
I think the code in else will perform very poor speed wise on any
modern cpu which fill cache lines on write misses
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
I have often repented speaking, but never of holding my tongue.
-- Xenocrates
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080707/23646d1e/attachment.pgp>
More information about the ffmpeg-devel
mailing list