[FFmpeg-devel] [RFC/PATCH] More flexible variant of float_to_int16, WMA optimization, Vorbis

Wed Jul 16 00:28:32 CEST 2008

On Tuesday 15 July 2008, Loren Merritt wrote:
> On Tue, 15 Jul 2008, Siarhei Siamashka wrote:
> > On Tuesday 15 July 2008, Loren Merritt wrote:
> >> But eliminating the memcpy requires increasing the amount of memory
> >> used, since you then need to keep one saved array per channel plus one
> >> for the current block to be pointer-swapped. This is faster if the data
> >> still fits in L1 after that expansion, but slower if you have an old cpu
> >> with a small cache.
> >
> > Why would it increase memory use? We still keep the "vc->saved" buffer,
> > and all the needed data ends up in it after each iteration. Maybe
> > imdct_half could produce non-contiguous output, storing part of the data
> > directly to "vc->ret" and part directly to "saved", to avoid moving
> > bytes around later.
>
> Before: the right half of the current imdct'ed block can be in any old
> temp buffer, and is copied into saved[] after we're done using the
> previous value of saved[].
>
> After: the right half of the current imdct'ed block must be in a buffer of
> size blocksize/4, which can be swapped with the previous saved[]. We can't
> write the imdct'ed block directly into saved[], since we need both values
> at the same time. There aren't any other arrays of exactly the right size
> to cannibalize, and we can't re-use something bigger or we're wasting even
> more memory due to increased size of the other saved[] entries.

Well, merging the loops that run after the iFFT and combining them with the
windowing code could give interesting results. At the very least it should
eliminate a lot of intermediate loads and stores. If the iFFT output were
processed in a single loop, maybe that loop could read the old saved data and
replace it with the new saved data at the same time? That should work at
least in the simple case where the previous and current blocks have the same
size.
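
To make the "read old saved data and replace it in the same loop" idea
concrete, here is a rough sketch in plain C. It is not the actual vorbis
decoder code: the function name, the cur_left/cur_right arrays and the
symmetric window layout are all assumptions made for illustration. In a really
merged loop, cur_left[i] and cur_right[i] would be values produced on the fly
by the post-iFFT rotation rather than loaded from memory.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: windowed overlap-add for two blocks of equal size,
 * updating saved[] in the same pass.
 *   saved[]     tail of the previous block (overwritten with the new tail)
 *   cur_left[]  first half of the current block (assumed name)
 *   cur_right[] second half of the current block (assumed name)
 *   win[]       rising window of length n, used mirrored for the old tail
 * out[] must not alias saved[].
 */
static void overlap_add_merged(float *out, float *saved,
                               const float *cur_left, const float *cur_right,
                               const float *win, size_t n)
{
    assert(out && saved && cur_left && cur_right && win);
    for (size_t i = 0; i < n; i++) {
        float old = saved[i];          /* read the previous tail first */
        out[i]   = old * win[n - 1 - i] + cur_left[i] * win[i];
        saved[i] = cur_right[i];       /* then overwrite it with the new tail */
    }
}
```

Because each saved[i] is read before it is overwritten, no second buffer and
no pointer swap is needed in this equal-size case. Blocks of different sizes
are not covered by the sketch.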

I'm not suggesting anything new here. IIRC this was discussed at length
before, but it looks like we finally have a chance to get it implemented now
that things have started moving :)

> See patch (which won't apply to svn, since it depends on other patches I
> haven't committed yet, but the strategy should be clear).

Hmm, did you forget to attach this patch?

[...]

> > By the way, have you benchmarked SSE2 optimized "float_to_int16_*"
> > functions? On what kind of CPU should they be faster than the SSE
> > versions?
>
> (cycles)
> k8:
> 4676 float_to_int16_c
>   818 float_to_int16_3dnow
>   698 float_to_int16_sse
>   691 float_to_int16_sse2
> 6654 float_to_int16_interleave_c
> 1965 float_to_int16_interleave_3dnow
> 1161 float_to_int16_interleave_sse
> 1304 float_to_int16_interleave_sse2
>
> conroe:
> 3040 float_to_int16_c
>   457 float_to_int16_sse
>   356 float_to_int16_sse2
> 4586 float_to_int16_interleave_c
> 1030 float_to_int16_interleave_sse
> 1071 float_to_int16_interleave_sse2
>
> penryn:
> 3164 float_to_int16_c
>   505 float_to_int16_sse
>   324 float_to_int16_sse2
> 4910 float_to_int16_interleave_c
> 1062 float_to_int16_interleave_sse
>   782 float_to_int16_interleave_sse2
>
> prescott-celeron:
> 8770 float_to_int16_c
> 1596 float_to_int16_sse
>   738 float_to_int16_sse2
> 3670 float_to_int16_interleave_c
> 3500 float_to_int16_interleave_sse
> 2219 float_to_int16_interleave_sse2

OK, thanks. I only asked because I also benchmarked SSE vs. SSE2 on pentium-m
and core2 (conroe?) and was surprised to see the SSE2 version of
float_to_int16_interleave being slower in both cases.

But could you also benchmark the SSE version of float_to_int16_interleave from
my original submission on the cores where SSE2 was winning? It is quite a bit
faster than the code from SVN in my tests:

FLOAT_TO_INT16_INTERLEAVE(sse,
    "1:                              \n"
    "cvtps2pi  (%2,%0), %%mm0        \n"
    "cvtps2pi 8(%2,%0), %%mm2        \n"
    "cvtps2pi  (%3,%0), %%mm1        \n"
    "cvtps2pi 8(%3,%0), %%mm3        \n"
    "add         $16,   %0           \n"
    "packssdw    %%mm1, %%mm0        \n"
    "packssdw    %%mm3, %%mm2        \n"
    "pshufw      $0xD8, %%mm0, %%mm0 \n"
    "pshufw      $0xD8, %%mm2, %%mm2 \n"
    "movq        %%mm0, -16(%1,%0)   \n"
    "movq        %%mm2, -8(%1,%0)    \n"
    "js 1b                           \n"
    "emms                            \n"
)
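
For anyone who doesn't read MMX fluently, here is a scalar C model of what
the loop body above computes for two channels. It is only an illustration:
the helper names are made up, it handles two samples per channel per step
(the asm does four), and it rounds half away from zero to avoid depending on
libm, whereas cvtps2pi follows the MXCSR rounding mode (round to nearest even
by default). Inputs are assumed to fit in a 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate to the int16_t range, as packssdw does for each lane. */
static int16_t sat16(long v)
{
    if (v >  32767) return  32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Simplified float -> integer conversion (half away from zero). */
static long round_simple(float x)
{
    return (long)(x + (x >= 0.0f ? 0.5f : -0.5f));
}

/* Hypothetical scalar model of the stereo interleave; len is the number
 * of samples per channel. */
static void interleave2_model(int16_t *dst, const float *ch0,
                              const float *ch1, int len)
{
    assert(len % 4 == 0);  /* the asm consumes 4 samples per channel per pass */
    for (int i = 0; i < len; i += 2) {
        /* cvtps2pi: two floats from each channel -> two 32-bit integers */
        long a0 = round_simple(ch0[i]), a1 = round_simple(ch0[i + 1]);
        long b0 = round_simple(ch1[i]), b1 = round_simple(ch1[i + 1]);
        /* packssdw mm1, mm0: 16-bit words are now { a0, a1, b0, b1 } */
        int16_t w[4] = { sat16(a0), sat16(a1), sat16(b0), sat16(b1) };
        /* pshufw $0xD8 picks words 0,2,1,3 -> { a0, b0, a1, b1 } */
        dst[2 * i + 0] = w[0];
        dst[2 * i + 1] = w[2];
        dst[2 * i + 2] = w[1];
        dst[2 * i + 3] = w[3];
    }
}
```

The point of the pshufw is that a single shuffle turns the packed
{ a0, a1, b0, b1 } layout into interleaved order, so each movq stores two
complete stereo frames.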

Benchmarks from Pentium-M:

Current SVN:
12749 dezicycles in float_to_int16_interleave_sse, 4091 runs, 5 skips

Alternative version with 'pshufw':
10719 dezicycles in float_to_int16_interleave_sse, 4094 runs, 2 skips

That is roughly 16% fewer cycles on the Pentium-M. I also tried it on core2
and it was faster there too, but I have no access to that PC at the moment
and don't have the results at hand.

-- 
Best regards,
Siarhei Siamashka