[FFmpeg-devel] [PATCH 3/3] Use DSPContext.vector_fmul() and DSPContext.vector_fmul_reverse() in floating-point version of apply_window(). 46% faster in function apply_window().
Justin Ruggles
justin.ruggles
Tue Jan 4 17:31:11 CET 2011
On 01/01/2011 10:30 PM, Justin Ruggles wrote:
> On 01/01/2011 10:09 PM, Michael Niedermayer wrote:
>
>> On Fri, Dec 31, 2010 at 03:11:40PM -0500, Justin Ruggles wrote:
>>> diff --git libavcodec/ac3enc_float.c libavcodec/ac3enc_float.c
>>> index 6a061d6..addc84f 100644
>>> --- libavcodec/ac3enc_float.c
>>> +++ libavcodec/ac3enc_float.c
>>> @@ -77,16 +77,13 @@ static void mdct512(AC3MDCTContext *mdct, float *out, float *in)
>>> /**
>>> * Apply KBD window to input samples prior to MDCT.
>>> */
>>> -static void apply_window(float *output, const float *input,
>>> +static void apply_window(DSPContext *dsp, float *output, const float *input,
>>> const float *window, int n)
>>> {
>>> - int i;
>>> int n2 = n >> 1;
>>> -
>>> - for (i = 0; i < n2; i++) {
>>> - output[i] = input[i] * window[i];
>>> - output[n-i-1] = input[n-i-1] * window[i];
>>> - }
>>> + memcpy(output, input, n2 * sizeof(*input));
>>> + dsp->vector_fmul(output, window, n2);
>>> + dsp->vector_fmul_reverse(output+n2, input+n2, window, n2);
>>
>> The memcpy is ugly
>
>
> yeah, I know... I'll see if I can implement a new version of
> vector_fmul that will handle different input from output and compare the
> speed.
Currently we have vector_fmul() implementations for C, NEON, VFP,
AltiVec, 3DNow, and SSE. I implemented vector_fmul_copy(), which takes
2 sources and 1 destination, for C, AltiVec, 3DNow, and SSE. The AltiVec
version of vector_fmul_copy() has not been tested, but I implemented it
in the hope that someone else will test and review it. Here are some
benchmarks on my Athlon64. Benchmark numbers are in dezicycles.
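
For readers following along, a two-source, one-destination multiply might look roughly like the sketch below in plain C. The function name follows the message, but the argument order and the _c suffix are assumptions for illustration, not taken from the patch.

```c
/* Hypothetical C reference for a two-source, one-destination multiply.
 * Unlike the in-place vector_fmul(dst, src, len) used in the quoted diff,
 * the destination here is separate from both inputs, so no memcpy is
 * needed beforehand. The signature is an assumption. */
static void vector_fmul_copy_c(float *dst, const float *src0,
                               const float *src1, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = src0[i] * src1[i];
}
```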

C (current SVN): 13366

memcpy(256) + vector_fmul(256) + vector_fmul_reverse(256)
  C:     18014
  3DNow: 10193
  SSE:    8685

vector_fmul_copy(256) + vector_fmul_reverse(256)
  C:     16312
  3DNow:  8682
  SSE:    7280

vector_fmul_copy(512)
  C:     16165
  3DNow:  6043
  SSE:    6193

Note that the 3DNow version of vector_fmul_copy(512) is faster than the
SSE version on my system for some reason. I'm not sure how to detect
this case, or whether it is consistent across all CPUs, only across
Athlon64s, or something else.
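
The three benchmarked call sequences can be sketched in plain C as below. The fmul*_ref helpers are stand-ins for the DSPContext routines with assumed signatures. The single-call variant also assumes a full-length window table whose second half mirrors the first, which the message does not spell out.

```c
#include <string.h>

/* Stand-ins for the DSP routines; signatures are assumptions. */
static void fmul_ref(float *dst, const float *src, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] *= src[i];                 /* in place */
}

static void fmul_reverse_ref(float *dst, const float *src0,
                             const float *src1, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = src0[i] * src1[len - 1 - i];
}

static void fmul_copy_ref(float *dst, const float *src0,
                          const float *src1, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = src0[i] * src1[i];
}

/* Hypothetical helper: expand a half window of n/2 entries into a
 * symmetric window of n entries. */
static void mirror_window(float *full, const float *half, int n)
{
    int i;
    for (i = 0; i < n / 2; i++)
        full[i] = full[n - 1 - i] = half[i];
}

/* memcpy + vector_fmul + vector_fmul_reverse, as in the quoted diff. */
static void window_memcpy(float *out, const float *in,
                          const float *win, int n)
{
    int n2 = n >> 1;
    memcpy(out, in, n2 * sizeof(*in));
    fmul_ref(out, win, n2);
    fmul_reverse_ref(out + n2, in + n2, win, n2);
}

/* vector_fmul_copy + vector_fmul_reverse: the memcpy goes away. */
static void window_copy_reverse(float *out, const float *in,
                                const float *win, int n)
{
    int n2 = n >> 1;
    fmul_copy_ref(out, in, win, n2);
    fmul_reverse_ref(out + n2, in + n2, win, n2);
}

/* One vector_fmul_copy over all n samples, using the mirrored window. */
static void window_copy_full(float *out, const float *in,
                             const float *full_win, int n)
{
    fmul_copy_ref(out, in, full_win, n);
}
```

All three produce the same output for an even n. The last one trades a window table of twice the size for a single forward pass with no reversed loads.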
I also tried rewriting the current C version in SSE as
vector_fmul_window2(). It was faster than fmul_copy+fmul_reverse since
it basically merges the 2 loops, but it was slower than
vector_fmul_copy(512). I left it out of the patch. If anyone is
interested I can send it.

vector_fmul_window2(512)
  SSE:    7021

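
A merged loop of this kind could look roughly like the following with SSE intrinsics. This is not the code that was benchmarked. The function name is hypothetical, and the sketch assumes x86 with SSE, n/2 a multiple of 4, and output not overlapping input. Unaligned loads and stores keep the example simple.

```c
#include <xmmintrin.h>

/* Hypothetical merged-loop sketch: each iteration windows 4 samples
 * from the front half and the 4 mirrored samples from the back half,
 * so the window is read only once per pair. */
static void window2_sse_sketch(float *output, const float *input,
                               const float *window, int n)
{
    int n2 = n >> 1;
    int i;

    for (i = 0; i < n2; i += 4) {
        __m128 w  = _mm_loadu_ps(window + i);          /* w[i..i+3]       */
        __m128 lo = _mm_loadu_ps(input + i);           /* in[i..i+3]      */
        __m128 hi = _mm_loadu_ps(input + n - i - 4);   /* in[n-i-4..n-i-1] */
        /* Reverse the 4 window coefficients for the mirrored half. */
        __m128 wr = _mm_shuffle_ps(w, w, _MM_SHUFFLE(0, 1, 2, 3));

        _mm_storeu_ps(output + i,         _mm_mul_ps(lo, w));
        _mm_storeu_ps(output + n - i - 4, _mm_mul_ps(hi, wr));
    }
}
```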
Thanks,
Justin