[FFmpeg-devel] [RFC/RFBench] AVX FFT
Vitor Sessak
vitor1001 at gmail.com
Fri Apr 1 20:41:25 CEST 2011
On 04/01/2011 08:16 PM, Michael Niedermayer wrote:
> On Fri, Apr 01, 2011 at 07:12:47PM +0200, Vitor Sessak wrote:
>> Hi,
>>
>> The following patches add an AVX (an Intel x86 extension) FFT
>> implementation. Since I do not have a Sandy Bridge CPU myself, I have no
>> idea of its performance. Benchmarks (e.g., using fft-test -s) are thus
>> very welcome. Also welcome are suggestions for optimizing it further, in
>> particular the 8-point FFT (in the T8_AVX macro), which is not much
>> faster than the SSE version.
>>
>> One noteworthy thing about AVX is that it uses 256-bit registers, so
>> now av_malloc needs to align the pointers to 32-byte boundaries. If this
>> patch is accepted, I'll have to change a bunch of audio decoders to
>> increase their buffers' alignment (note that AVX does not fault if a
>> 256-bit load is done on a 128-bit aligned pointer, unless an explicitly
>> aligned move such as vmovaps is used, but the load may straddle a cache
>> line and thus take a performance hit).
>>
>> -Vitor
>>
>> PS: cross-posted to both lists since I'm interested in feedback from
>> both groups.
>
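To illustrate the 32-byte alignment requirement in the quoted message, here is a minimal C sketch of an over-allocating aligned allocator and an alignment check. The names is_aligned, alloc_aligned32 and free_aligned32 are hypothetical and are not FFmpeg API; this is not a description of how av_malloc is implemented, only one portable way to obtain such a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not FFmpeg API): nonzero if p is a multiple of align. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Hypothetical allocator returning a 32-byte aligned block.
 * It over-allocates, rounds the address up to the next multiple of 32,
 * and stores the pointer returned by malloc just below the aligned block. */
void *alloc_aligned32(size_t size)
{
    void *raw = malloc(size + 31 + sizeof(void *));
    if (!raw)
        return NULL;
    uintptr_t start   = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + 31) & ~(uintptr_t)31;
    ((void **)aligned)[-1] = raw;   /* remembered for free_aligned32 */
    assert(is_aligned((void *)aligned, 32));
    return (void *)aligned;
}

/* Releases a block obtained from alloc_aligned32. */
void free_aligned32(void *p)
{
    if (p)
        free(((void **)p)[-1]);
}
```

With a float buffer obtained this way, buf + 4 lies 16 bytes past a 32-byte boundary: it is still valid for 128-bit aligned access but not for 256-bit aligned access, which is the case the quoted note is about.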
> Note, I don't know AVX (yet) and don't have a CPU that supports it, so the
> review below is a bit lame. The code looks largely OK, though, for
> someone who has not had time to look at the datasheets.
>
>
> [...]
>> --- a/libavcodec/x86/fft_mmx.asm
>> +++ b/libavcodec/x86/fft_mmx.asm
>> @@ -1,6 +1,7 @@
>> ;******************************************************************************
>> ;* FFT transform with SSE/3DNow optimizations
>> ;* Copyright (c) 2008 Loren Merritt
>> +;* AVX ASM Copyright (c) 2011 Vitor Sessak
>> ;*
>> ;* This algorithm (though not any of the implementation details) is
>> ;* based on libdjbfft by D. J. Bernstein.
>> @@ -49,11 +50,22 @@ endstruc
>> SECTION_RODATA
>>
>> %define M_SQRT1_2 0.70710678118654752440
>> -ps_root2: times 4 dd M_SQRT1_2
>> -ps_root2mppm: dd -M_SQRT1_2, M_SQRT1_2, M_SQRT1_2, -M_SQRT1_2
>> -ps_p1p1m1p1: dd 0, 0, 1<<31, 0
>> +%define M_COS_PI_1_8 0.923879532511287
>> +%define M_COS_PI_3_8 0.38268343236509
>> +
>> +ps_cos16_1: dd 1.0, M_COS_PI_1_8, M_SQRT1_2, M_COS_PI_3_8, 1.0, M_COS_PI_1_8, M_SQRT1_2, M_COS_PI_3_8
>> +ps_cos16_2: dd 0, M_COS_PI_3_8, M_SQRT1_2, M_COS_PI_1_8, 0, -M_COS_PI_3_8, -M_SQRT1_2, -M_COS_PI_1_8
>> +
>> +ps_root2: times 8 dd M_SQRT1_2
>> +ps_root2mppm: dd -M_SQRT1_2, M_SQRT1_2, M_SQRT1_2, -M_SQRT1_2, -M_SQRT1_2, M_SQRT1_2, M_SQRT1_2, -M_SQRT1_2
>> +ps_p1p1m1p1: dd 0, 0, 1<<31, 0, 0, 0, 1<<31, 0
>> ps_m1p1: dd 1<<31, 0
>>
>
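As a sanity check on the constants in the hunk quoted above, the following C sketch tests them against trigonometric identities instead of calling the math library. It assumes that M_COS_PI_1_8 is meant to be cos(pi/8) and that M_COS_PI_3_8 is cos(3*pi/8), which equals sin(pi/8); under that reading the first four entries of ps_cos16_1 and ps_cos16_2 would be cos(k*pi/8) and sin(k*pi/8) for k = 0..3. The PATCH_ names and the function twiddle_residual are made up for this example.

```c
/* Values copied from the quoted patch, renamed so they cannot clash
 * with M_SQRT1_2 from math.h. */
#define PATCH_SQRT1_2    0.70710678118654752440
#define PATCH_COS_PI_1_8 0.923879532511287
#define PATCH_COS_PI_3_8 0.38268343236509

static double absd(double x)
{
    return x < 0 ? -x : x;
}

/* Returns the largest residual over identities that c1 = cos(pi/8),
 * c3 = sin(pi/8) and s = cos(pi/4) have to satisfy. */
double twiddle_residual(void)
{
    const double c1 = PATCH_COS_PI_1_8;
    const double c3 = PATCH_COS_PI_3_8;
    const double s  = PATCH_SQRT1_2;
    const double r[4] = {
        absd(c1 * c1 + c3 * c3 - 1.0), /* cos^2 x + sin^2 x = 1        */
        absd(c1 * c1 - c3 * c3 - s),   /* cos 2x = cos^2 x - sin^2 x   */
        absd(2.0 * c1 * c3 - s),       /* sin 2x = 2 sin x cos x       */
        absd(2.0 * s * s - 1.0),       /* cos^2(pi/4) = 1/2            */
    };
    double worst = 0.0;
    for (int i = 0; i < 4; i++)
        if (r[i] > worst)
            worst = r[i];
    return worst;
}
```

The residual should be limited only by the number of digits given in the patch, far below single-precision resolution, which is what matters since the tables are stored with dd as 32-bit floats.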
>> +perm1: dd 0x00, 0x02, 0x03, 0x01, 0x03, 0x00, 0X02, 0x01
>> +perm2: dd 0x00, 0x01, 0x02, 0x03, 0x01, 0x00, 0X02, 0x03
> ^
> upper case
Fixed locally.
-Vitor
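
For readers unfamiliar with the 1<<31 entries in ps_p1p1m1p1 and ps_m1p1 in the quoted patch: they look like sign masks, presumably meant to be applied with a bitwise XOR (xorps) so that selected lanes are negated and the others are left alone. The patch context shown here does not include the instructions that use them, so that reading is an assumption. A scalar C sketch of the idea, with hypothetical function names:

```c
#include <stdint.h>
#include <string.h>

/* XOR the bit pattern of x with mask; models one 32-bit lane of xorps. */
float xor_bits(float x, uint32_t mask)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* type-pun without aliasing problems */
    u ^= mask;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Applies the first four lanes of ps_p1p1m1p1 (0, 0, 1<<31, 0) in place.
 * Only lane 2 has the IEEE 754 sign bit set, so only v[2] is negated. */
void apply_p1p1m1p1(float v[4])
{
    static const uint32_t mask[4] = { 0, 0, UINT32_C(1) << 31, 0 };
    for (int i = 0; i < 4; i++)
        v[i] = xor_bits(v[i], mask[i]);
}
```

The table name matches this reading: p1p1m1p1 corresponds to the factors +1, +1, -1, +1, and the patch repeats the four-lane pattern to fill a 256-bit register.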