[FFmpeg-devel] [PATCH] NEON FFT/IMDCT

Thu Sep 3 15:37:41 CEST 2009

On Sep 3, 2009, at 5:41 AM, Naotoshi Nojiri wrote:

> Hi,
>
> I tested the patch on Cortex-A8 @500MHz (BeagleBoard).
> FFT (fft-test -s):
> 440.8 -> 34.2 us/transform (12.9x speed up)
> IMDCT (fft-test -i -m -s):
> 142.4 -> 11.8 us/transform (12.1x speed up)
>
> I had written NEON intrinsics code a bit, but this is my first
> ARM/NEON code in assembly.
> So, any comments and suggestions would be appreciated.

> +__attribute__((noinline)) void ff_imdct_half_neon(MDCTContext *s, FFTSample *output, const FFTSample *input)

Use av_noinline here instead of spelling out the attribute.

> +fft4_neon:				// r0: FFTComplex *z
> +	vld1.32		{d16-d19}, [r0, :128] // q8{r0,i0,r1,i1} q9{r2,i2,r3,i3}
> +	vext.32		q9, q9, q9, #1
> +	vswp		d17, d18	// q8{r0,i0,i2,r3} q9{r1,i1,i3,r2}
> +	vadd.f32	q10, q8, q9	// {t1,t2,t5,t6}
> +	vsub.f32	q9, q8, q9	// {t3,t4,t7,t8}
> +	vrev64.32	d21, d21
> +	vswp		d21, d18	// q10{t1,t2,t3,t4} q9{t6,t5,t7,t8}
> +	vadd.f32	q8, q10, q9	// {r0,i0,r1,i1}
> +	vsub.f32	q9, q10, q9	// {r2,i2,r3,i3}
> +	vst1.32		{d16-d19}, [r0, :128]
> +	bx		lr

This sequence is very much latency-bound; doing the vadd/vsub on d
registers and then a vtrn.32 should be faster. On A8, most NEON
floating-point instructions are no faster on a q register than on its
two d registers separately, so if working on 64-bit registers avoids
some permutes, it will probably be worth it.

> +function ff_fft_calc_neon, export=1
> +	ldr		r2, [r0]
> +	mov		r0, r1
> +	subs		r2, r2, #3
> +	blt		fft4_neon
> +
> +	push		{r4-r6, lr}
> +	movrel		r3, fft_dispatch_neon
> +	mov		lr, pc
> +	ldr		pc, [r3, r2, lsl #2]
> +	pop		{r4-r6, pc}

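This causes a branch misprediction: as far as I know the return
predictor only treats bl and blx as calls, so the callee's return is
mispredicted after mov lr, pc / ldr pc. Always call functions with bl
or blx (for an indirect call, load the target into a register and blx
to it).
That said, in this case it looks like it could simply return directly
from pass_neon.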