[FFmpeg-devel] [PATCH v3] mdct15: add assembly optimizations for the 15-point FFT

Fri Jun 23 03:44:44 EEST 2017

On Fri, Jun 23, 2017 at 12:44 AM, Rostislav Pehlivanov
<atomnuker at gmail.com> wrote:
> +%macro FFT5 3 ; %1 - in_offset, %2 - dst1 (64bit used), %3 - dst2
> +    movddup xm0, [inq + 0*16 +  0 + %1] ; in[ 0].re, in[ 0].im, in[ 0].re, in[ 0].im
> +    movsd   xm1, [inq + 1*16 +  8 + %1] ; in[ 3].re, in[ 3].im,         0,         0
> +    movsd   xm2, [inq + 2*16 + 16 + %1] ; in[ 5].re, in[ 5].im, in[ 6].re, in[ 6].im
> +    movsd   xm3, [inq + 4*16 +  8 + %1] ; in[ 8].re, in[ 8].im, in[ 9].re, in[ 9].im
> +    movsd   xm4, [inq + 6*16 +  0 + %1] ; in[12].re, in[12].im,         0,         0
> +
> +    vinsertf128  m0,  xm0, 1
> +
> +    shufps      xm1,  xm2, q1010        ; in[ 3].re, in[ 3].im, in[ 6].re, in[ 6].im
> +    shufps      xm4,  xm3, q1010        ; in[12].re, in[12].im, in[ 9].re, in[ 9].im

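Use vbroadcastsd instead of movddup + vinsertf128; with a memory operand it
loads the 64-bit pair and replicates it across both lanes in one instruction.
Use movhps for the second load of each pair instead of movsd + shufps; it
fills the upper half of the register directly, which should save the shuffle
and the temporary register.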
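
In case it helps, here is a rough scalar C model of the movhps point. It is
not SIMD code, and the type and helper names (vec4, model_movsd and so on) are
made up for illustration. It only shows that movsd + movsd + shufps q1010 and
movsd + movhps leave the same four floats in the register:

```c
#include <string.h>

/* Scalar model of a 128-bit register holding four floats (hypothetical type). */
typedef struct { float f[4]; } vec4;

/* movsd xmm, [mem]: load 64 bits into the low half, zero the high half. */
static vec4 model_movsd(const float *mem) {
    vec4 r = {{ mem[0], mem[1], 0.0f, 0.0f }};
    return r;
}

/* shufps dst, src, q1010: low half from dst, high half from the low half of src. */
static vec4 model_shufps_q1010(vec4 dst, vec4 src) {
    vec4 r = {{ dst.f[0], dst.f[1], src.f[0], src.f[1] }};
    return r;
}

/* movhps xmm, [mem]: load 64 bits into the high half, keep the low half. */
static vec4 model_movhps(vec4 dst, const float *mem) {
    dst.f[2] = mem[0];
    dst.f[3] = mem[1];
    return dst;
}

/* Patch version: two movsd loads merged with shufps. */
static vec4 pair_with_shufps(const float *a, const float *b) {
    return model_shufps_q1010(model_movsd(a), model_movsd(b));
}

/* Suggested version: movsd for the low half, movhps for the high half. */
static vec4 pair_with_movhps(const float *a, const float *b) {
    return model_movhps(model_movsd(a), b);
}

/* Returns 1 when both registers hold the same four lanes. */
static int same_lanes(vec4 x, vec4 y) {
    return memcmp(x.f, y.f, sizeof x.f) == 0;
}
```

Here a and b stand for the addresses of the two complex values being paired,
e.g. in[3] and in[6] for the first shufps.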

> +%macro BUTTERFLIES_DC 2 ; %1 - exptab_offset, %2 - out
> +    movaps m0, [exptabq + %1]
> +    vextractf128 xm1, m0, 1
> +
> +    mulps   xm1, xm10
> +    mulps   xm0, xm9

Multiply straight from memory instead:

mulps xm0, xm9, [exptabq + %1]
mulps xm1, xm10, [exptabq + %1 + 16]

(Cross-lane shuffles such as vextractf128 are slow; avoid them when possible.)
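
A small scalar C sketch of why the vextractf128 in BUTTERFLIES_DC can be
replaced by two memory operands: the upper lane of a 32-byte load is simply
the 16 bytes starting at offset +16. The types and function names below are
hypothetical, and the arrays stand in for the registers:

```c
#include <string.h>

typedef struct { float f[4]; } vec4; /* scalar model of an xmm register */
typedef struct { float f[8]; } vec8; /* scalar model of a ymm register */

/* Lane-wise multiply, modelling mulps on xmm registers. */
static vec4 mul4(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.f[i] = a.f[i] * b.f[i];
    return r;
}

/* Patch version: one 32-byte load, then split off the upper lane. */
static void mul_via_extract(const float *tab, vec4 c_lo, vec4 c_hi,
                            vec4 *out_lo, vec4 *out_hi) {
    vec8 m0;
    vec4 lo, hi;
    memcpy(m0.f, tab, sizeof m0.f);       /* movaps m0, [tab]            */
    memcpy(hi.f, m0.f + 4, sizeof hi.f);  /* vextractf128 xm1, m0, 1     */
    memcpy(lo.f, m0.f, sizeof lo.f);      /* xm0 is the low lane of m0   */
    *out_lo = mul4(lo, c_lo);             /* mulps xm0, xm9              */
    *out_hi = mul4(hi, c_hi);             /* mulps xm1, xm10             */
}

/* Suggested version: two 16-byte memory operands, 4 floats (16 bytes) apart. */
static void mul_via_loads(const float *tab, vec4 c_lo, vec4 c_hi,
                          vec4 *out_lo, vec4 *out_hi) {
    vec4 lo, hi;
    memcpy(lo.f, tab, sizeof lo.f);       /* [tab]      */
    memcpy(hi.f, tab + 4, sizeof hi.f);   /* [tab + 16] */
    *out_lo = mul4(lo, c_lo);
    *out_hi = mul4(hi, c_hi);
}
```

Both functions produce the same outputs; the second just never needs the
256-bit register or the lane extract.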

> +%macro BUTTERFLIES_AC 2 ; exptab, exptab_offset, src1, src2, src3, out (uses m0-m3)
> +    mulps m0, m12, [exptabq + 64*0 + 0*mmsize + %1]
> +    mulps m1, m12, [exptabq + 64*0 + 1*mmsize + %1]
> +    mulps m2, m13, [exptabq + 64*1 + 0*mmsize + %1]
> +    mulps m3, m13, [exptabq + 64*1 + 1*mmsize + %1]
> +
> +    shufps m1, m1, q2301
> +    shufps m3, m3, q2301
> +
> +    addps m0, m1
> +    addps m2, m3
> +    addps m0, m2

Both shuffles use the same q2301 pattern, so adding m1 and m3 first and
shuffling the sum once should allow you to remove one shufps. It might also
be beneficial to reorder the multiplies so that m1 and m3 are calculated
before m0 and m2.
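
To illustrate the shufps point, a scalar C sketch (hypothetical names, one
128-bit lane modelled as four floats). The q2301 shuffle swaps the two floats
in each pair, and a permutation commutes with lane-wise addition, so one
shuffle on the sum does the job of two. Note that the additions end up in a
different order, so results can differ in the last bit for general inputs:

```c
/* Scalar model of one 128-bit lane holding four floats (hypothetical type). */
typedef struct { float f[4]; } vec4;

/* Lane-wise add, modelling addps. */
static vec4 add4(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* shufps x, x, q2301: swap the two floats within each 64-bit pair. */
static vec4 shuf_q2301(vec4 a) {
    vec4 r = {{ a.f[1], a.f[0], a.f[3], a.f[2] }};
    return r;
}

/* Patch version: shuffle m1 and m3 separately, then three adds. */
static vec4 sum_two_shuffles(vec4 m0, vec4 m1, vec4 m2, vec4 m3) {
    m1 = shuf_q2301(m1);
    m3 = shuf_q2301(m3);
    m0 = add4(m0, m1);
    m2 = add4(m2, m3);
    return add4(m0, m2);
}

/* Suggested version: add m1 and m3 first, then a single shuffle. */
static vec4 sum_one_shuffle(vec4 m0, vec4 m1, vec4 m2, vec4 m3) {
    m1 = add4(m1, m3);
    m0 = add4(m0, m2);
    m1 = shuf_q2301(m1);
    return add4(m0, m1);
}
```

The add count stays at three; only the second shufps goes away.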

> +cglobal fft15, 4, 6, 14, out, in, exptab, stride, stride3, stride5
> +%define out0q inq
> +    shl strideq, 3
> +
> +    movaps m5, [exptabq + 480]
> +    vextractf128 xm6, m5, 1

Use two loads here as well instead of a cross-lane shuffle; the upper half
can be loaded directly from [exptabq + 480 + 16].