[FFmpeg-devel] [PATCH] avfilter/vf_w3fdif: add x86 SIMD

Fri Oct 9 20:06:44 CEST 2015

On 10/9/2015 1:44 PM, Paul B Mahol wrote:
> +cglobal w3fdif_complex_low, 4, 7, 9, 0, work_line, in_lines_cur0, coef, linesize
> +    movq                  m3, [coefq]
> +    DEFINE_ARGS    work_line, in_lines_cur0, in_lines_cur1, linesize, offset, in_lines_cur2, in_lines_cur3
> +    SPLATW                m0, m3, 0
> +    SPLATW                m1, m3, 1
> +    SPLATW                m2, m3, 2
> +    SPLATW                m3, m3, 3
> +    SBUTTERFLY            wd, 0, 1, 7
> +    SBUTTERFLY            wd, 2, 3, 7

Looking at this again, m0 and m1 end up holding the same data after the
SBUTTERFLY, and so do m2 and m3: each just repeats one pair of coeffs. So
there's no need for the SBUTTERFLY to interleave the coeffs. You can splat
two of them (one dword) per register directly:

movq   m0, [coefq+0]
pshufd m2, m0, q1111
SPLATD m0

And since this saves two registers, you can enable the function for
x86_32.
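
To make the equivalence concrete, here is a minimal scalar C sketch of the
two ways of building the coefficient registers. It assumes 128-bit XMM
registers (eight 16-bit lanes); splat_word, sbutterfly_wd and splat_pair are
hypothetical helper names that only model what SPLATW, SBUTTERFLY wd and the
pshufd/SPLATD pair do to the lanes. They are not FFmpeg functions.

```c
#include <stdint.h>

#define LANES 8  /* assumed: 16-bit lanes in one 128-bit XMM register */

/* Model of SPLATW: copy word `idx` of the four coeffs into every lane. */
static void splat_word(int16_t dst[LANES], const int16_t src[4], int idx)
{
    for (int i = 0; i < LANES; i++)
        dst[i] = src[idx];
}

/* Model of SBUTTERFLY wd: afterwards `a` holds the interleaved low halves
 * of a and b, and `b` holds the interleaved high halves. */
static void sbutterfly_wd(int16_t a[LANES], int16_t b[LANES])
{
    int16_t lo[LANES], hi[LANES];

    for (int i = 0; i < LANES / 2; i++) {
        lo[2 * i]     = a[i];
        lo[2 * i + 1] = b[i];
        hi[2 * i]     = a[i + LANES / 2];
        hi[2 * i + 1] = b[i + LANES / 2];
    }
    for (int i = 0; i < LANES; i++) {
        a[i] = lo[i];
        b[i] = hi[i];
    }
}

/* Model of the suggested pshufd/SPLATD: broadcast the 32-bit pair
 * (src[2*pair], src[2*pair+1]) to all four dword positions. */
static void splat_pair(int16_t dst[LANES], const int16_t src[4], int pair)
{
    for (int i = 0; i < LANES / 2; i++) {
        dst[2 * i]     = src[2 * pair];
        dst[2 * i + 1] = src[2 * pair + 1];
    }
}
```

Running splat_word for coeffs 0 and 1 followed by sbutterfly_wd leaves both
registers with the lane pattern c0,c1,c0,c1,..., which is what
splat_pair(dst, coef, 0) produces in a single register.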

> +    mov              offsetq, 0
> +    mov       in_lines_cur3q, [in_lines_cur0q+gprsize*3]
> +    mov       in_lines_cur2q, [in_lines_cur0q+gprsize*2]
> +    mov       in_lines_cur1q, [in_lines_cur0q+gprsize]
> +    mov       in_lines_cur0q, [in_lines_cur0q]
> +
> +.loop
> +    movh                                   m4, [in_lines_cur0q+offsetq]
> +    movh                                   m5, [in_lines_cur1q+offsetq]
> +    pxor                                   m7, m7

You can zero this once outside the loop, as long as nothing inside the loop
overwrites it. That makes it one pxor total instead of two per loop iteration.

> +    punpcklbw                              m4, m7
> +    punpcklbw                              m5, m7
> +    SBUTTERFLY                             wd, 4, 5, 7

Use any free reg here and below for the fourth argument to avoid overwriting
the zeroed one.

> +    pmaddwd                                m4, m0
> +    pmaddwd                                m5, m1

Use m0 for both here, of course.

> +    movh                                   m6, [in_lines_cur2q+offsetq]
> +    movh                                   m8, [in_lines_cur3q+offsetq]
> +    pxor                                   m7, m7
> +    punpcklbw                              m6, m7
> +    punpcklbw                              m8, m7
> +    SBUTTERFLY                             wd, 6, 8, 7
> +    pmaddwd                                m6, m2
> +    pmaddwd                                m8, m3

And use m2 for both here (or renumber it to m1).
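
As a sketch of why one coefficient register is enough for both pmaddwd
instructions: after SBUTTERFLY wd, the low and the high register both hold
(line0, line1) pixel pairs, so both need the same (c0, c1) pairs as
multiplier. The C below is a scalar model, assuming 8 pixels per iteration;
pmaddwd_model and weigh_two_lines are hypothetical names for illustration.

```c
#include <stdint.h>

/* Model of pmaddwd on one 128-bit register: multiply eight signed 16-bit
 * lanes pairwise and add adjacent products into four 32-bit lanes. */
static void pmaddwd_model(int32_t dst[4], const int16_t a[8], const int16_t b[8])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (int32_t)a[2 * i]     * b[2 * i] +
                 (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Weigh eight pixels from two lines with (c0, c1), using a single
 * coefficient register for both interleaved halves. */
static void weigh_two_lines(int32_t out[8],
                            const uint8_t line0[8], const uint8_t line1[8],
                            int16_t c0, int16_t c1)
{
    int16_t lo[8], hi[8], coef_pair[8];

    for (int i = 0; i < 4; i++) {
        /* punpcklbw + SBUTTERFLY wd: zero-extended, interleaved pixels */
        lo[2 * i]     = line0[i];
        lo[2 * i + 1] = line1[i];
        hi[2 * i]     = line0[i + 4];
        hi[2 * i + 1] = line1[i + 4];
        /* the splatted (c0, c1) dword */
        coef_pair[2 * i]     = c0;
        coef_pair[2 * i + 1] = c1;
    }
    pmaddwd_model(out,     lo, coef_pair);  /* pixels 0..3 */
    pmaddwd_model(out + 4, hi, coef_pair);  /* pixels 4..7 */
}
```

Each out[i] comes out as line0[i]*c0 + line1[i]*c1, for the low and the high
half alike.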

> +    paddd                                  m4, m6
> +    paddd                                  m5, m8
> +    mova               [work_lineq+offsetq*4], m4
> +    mova        [work_lineq+offsetq*4+mmsize], m5
> +    add                               offsetq, mmsize/2
> +    sub                             linesized, mmsize/2
> +    jg .loop
> +REP_RET

The same can be done for complex_high (even if it's not enough to get it
working on x86_32).
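
For checking a revised complex_low against expected output, a scalar
reference along these lines may help. It is a minimal sketch derived from
the arithmetic of the quoted loop (two pmaddwd results summed with paddd and
stored as 32-bit values); complex_low_ref is a hypothetical name, and the
parameters simply mirror the cglobal arguments.

```c
#include <stdint.h>

/* Scalar model of the quoted loop: each output is the coefficient-weighted
 * sum of the co-located pixels of four input lines, kept as 32-bit. */
static void complex_low_ref(int32_t *work_line,
                            const uint8_t *const in_lines_cur[4],
                            const int16_t *coef, int linesize)
{
    for (int i = 0; i < linesize; i++) {
        int32_t sum = 0;

        for (int k = 0; k < 4; k++)
            sum += in_lines_cur[k][i] * coef[k];
        work_line[i] = sum;
    }
}
```

Note that the SIMD loop advances by mmsize/2 pixels per iteration, so it may
write past linesize up to the next multiple of that step. A comparison
against this reference should only look at the first linesize entries.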