[FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main loop and other small optimizations for ~20% speedup.

Tue Nov 17 19:50:42 EET 2020

On Mon, Nov 16, 2020 at 11:03 AM Alan Kelly
<alankelly-at-google.com at ffmpeg.org> wrote:
> +cglobal yuv2yuvX, 6, 7, 16, filter, filterSize, dest, dstW, dither, offset, src
Only 8 xmm registers are used, so the xmm register count here should
be 8 instead of 16. Declaring 16 causes unnecessary saving and
restoring of the callee-saved xmm registers on 64-bit Windows.

> +%if ARCH_X86_64
> +%define ptr_size 8
[...]
> +%else
> +%define ptr_size 4
The predefined variable gprsize already exists for this purpose, so
that can be used instead.

> +    movq                 xmm3, [ditherq]
If vpbroadcastq m3, [ditherq] is used for AVX2 here, then the following
> +    vperm2i128           m3, m3, m3, 0
instruction can be eliminated.

> +    punpcklwd            m1, m1
> +    punpckldq            m1, m1
These two instructions can be replaced with a single
pshuflw m1, m1, q0000
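
To illustrate the word shuffles involved, here is a rough portable C
model (not intrinsics, and not part of the patch). The xmm_words type
and the model_* helpers are made up for this sketch. It only checks the
low 64 bits of the register; in this model the upper four words differ
between the two sequences, so it assumes the surrounding code only
consumes the low half.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one xmm register as eight 16-bit words, w[0] lowest. */
typedef struct { uint16_t w[8]; } xmm_words;

/* punpcklwd r, r: interleave the low four words of r with themselves. */
static xmm_words model_punpcklwd_self(xmm_words r)
{
    xmm_words out;
    for (int i = 0; i < 4; i++) {
        out.w[2 * i]     = r.w[i];
        out.w[2 * i + 1] = r.w[i];
    }
    return out;
}

/* punpckldq r, r: interleave the low two dwords of r with themselves. */
static xmm_words model_punpckldq_self(xmm_words r)
{
    xmm_words out;
    for (int i = 0; i < 2; i++) {
        out.w[4 * i]     = r.w[2 * i];
        out.w[4 * i + 1] = r.w[2 * i + 1];
        out.w[4 * i + 2] = r.w[2 * i];
        out.w[4 * i + 3] = r.w[2 * i + 1];
    }
    return out;
}

/* pshuflw r, r, q0000: broadcast word 0 into the low four words,
 * leave the high four words untouched. */
static xmm_words model_pshuflw_q0000(xmm_words r)
{
    xmm_words out = r;
    for (int i = 0; i < 4; i++)
        out.w[i] = r.w[0];
    return out;
}

/* Asserts that both sequences agree on the low 64 bits (words 0..3). */
static void check_low_half(xmm_words in)
{
    xmm_words a = model_punpckldq_self(model_punpcklwd_self(in));
    xmm_words b = model_pshuflw_q0000(in);
    assert(memcmp(a.w, b.w, 4 * sizeof(uint16_t)) == 0);
}
```

Calling check_low_half() on any input passes, since both paths leave
word 0 replicated across words 0..3.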

> +    mov                  srcq, [filterSizeq]
> +    test                 srcd, srcd
test srcq, srcq should be used here, since the lower 32 bits of a
valid pointer could randomly happen to be zero on a 64-bit system.
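
A small C sketch of the failure mode, for illustration only. The
function names and the example address are hypothetical; the pointer is
modelled as a plain 64-bit integer so the snippet does not depend on
the host pointer size.

```c
#include <assert.h>
#include <stdint.h>

/* Models `test srcd, srcd`: only the low 32 bits are examined. */
static int looks_null_test32(uint64_t ptr_bits)
{
    return (uint32_t)ptr_bits == 0;
}

/* Models `test srcq, srcq`: all 64 bits are examined. */
static int looks_null_test64(uint64_t ptr_bits)
{
    return ptr_bits == 0;
}

/* Hypothetical non-NULL address whose low 32 bits are all zero. */
static void demo_truncated_test(void)
{
    const uint64_t p = UINT64_C(0x00007f0100000000);
    assert(looks_null_test32(p));   /* wrongly treated as NULL   */
    assert(!looks_null_test64(p));  /* correctly seen as non-NULL */
}
```

With the 32-bit test the loop would terminate early on such a pointer;
the 64-bit test only stops on a real NULL.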

> +    REP_RET
Since non-temporal stores are being used, this should be replaced with
    sfence
    RET
to guarantee proper memory ordering semantics in multi-threaded use
cases. Things will usually work fine without it, but can break in
"fun to debug" ways.
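
For reference, the same pattern expressed in C with SSE2 intrinsics, as
a minimal sketch and not the patch's code. stream_copy16 is a
hypothetical helper; it assumes a 16-byte aligned destination and a
length that is a multiple of 16, and falls back to memcpy when SSE2 is
not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copies n bytes with non-temporal stores, then fences so the weakly
 * ordered stores are globally visible before any store that follows
 * (for example a flag that tells another thread the data is ready). */
static void stream_copy16(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst & 15) == 0 && (n & 15) == 0);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)((const char *)src + i));
        _mm_stream_si128((__m128i *)((char *)dst + i), v); /* movntdq */
    }
    _mm_sfence(); /* order the non-temporal stores before returning */
#else
    memcpy(dst, src, n); /* fallback: ordinary stores, no fence needed here */
#endif
}
```

Without the fence the copy itself still completes; what is not
guaranteed is the order in which another thread observes these stores
relative to later ones.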