[FFmpeg-devel] [PATCH] Added ff_v210_planar_unpack_aligned_avx2

Mon Mar 4 22:10:32 EET 2019

On 2019-03-01 18:41, Michael Stoner wrote:
> The AVX2 code leverages VPERMD to process 12 pixels/iteration.  This is my first patch submission so any comments are greatly appreciated.
> 
> -Mike
> 
> Tested on Skylake (Win32 & Win64)
> 1920x1080 input frame
> =====================
> C code - 440 fps
> SSSE3  - 920 fps
> AVX    - 930 fps
> AVX2   - 1040 fps
> 
> Regression tested at 1920x1080, 1280x720, and 352x288

>  .loop:
>  %ifidn %1, unaligned
> -    movu   m0, [r0]
> +    movu   m0, [r0]                    ; yB v5 yA  u5 y9 v4  y8 u4 y7  v3 y6 u3  y5 v2 y4  u2 y3 v1  y2 u1 y1  v0 y0 u0
>  %else
>      mova   m0, [r0]
>  %endif

At first I didn't understand why you do so much seemingly unnecessary
work.  You don't change how the data loaded into register.  After more
in-depth reading I see now that you shuffle data around just so you can
store the data with a single move for each plane.  The chroma is below.

> +%if cpuflag(avx2)
> +    vpermd m1, m6, m1                  ; 00 v5 v4 v3 00 v2 v1 v0 00 u5 u4 u3 00 u2 u1 u0
> +    pshufb m1, m7                      ; 00 00 v5 v4 v3 v2 v1 v0 00 00 u5 u4 u3 u2 u1 u0
> +    movu   [r2+r4], xm1
> +    vextracti128 [r3+r4], m1, 1
> +%else
>      movq   [r2+r4], m1
>      movhps [r3+r4], m1
> +%endif

Sounds commendable but I doubt the use of this many more shuffles gets
you much over a naive AVX2 version (where you treat the high half of ymm
like an unroll).

> +; for AVX2 version only
> +v210_luma_permute: dd 0,1,2,4,5,6,7,7
> +v210_chroma_permute: dd 0,1,4,5,2,3,6,7

Are you sure these can't be replaced with vpermq and its immediate
operand?  It really looks like the second could be.  It'll save you a
register.

> -    mova   m3, [v210_mult]
> -    mova   m4, [v210_mask]
> -    mova   m5, [v210_luma_shuf]
> -    mova   m6, [v210_chroma_shuf]
> +    mova   m3, [v210_luma_shuf]
> +    mova   m4, [v210_chroma_shuf1]
> +
> +%if cpuflag(avx2)
> +    mova   m5, [v210_luma_permute]      ; VPERMD constant must be in a register
> +    mova   m6, [v210_chroma_permute]    ; VPERMD constant must be in a register
> +    mova   m7, [v210_chroma_shuf2]
> +%endif
> +
> +%if ARCH_X86_64
> +    mova   m8, [v210_mult]
> +    mova   m9, [v210_mask]
> +%endif
> +

It would let you clean this up a bit.

My suggestion is to make the diff minimal by keeping the existing uses
and if you still need more than 8 registers for avx2 then make it
available for x86-64 only.

Compare yours with the one I committed here
https://github.com/Upipe/upipe/blob/master/lib/upipe-v210/v210dec.asm#L45
which is just FFmpeg's cleaned up a little plus avx2.  I'm surprised
it's not already in FFmpeg.

You should do whatever is faster.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 603 bytes
Desc: OpenPGP digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20190304/d47eda01/attachment.sig>