[FFmpeg-devel] [PATCH] Added ff_v210_planar_unpack_aligned_avx2
James Darnley
james.darnley at gmail.com
Mon Mar 4 22:10:32 EET 2019
On 2019-03-01 18:41, Michael Stoner wrote:
> The AVX2 code leverages VPERMD to process 12 pixels/iteration. This is my first patch submission so any comments are greatly appreciated.
>
> -Mike
>
> Tested on Skylake (Win32 & Win64)
> 1920x1080 input frame
> =====================
> C code - 440 fps
> SSSE3 - 920 fps
> AVX - 930 fps
> AVX2 - 1040 fps
>
> Regression tested at 1920x1080, 1280x720, and 352x288
> .loop:
> %ifidn %1, unaligned
> - movu m0, [r0]
> + movu m0, [r0] ; yB v5 yA u5 y9 v4 y8 u4 y7 v3 y6 u3 y5 v2 y4 u2 y3 v1 y2 u1 y1 v0 y0 u0
> %else
> mova m0, [r0]
> %endif
At first I didn't understand why you do so much seemingly unnecessary
work. You don't change how the data loaded into register. After more
in-depth reading I see now that you shuffle data around just so you can
store the data with a single move for each plane. The chroma is below.
> +%if cpuflag(avx2)
> + vpermd m1, m6, m1 ; 00 v5 v4 v3 00 v2 v1 v0 00 u5 u4 u3 00 u2 u1 u0
> + pshufb m1, m7 ; 00 00 v5 v4 v3 v2 v1 v0 00 00 u5 u4 u3 u2 u1 u0
> + movu [r2+r4], xm1
> + vextracti128 [r3+r4], m1, 1
> +%else
> movq [r2+r4], m1
> movhps [r3+r4], m1
> +%endif
Sounds commendable but I doubt the use of this many more shuffles gets
you much over a naive AVX2 version (where you treat the high half of ymm
like an unroll).
> +; for AVX2 version only
> +v210_luma_permute: dd 0,1,2,4,5,6,7,7
> +v210_chroma_permute: dd 0,1,4,5,2,3,6,7
Are you sure these can't be replaced with vpermq and its immediate
operand? It really looks like the second could be. It'll save you a
register.
> - mova m3, [v210_mult]
> - mova m4, [v210_mask]
> - mova m5, [v210_luma_shuf]
> - mova m6, [v210_chroma_shuf]
> + mova m3, [v210_luma_shuf]
> + mova m4, [v210_chroma_shuf1]
> +
> +%if cpuflag(avx2)
> + mova m5, [v210_luma_permute] ; VPERMD constant must be in a register
> + mova m6, [v210_chroma_permute] ; VPERMD constant must be in a register
> + mova m7, [v210_chroma_shuf2]
> +%endif
> +
> +%if ARCH_X86_64
> + mova m8, [v210_mult]
> + mova m9, [v210_mask]
> +%endif
> +
It would let you clean this up a bit.
My suggestion is to make the diff minimal by keeping the existing uses
and if you still need more than 8 registers for avx2 then make it
available for x86-64 only.
Compare yours with the one I committed here
https://github.com/Upipe/upipe/blob/master/lib/upipe-v210/v210dec.asm#L45
which is just FFmpeg's cleaned up a little plus avx2. I'm surprised
it's not already in FFmpeg.
You should do whatever is faster.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 603 bytes
Desc: OpenPGP digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20190304/d47eda01/attachment.sig>
More information about the ffmpeg-devel
mailing list