[FFmpeg-devel] [PATCH] RFC: v210enc optimisations and initial AVX-512
Henrik Gramner
henrik at gramner.com
Fri Oct 21 16:57:54 EEST 2022
On Fri, Oct 21, 2022 at 5:41 AM Kieran Kunhya <kierank at obe.tv> wrote:
>
> Hi,
>
> Please see attached an attempt to optimise the 8-bit input to v210enc to
> reduce the number of shuffles.
> This comes at the cost of having to extract the middle element and perform
> a DWORD shift on it and then reinserting it.
> I have added a few comments but any other ideas are welcome.
Random untested idea:
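A: db 32,  0, 48, -1,  1, 33,  2, -1, 49,  3, 34, -1,  4, 50,  5, -1
   db 35,  6, 51, -1,  7, 36,  8, -1, 52,  9, 37, -1, 10, 53, 11, -1
   db 38, 12, 54, -1, 13, 39, 14, -1, 55, 15, 40, -1, 16, 56, 17, -1
   db 41, 18, 57, -1, 19, 42, 20, -1, 58, 21, 43, -1, 22, 59, 23, -1
B: db 4, 0, 64, 0       ; first sample << 2, third sample << 6 (on top of the 16-bit word offset)
C: dd 0x000ff000        ; bits 12-19: the middle sample, scaled to 10 bits
[...]
    mova           m2, [A]
    vpbroadcastd   m3, [B]
    vpbroadcastd   m6, [C]
[...]
.loop:
    movu          ym1, [yq]          ; y -> bytes 0-31
    vinserti32x4   m1, [uq], 2       ; u -> bytes 32-47
    vinserti32x4   m1, [vq], 3       ; v -> bytes 48-63
    CLIPUB         m1, m4, m5
    vpermb         m1, m2, m1        ; v210 sample order, one don't-care byte per dword
    pmaddubsw      m0, m1, m3        ; first and third sample of each dword
    pslld          m1, 4             ; middle sample -> bits 12-19
    vpternlogd     m0, m1, m6, 0xd8  ; C ? B : A
    movu       [dstq], m0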
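
For anyone who wants to check the data flow without AVX-512 hardware, here is a rough scalar Python model of one loop iteration, compared against a plain v210 packer. The shuffle table is copied from the message. The function names are hypothetical, and the multipliers, shift count and mask (MUL, SHIFT, MASK) are assumptions of this sketch, chosen so that each 8-bit sample lands in its 10-bit field scaled by 4. Clipping is left out.

```python
# Shuffle table A from the message (-1 marks a don't-care byte).
A = [
    32, 0, 48, -1, 1, 33, 2, -1, 49, 3, 34, -1, 4, 50, 5, -1,
    35, 6, 51, -1, 7, 36, 8, -1, 52, 9, 37, -1, 10, 53, 11, -1,
    38, 12, 54, -1, 13, 39, 14, -1, 55, 15, 40, -1, 16, 56, 17, -1,
    41, 18, 57, -1, 19, 42, 20, -1, 58, 21, 43, -1, 22, 59, 23, -1,
]

# Assumed constants for this model, not taken verbatim from the message.
MUL = (4, 0, 64, 0)   # pmaddubsw multipliers, one per byte of a dword
SHIFT = 4             # pslld count for the middle sample
MASK = 0x000FF000     # bits 12-19, where the middle sample ends up


def model_kernel(y, u, v):
    """Emulate one iteration: 32 Y, 16 U, 16 V bytes in, 16 dwords out."""
    # movu ym1 + 2x vinserti32x4: Y in bytes 0..31, U in 32..47, V in 48..63.
    reg = list(y[:32]) + list(u[:16]) + list(v[:16])
    # vpermb uses only the low 6 bits of each index, so -1 reads byte 63.
    perm = [reg[i & 63] for i in A]
    out = []
    for d in range(16):
        b = perm[4 * d:4 * d + 4]
        # pmaddubsw: unsigned byte times signed byte, adjacent pairs summed.
        lo = b[0] * MUL[0] + b[1] * MUL[1]
        hi = b[2] * MUL[2] + b[3] * MUL[3]
        outer = lo | (hi << 16)
        # pslld on the permuted dword moves the middle byte into its field.
        dword = b[0] | (b[1] << 8) | (b[2] << 16) | (b[3] << 24)
        shifted = (dword << SHIFT) & 0xFFFFFFFF
        # Bitwise select: masked bits from `shifted`, the rest from `outer`.
        out.append((shifted & MASK) | (outer & ~MASK & 0xFFFFFFFF))
    return out


def reference_v210(y, u, v):
    """Pack 24 Y, 12 U, 12 V samples as U Y V Y ..., three per dword."""
    samples = []
    for i in range(12):
        samples += [u[i], y[2 * i], v[i], y[2 * i + 1]]
    return [
        (samples[k] << 2) | (samples[k + 1] << 12) | (samples[k + 2] << 22)
        for k in range(0, 48, 3)
    ]
```

Only 24 of the 32 luma bytes and 12 of the 16 bytes of each chroma plane are consumed per iteration. The rest are loaded but never selected by the table, apart from byte 63, which the -1 entries read and the zero multiplier and the mask then discard.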
I guess it could also be scaled to ymm if you're a big Skylake fan :P
(in which case you'd probably want to reorder the shuffle indices so
that chroma comes first, i.e. movq [u] + movhps [v] + vinserti32x4
[y])
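
A possible way to derive the reordered indices for that ymm variant is to generate the table from the three plane offsets instead of writing it by hand. The sketch below is only my reading of the parenthetical: it assumes movq puts 8 U bytes at 0..7, movhps puts 8 V bytes at 8..15, and vinserti32x4 puts 16 Y bytes at 16..31. The helper name and the YMM_TABLE layout are hypothetical. With the zmm offsets it reproduces table A.

```python
def v210_shuffle_indices(groups, y_base, u_base, v_base):
    """Byte indices for `groups` blocks of 4 dwords (6 Y, 3 U, 3 V each)."""
    table = []
    for g in range(groups):
        y, u, v = y_base + 6 * g, u_base + 3 * g, v_base + 3 * g
        table += [
            u,     y,     v,     -1,   # U0 Y0 V0
            y + 1, u + 1, y + 2, -1,   # Y1 U1 Y2
            v + 1, y + 3, u + 2, -1,   # V1 Y3 U2
            y + 4, v + 2, y + 5, -1,   # Y4 V2 Y5
        ]
    return table


# Layout of the zmm code in the message: Y at 0, U at 32, V at 48.
ZMM_TABLE = v210_shuffle_indices(4, y_base=0, u_base=32, v_base=48)

# Assumed chroma-first ymm layout: U at 0, V at 8, Y at 16.
YMM_TABLE = v210_shuffle_indices(2, y_base=16, u_base=0, v_base=8)
```

A ymm version would then write 8 dwords (12 luma samples) per iteration instead of 16. A byte-granular vpermb on ymm registers still needs AVX-512 VBMI together with AVX-512VL.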