[FFmpeg-devel] [PATCH 1/2] lavc/aarch64: new 8-bit hevc 16x16 idct
Martin Storsjö
martin at martin.st
Tue Aug 9 15:15:05 EEST 2022
On Thu, 23 Jun 2022, J. Dekker wrote:
> old:
> hevc_idct_16x16_8_c: 5366.2
> hevc_idct_16x16_8_neon: 1493.2
>
> new:
> hevc_idct_16x16_8_c: 5363.2
> hevc_idct_16x16_8_neon: 943.5
>
> Co-developed-by: Rafal Dabrowa <fatwildcat at gmail.com>
> Signed-off-by: J. Dekker <jdek at itanimul.li>
> ---
> libavcodec/aarch64/hevcdsp_idct_neon.S | 666 ++++++++++++++++++++++
> libavcodec/aarch64/hevcdsp_init_aarch64.c | 3 +-
> 2 files changed, 668 insertions(+), 1 deletion(-)
Throughout the new code, you have e.g. "add x5, x5, x4, lsl 2", where the
"lsl 2" breaks assembling with MS armasm64 - the '#' is missing on the
constant, i.e. it should be "lsl #2".
Also, for loads/stores, the same '#' is missing on the post-increment
immediate, e.g. "ld1 {v16.8h, v17.8h, v18.8h, v19.8h}, [x4], 64", and
likewise in "mov x4, 64". Apparently armasm64 doesn't have a problem with
those, but it would still be good to keep them consistent with the rest.
> This idct is significantly faster than the one we currently have, I
> suspect it's for a couple of reasons: 1) it's only written for 8bit
I don't see how that would change anything. Isn't the only thing that
differs between 8 and 10/12 bit in the existing implementation how much to
scale down at the end? All other intermediate values are the same size.
> 2) it's unrolled significantly more. It comes at a hefty cost of roughly
> 2.25x the object size.
If by that, you mean that the existing code works on 4 elements at a time
(i.e. mostly operating on .4h vectors), while this one operates on .8h
vectors, then yes, that's most probably the biggest source of the speedup
(even if a lot of the intermediate stuff happens in .4s vectors). The
existing code was ported from the 32 bit arm version (which probably had
to stick to 4 elements at a time due to register availability there),
while it probably could have been made double width when it was ported to
64 bit.
> I'm wondering if this idct is salvageable, or the one we have should just
> be improved instead.
Well, my honest opinion is:
- I don't quite understand the current code (I've worked on the
vp8/vp9/av1 IDCTs a fair amount, but the HEVC one seems to be different
enough that I don't recognize all the concepts here).
- The current implementation would need to be reformatted if kept
- The current implementation does have some rather clear high level
structure though, e.g. when looking at the idct_16x16 macro.
- The new implementation seems to be just one huge function. If you
know it by heart, it's probably fine, but it's really hard to get an
overview if you're not familiar with the HEVC IDCTs.
As for steps forward:
- Is it possible to widen the existing implementation to operate on 8
elements instead of 4? I think that would bring it up to par with this
one.
- Can you get some high level structure to the new implementation so that
it becomes understandable? Either a lot more comments explaining what's
happening and why, or splitting it up into smaller macros.
Some more comments on the code itself below:
> +// void ff_hevc_idct_16x16_8_neon(int16_t *coeffs, int col_limit)
> +function ff_hevc_idct_16x16_8_neon_new, export=1
> + sub sp, sp, 64
> + st1 {v8.16b, v9.16b, v10.16b, v11.16b}, [sp]
> + sub sp, sp, 32
> + st1 {v14.16b, v15.16b}, [sp]
> + mov x3, 0
> + mov x2, x0
> +1: mov x4, x2
> + mov x5, 32
> + ld1 {v16.8h}, [x4], x5
> + ld1 {v17.8h}, [x4], x5
> + ld1 {v18.8h}, [x4], x5
> + ld1 {v19.8h}, [x4], x5
> + ld1 {v20.8h}, [x4], x5
> + ld1 {v21.8h}, [x4], x5
> + ld1 {v22.8h}, [x4], x5
> + ld1 {v23.8h}, [x4], x5
> + ld1 {v24.8h}, [x4], x5
> + ld1 {v25.8h}, [x4], x5
> + ld1 {v26.8h}, [x4], x5
> + ld1 {v27.8h}, [x4], x5
> + ld1 {v28.8h}, [x4], x5
> + ld1 {v29.8h}, [x4], x5
> + ld1 {v30.8h}, [x4], x5
> + ld1 {v31.8h}, [x4], x5
> + cmp x1, 12
> + b.hs 5f
> + // limit2 below 16
> + bic x4, x1, 1
> + adr x5, .LimitMask
> + cbnz x3, 3f
> + // columns 0 .. 7 - cleanup of indexes 5 .. 7
> + ld1 {v0.8h}, [x5]
> + adr x5, 2f
> + add x5, x5, x4, lsl 2
> + add x5, x5, x4, lsl 1
> + br x5
> +2: and v17.16b, v17.16b, v0.16b // col_limit 0..1 -> limit2 == 4..5
> + and v19.16b, v19.16b, v0.16b
> + b 5f
I don't really know what these jump tables do or how they correspond to
things in the existing implementation - but I guess that can be one part
of what makes things faster too.
The existing implementation does a 16x16 transform by first doing 4
transforms, each over a 4x16 piece of data, transposing that, then doing
another 4x 4x16 transforms for the second pass. How does the new
implementation do it?
If I understand correctly, the old implementation didn't take col_limit
into account at all. Can that be one part of what makes things faster - or
is that only something that makes a difference in real use but not in
checkasm benchmarks?
// Martin