[FFmpeg-devel] [PATCH 1/2] lavc/aarch64: new 8-bit hevc 16x16 idct
Martin Storsjö
martin at martin.st
Tue Aug 9 15:15:05 EEST 2022
On Thu, 23 Jun 2022, J. Dekker wrote:
> old:
> hevc_idct_16x16_8_c: 5366.2
> hevc_idct_16x16_8_neon: 1493.2
>
> new:
> hevc_idct_16x16_8_c: 5363.2
> hevc_idct_16x16_8_neon: 943.5
>
> Co-developed-by: Rafal Dabrowa <fatwildcat at gmail.com>
> Signed-off-by: J. Dekker <jdek at itanimul.li>
> ---
> libavcodec/aarch64/hevcdsp_idct_neon.S | 666 ++++++++++++++++++++++
> libavcodec/aarch64/hevcdsp_init_aarch64.c | 3 +-
> 2 files changed, 668 insertions(+), 1 deletion(-)
Throughout the new code, you have e.g. "add x5, x5, x4, lsl 2", where the
"lsl 2" breaks assembling with MS armasm64 - the '#' is missing on the
constant, i.e. it should be "lsl #2".
Also, for loads/stores, the same '#' is missing on the post-increment
immediate, e.g. "ld1 {v16.8h, v17.8h, v18.8h, v19.8h}, [x4], 64", and
likewise in "mov x4, 64". Apparently armasm64 doesn't have a problem with
those, but it would still be good to keep them consistent with the rest.
> This idct is significantly faster than the one we currently have, I
> suspect it's for a couple of reasons: 1) it's only written for 8bit
I don't see how that would change anything. Isn't the only thing that
differs between 8 and 10/12 bit in the existing implementation how much to
scale down at the end? All other intermediate values are the same size.
> 2) it's unrolled significantly more. It comes at a hefty cost of roughly
> 2.25x the object size.
If by that, you mean that the existing code works on 4 elements at a time
(i.e. mostly operating on .4h vectors), while this one operates on .8h
vectors, then yes, that's most probably the biggest source of the speedup
(even if a lot of the intermediate stuff happens in .4s vectors). The
existing code was ported from the 32 bit arm version (which probably had
to stick to 4 elements at a time due to register availability there),
while it probably could have been made double width when it was ported to
64 bit.
> I'm wondering if this idct is salvageable, or the one we have should just
> be improved instead.
Well, my honest opinion is:
- I don't quite understand the current code (I've worked on the
vp8/vp9/av1 IDCTs a fair amount, but the HEVC one seems to be different
enough that I don't recognize all the concepts here).
- The current implementation would need to be reformatted if kept
- The current implementation does have some rather clear high level
structure though, e.g. when looking at the idct_16x16 macro.
- The new implementation seems to be just one huge function. If you
know it by heart, it's probably fine, but it's really hard to get an
overview if you're not familiar with the HEVC IDCTs.
As for steps forward:
- Is it possible to widen the existing implementation to operate on 8
elements instead of 4? I think that would bring it up to par with this
one.
- Can you get some high level structure to the new implementation so that
it becomes understandable? Either a lot more comments explaining what's
happening and why, or splitting it up into smaller macros.
Some more comments on the code itself below:
> +// void ff_hevc_idct_16x16_8_neon(int16_t *coeffs, int col_limit)
> +function ff_hevc_idct_16x16_8_neon_new, export=1
> + sub sp, sp, 64
> + st1 {v8.16b, v9.16b, v10.16b, v11.16b}, [sp]
> + sub sp, sp, 32
> + st1 {v14.16b, v15.16b}, [sp]
> + mov x3, 0
> + mov x2, x0
> +1: mov x4, x2
> + mov x5, 32
> + ld1 {v16.8h}, [x4], x5
> + ld1 {v17.8h}, [x4], x5
> + ld1 {v18.8h}, [x4], x5
> + ld1 {v19.8h}, [x4], x5
> + ld1 {v20.8h}, [x4], x5
> + ld1 {v21.8h}, [x4], x5
> + ld1 {v22.8h}, [x4], x5
> + ld1 {v23.8h}, [x4], x5
> + ld1 {v24.8h}, [x4], x5
> + ld1 {v25.8h}, [x4], x5
> + ld1 {v26.8h}, [x4], x5
> + ld1 {v27.8h}, [x4], x5
> + ld1 {v28.8h}, [x4], x5
> + ld1 {v29.8h}, [x4], x5
> + ld1 {v30.8h}, [x4], x5
> + ld1 {v31.8h}, [x4], x5
> + cmp x1, 12
> + b.hs 5f
> + // limit2 below 16
> + bic x4, x1, 1
> + adr x5, .LimitMask
> + cbnz x3, 3f
> + // columns 0 .. 7 - cleanup of indexes 5 .. 7
> + ld1 {v0.8h}, [x5]
> + adr x5, 2f
> + add x5, x5, x4, lsl 2
> + add x5, x5, x4, lsl 1
> + br x5
> +2: and v17.16b, v17.16b, v0.16b // col_limit 0..1 -> limit2 == 4..5
> + and v19.16b, v19.16b, v0.16b
> + b 5f
I don't really know what these jump tables do or how they correspond to
things in the existing implementation - but I guess that can be one part
of what makes things faster too.
The existing implementation does a 16x16 transform by first doing 4
transforms, each over a 4x16 piece of data, transposing that, then doing
another 4x 4x16 transforms for the second pass. How does the new
implementation do it?
If I understand correctly, the old implementation didn't take col_limit
into account at all. Can that be one part of what makes things faster - or
is that only something that makes a difference in real use but not in
checkasm benchmarks?
// Martin