[FFmpeg-devel] [PATCH 06/10] avcodec/vc1: Arm 32-bit NEON deblocking filter fast paths

Wed Mar 30 16:03:57 EEST 2022

On Fri, 25 Mar 2022, Ben Avison wrote:

> checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
> version can still outperform the NEON version in specific cases. The balance
> between different code paths is stream-dependent, but in practice the best
> case happens about 5% of the time, the worst case happens about 40% of the
> time, and the complexity of the remaining cases fall somewhere in between.
> Therefore, taking the average of the best and worst case timings is
> probably a conservative estimate of the degree by which the NEON code
> improves performance.
>
> vc1dsp.vc1_h_loop_filter4_bestcase_c: 19.0
> vc1dsp.vc1_h_loop_filter4_bestcase_neon: 48.5
> vc1dsp.vc1_h_loop_filter4_worstcase_c: 144.7
> vc1dsp.vc1_h_loop_filter4_worstcase_neon: 76.2
> vc1dsp.vc1_h_loop_filter8_bestcase_c: 41.0
> vc1dsp.vc1_h_loop_filter8_bestcase_neon: 75.0
> vc1dsp.vc1_h_loop_filter8_worstcase_c: 294.0
> vc1dsp.vc1_h_loop_filter8_worstcase_neon: 102.7
> vc1dsp.vc1_h_loop_filter16_bestcase_c: 54.7
> vc1dsp.vc1_h_loop_filter16_bestcase_neon: 130.0
> vc1dsp.vc1_h_loop_filter16_worstcase_c: 569.7
> vc1dsp.vc1_h_loop_filter16_worstcase_neon: 186.7
> vc1dsp.vc1_v_loop_filter4_bestcase_c: 20.2
> vc1dsp.vc1_v_loop_filter4_bestcase_neon: 47.2
> vc1dsp.vc1_v_loop_filter4_worstcase_c: 164.2
> vc1dsp.vc1_v_loop_filter4_worstcase_neon: 68.5
> vc1dsp.vc1_v_loop_filter8_bestcase_c: 43.5
> vc1dsp.vc1_v_loop_filter8_bestcase_neon: 55.2
> vc1dsp.vc1_v_loop_filter8_worstcase_c: 316.2
> vc1dsp.vc1_v_loop_filter8_worstcase_neon: 72.7
> vc1dsp.vc1_v_loop_filter16_bestcase_c: 62.2
> vc1dsp.vc1_v_loop_filter16_bestcase_neon: 103.7
> vc1dsp.vc1_v_loop_filter16_worstcase_c: 646.5
> vc1dsp.vc1_v_loop_filter16_worstcase_neon: 110.7
>
> Signed-off-by: Ben Avison <bavison at riscosopen.org>
> ---
> libavcodec/arm/vc1dsp_init_neon.c |  14 +
> libavcodec/arm/vc1dsp_neon.S      | 643 ++++++++++++++++++++++++++++++
> 2 files changed, 657 insertions(+)

> +@ VC-1 in-loop deblocking filter for 8 pixel pairs at boundary of vertically-neighbouring blocks
> +@ On entry:
> +@   r0 -> top-left pel of lower block
> +@   r1 = row stride, bytes
> +@   r2 = PQUANT bitstream parameter
> +function ff_vc1_v_loop_filter8_neon, export=1
> +        sub             r3, r0, r1, lsl #2
> +        vldr            d0, .Lcoeffs
> +        vld1.32         {d1}, [r0], r1          @ P5
> +        vld1.32         {d2}, [r3], r1          @ P1
> +        vld1.32         {d3}, [r3], r1          @ P2
> +        vld1.32         {d4}, [r0], r1          @ P6
> +        vld1.32         {d5}, [r3], r1          @ P3
> +        vld1.32         {d6}, [r0], r1          @ P7

Oh btw - I presume these loads can be done with alignment? And same for 
some of the stores too? At least for some older cores, the alignment 
specifier helps a lot - so for 32 bit assembly, I try to add as much 
alignment specifiers as possible.

// Martin