[FFmpeg-devel] [PATCH v3 5/5] avcodec/ac3: Implement sum_square_butterfly_float for aarch64 NEON

Thu Apr 4 16:01:01 EEST 2024

On Tue, 2 Apr 2024, Geoff Hill wrote:

> Signed-off-by: Geoff Hill <geoff at geoffhill.org>
> ---
> libavcodec/aarch64/ac3dsp_init_aarch64.c |  5 ++++
> libavcodec/aarch64/ac3dsp_neon.S         | 35 ++++++++++++++++++++++++
> tests/checkasm/ac3dsp.c                  | 26 ++++++++++++++++++
> 3 files changed, 66 insertions(+)
>
> diff --git a/libavcodec/aarch64/ac3dsp_neon.S b/libavcodec/aarch64/ac3dsp_neon.S
> index fa8fcf2e47..4a78ec0b2a 100644
> --- a/libavcodec/aarch64/ac3dsp_neon.S
> +++ b/libavcodec/aarch64/ac3dsp_neon.S
> @@ -88,3 +88,38 @@ function ff_ac3_sum_square_butterfly_int32_neon, export=1
>         st1         {v0.1d-v3.1d}, [x0]
> 1:      ret
> endfunc
> +
> +function ff_ac3_sum_square_butterfly_float_neon, export=1
> +        cbz         w3, 1f
> +        movi        v0.4s, #0
> +        movi        v1.4s, #0
> +        movi        v2.4s, #0
> +        movi        v3.4s, #0
> +0:      ld1         {v30.4s}, [x1], #16
> +        ld1         {v31.4s}, [x2], #16
> +        fadd        v16.4s, v30.4s, v31.4s
> +        fsub        v17.4s, v30.4s, v31.4s
> +        fmul        v30.4s, v30.4s, v30.4s
> +        fadd        v0.4s, v0.4s, v30.4s

The arm version uses vmla here instead of separate vmul+vadd; is there
any reason why we can't use fmla here?
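
To make the question concrete, here is a rough scalar C sketch of what I understand the loop to accumulate. The function name is made up for illustration, and the mapping of sum[1..3] to the remaining accumulators is an assumption, since only the first accumulation is quoted above. Each "sum[k] += x * x" line is one multiply-accumulate, which the patch currently writes as an fmul followed by an fadd.

```c
#include <stddef.h>

/* Hypothetical scalar sketch, not the FFmpeg source: the values the NEON
 * loop is assumed to accumulate for each pair of input coefficients. */
static void sum_square_butterfly_sketch(float sum[4], const float *coef0,
                                        const float *coef1, size_t len)
{
    sum[0] = sum[1] = sum[2] = sum[3] = 0.0f;
    for (size_t i = 0; i < len; i++) {
        float lt = coef0[i];
        float rt = coef1[i];
        float md = lt + rt;   /* fadd v16.4s, v30.4s, v31.4s */
        float sd = lt - rt;   /* fsub v17.4s, v30.4s, v31.4s */

        /* Each line is a multiply-accumulate: fmul+fadd in the patch,
         * or a single fmla per accumulator register. */
        sum[0] += lt * lt;
        sum[1] += rt * rt;    /* assumed order of the remaining sums */
        sum[2] += md * md;
        sum[3] += sd * sd;
    }
}
```

Note that fmla on aarch64 is a fused multiply-add with a single rounding, so the low bits can differ from the separate fmul+fadd sequence; that should only matter if the checkasm test compares the sums exactly rather than with a tolerance.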

// Martin