[FFmpeg-devel] [PATCH 08/10] avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths

Martin Storsjö martin at martin.st
Wed Mar 30 17:14:37 EEST 2022


On Fri, 25 Mar 2022, Ben Avison wrote:

> checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.
>
> idctdsp.add_pixels_clamped_c: 323.0
> idctdsp.add_pixels_clamped_neon: 41.5
> idctdsp.put_pixels_clamped_c: 243.0
> idctdsp.put_pixels_clamped_neon: 30.0
> idctdsp.put_signed_pixels_clamped_c: 225.7
> idctdsp.put_signed_pixels_clamped_neon: 37.7
>
> Signed-off-by: Ben Avison <bavison at riscosopen.org>
> ---
> libavcodec/aarch64/Makefile               |   3 +-
> libavcodec/aarch64/idctdsp_init_aarch64.c |  26 +++--
> libavcodec/aarch64/idctdsp_neon.S         | 130 ++++++++++++++++++++++
> 3 files changed, 150 insertions(+), 9 deletions(-)
> create mode 100644 libavcodec/aarch64/idctdsp_neon.S

Generally LGTM

> +// Clamp 16-bit signed block coefficients to signed 8-bit (biased by 128)
> +// On entry:
> +//   x0 -> array of 64x 16-bit coefficients
> +//   x1 -> 8-bit results
> +//   x2 = row stride for results, bytes
> +function ff_put_signed_pixels_clamped_neon, export=1
> +        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
> +        movi            v4.8b, #128
> +        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0]
> +        sqxtn           v0.8b, v0.8h
> +        sqxtn           v1.8b, v1.8h
> +        sqxtn           v2.8b, v2.8h
> +        sqxtn           v3.8b, v3.8h
> +        sqxtn           v5.8b, v16.8h
> +        add             v0.8b, v0.8b, v4.8b

Here you could save 4 add instructions with sqxtn2 and adding .16b 
vectors, but I'm not sure if it's wortwhile. (It reduces the checkasm 
numbers by 0.7 for Cortex A72, by 0.3 for A73, but increases the runtime 
by 1.0 on A53.) Stranegely enough, I get much smaller numbers on my A72 
than you got. I get these:

idctdsp.add_pixels_clamped_c: 306.7
idctdsp.add_pixels_clamped_neon: 25.7
idctdsp.put_pixels_clamped_c: 217.2
idctdsp.put_pixels_clamped_neon: 15.2
idctdsp.put_signed_pixels_clamped_c: 216.7
idctdsp.put_signed_pixels_clamped_neon: 19.2

(The _c numbers are of course highly compiler dependent, but the assembly 
numbers should generally match quite closely. And AFAIK they should be 
measured in clock cycles, so CPU frequency shouldn't really play a role 
either.)

// Martin



More information about the ffmpeg-devel mailing list