[FFmpeg-devel] [PATCH 1/4] lavc/aarch64: add HEVC add_residual NEON

Sun Jan 17 00:54:44 EET 2021

On Thu, 7 Jan 2021, Josh Dekker wrote:

> Signed-off-by: Josh Dekker <josh at itanimul.li>
> ---
> libavcodec/aarch64/Makefile               |   2 +
> libavcodec/aarch64/hevcdsp_add_res_neon.S | 298 ++++++++++++++++++++++
> libavcodec/aarch64/hevcdsp_init.c         |  59 +++++
> libavcodec/hevcdsp.c                      |   2 +
> libavcodec/hevcdsp.h                      |   1 +
> 5 files changed, 362 insertions(+)
> create mode 100644 libavcodec/aarch64/hevcdsp_add_res_neon.S
> create mode 100644 libavcodec/aarch64/hevcdsp_init.c

This one is pretty much equivalent to Reimar's patch. As his one goes on 
top of the ported IDCT, I think I'd prefer his version of it. But in any 
case, some comments below:

> diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
> index f6434e40da..4bdd554e7e 100644
> --- a/libavcodec/aarch64/Makefile
> +++ b/libavcodec/aarch64/Makefile
> @@ -17,6 +17,7 @@ OBJS-$(CONFIG_VP8DSP)                   += aarch64/vp8dsp_init_aarch64.o
> OBJS-$(CONFIG_AAC_DECODER)              += aarch64/aacpsdsp_init_aarch64.o \
>                                            aarch64/sbrdsp_init_aarch64.o
> OBJS-$(CONFIG_DCA_DECODER)              += aarch64/synth_filter_init.o
> +OBJS-$(CONFIG_HEVC_DECODER)             += aarch64/hevcdsp_init.o
> OBJS-$(CONFIG_OPUS_DECODER)             += aarch64/opusdsp_init.o
> OBJS-$(CONFIG_RV40_DECODER)             += aarch64/rv40dsp_init_aarch64.o
> OBJS-$(CONFIG_VC1DSP)                   += aarch64/vc1dsp_init_aarch64.o
> @@ -53,6 +54,7 @@ NEON-OBJS-$(CONFIG_VP8DSP)              += aarch64/vp8dsp_neon.o
> # decoders/encoders
> NEON-OBJS-$(CONFIG_AAC_DECODER)         += aarch64/aacpsdsp_neon.o
> NEON-OBJS-$(CONFIG_DCA_DECODER)         += aarch64/synth_filter_neon.o
> +NEON-OBJS-$(CONFIG_HEVC_DECODER)        += aarch64/hevcdsp_add_res_neon.o
> NEON-OBJS-$(CONFIG_OPUS_DECODER)        += aarch64/opusdsp_neon.o
> NEON-OBJS-$(CONFIG_VORBIS_DECODER)      += aarch64/vorbisdsp_neon.o
> NEON-OBJS-$(CONFIG_VP9_DECODER)         += aarch64/vp9itxfm_16bpp_neon.o       \
> diff --git a/libavcodec/aarch64/hevcdsp_add_res_neon.S b/libavcodec/aarch64/hevcdsp_add_res_neon.S
> new file mode 100644
> index 0000000000..4005366192
> --- /dev/null
> +++ b/libavcodec/aarch64/hevcdsp_add_res_neon.S
> @@ -0,0 +1,298 @@
> +/* -*-arm64-*-
> + *
> + * AArch64 NEON optimised add residual functions for HEVC decoding
> + *
> + * Copyright (c) 2020 Josh Dekker <josh at itanimul.li>

I believe this one is at least a bit inspired by the arm version, right? 
In that case it's probably customary to bring the original copyright 
along.

> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +.macro clip10 in1, in2, c1, c2
> +    smax \in1, \in1, \c1
> +    smax \in2, \in2, \c1
> +    smin \in1, \in1, \c2
> +    smin \in2, \in2, \c2
> +.endm
> +
> +function ff_hevc_add_residual_4x4_8_neon, export=1
> +    mov x3, x0

Please align the instructions and operands like in the existing assembly. 
Also, in addition to aligning those two columns, in aarch64 assembly I 
write myself, I also try to align the individual operands; for 
instructions that only take GPRs, I'd write things as "x0, x1, x2", to 
keep operands aligned the same for cases with registers >= x10, and for 
instructions that take vector registers, align everything to line up for 
the longest case register name, e.g. v31.16b. For SIMD loads, stores and 
other things, where things don't generally line up, I just try to make the 
code look pretty and consistent.

Also, as a matter of taste, I tend to write the lane specifications with 
lower case letters, i.e. .16b instead of 16B.

> +    ld1 {v0.S}[0], [x3], x2
> +    ld1 {v0.S}[1], [x3], x2
> +    ld1 {v1.S}[0], [x3], x2
> +    ld1 {v1.S}[1], [x3], x2
> +    ld1 { v2.8H-v3.8H}, [x1]
> +    ushll v4.8H, v0.8B, #0
> +    ushll v5.8H, v1.8B, #0

ushll #0 is uxtl

> +    add v6.8H, v4.8H, v2.8H
> +    add v7.8H, v5.8H, v3.8H

The arm version (and Reimar's) does saturated addition here, i.e. sqadd.

There's a bunch of other minor tuning one could suggest here, but I guess 
it applies equally to the existing arm versions (and it's unsure whether 
it does provide any measurable benefit at all), so I'll refrain from 
suggesting it here.

// Martin