[FFmpeg-devel] [PATCH] libavcodec/riscv:add RVV optimized for idct_32x32_8:

Tue Apr 15 18:02:52 EEST 2025

Hi,

On Tuesday, 15 April 2025, at 10:34:24 Eastern European Summer Time,
daichengrong at iscas.ac.cn wrote:
> From: daichengrong <daichengrong at iscas.ac.cn>
> 
>      riscv/hevcdsp_idct_rvv: Optimize idct_32x32_8
> 
>      On Banana PI F3:
> 
>      hevc_idct_32x32_8_c:                                119579.3 ( 1.00x)
>      hevc_idct_32x32_8_rvv_i64:                           51254.4 ( 2.33x)
> 
> Signed-off-by: daichengrong <daichengrong at iscas.ac.cn>
> ---
>  libavcodec/riscv/Makefile           |    1 +
>  libavcodec/riscv/hevcdsp_idct_rvv.S | 1042 +++++++++++++++++++++++++++
>  libavcodec/riscv/hevcdsp_init.c     |   52 +-
>  3 files changed, 1075 insertions(+), 20 deletions(-)
>  create mode 100644 libavcodec/riscv/hevcdsp_idct_rvv.S
> 
> diff --git a/libavcodec/riscv/Makefile b/libavcodec/riscv/Makefile
> index a80d2fa2e7..dfc33afbee 100644
> --- a/libavcodec/riscv/Makefile
> +++ b/libavcodec/riscv/Makefile
> @@ -36,6 +36,7 @@ RVV-OBJS-$(CONFIG_H264DSP) += riscv/h264addpx_rvv.o riscv/h264dsp_rvv.o \
>  OBJS-$(CONFIG_H264QPEL) += riscv/h264qpel_init.o
>  RVV-OBJS-$(CONFIG_H264QPEL) += riscv/h264qpel_rvv.o
>  OBJS-$(CONFIG_HEVC_DECODER) += riscv/hevcdsp_init.o
> +OBJS-$(CONFIG_HEVC_DECODER) += riscv/hevcdsp_idct_rvv.o
>  RVV-OBJS-$(CONFIG_HEVC_DECODER)  += riscv/h26x/h2656_inter_rvv.o
>  OBJS-$(CONFIG_HUFFYUV_DECODER) += riscv/huffyuvdsp_init.o
>  RVV-OBJS-$(CONFIG_HUFFYUV_DECODER) += riscv/huffyuvdsp_rvv.o
> diff --git a/libavcodec/riscv/hevcdsp_idct_rvv.S b/libavcodec/riscv/hevcdsp_idct_rvv.S
> new file mode 100644
> index 0000000000..f8dd2e5bf4
> --- /dev/null
> +++ b/libavcodec/riscv/hevcdsp_idct_rvv.S
> @@ -0,0 +1,1042 @@
> +/*
> + * Copyright (c) 2025 Institue of Software Chinese Academy of Sciences (ISCAS).
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/riscv/asm.S"
> +
> +const trans, align=4
> +        .2byte          64, 83, 64, 36
> +        .2byte          89, 75, 50, 18
> +        .2byte          90, 87, 80, 70
> +        .2byte          57, 43, 25, 9
> +        .2byte          90, 90, 88, 85
> +        .2byte          82, 78, 73, 67
> +        .2byte          61, 54, 46, 38
> +        .2byte          31, 22, 13, 4
> +endconst
> +
> +.macro sum_sub out, in, c, op, p
> +        vsetivli	t0, 4, e16, mf2, tu, ma

I think that you don't need t0 here? Ditto below.

> +  .ifc \op, +
> +        .ifc \p, 2
> +                vslidedown.vi	v8, \in, 4
> +                vwmacc.vx	\out, \c, v8
> +        .else
> +                vwmacc.vx	\out, \c, \in
> +        .endif
> +  .else
> +        .ifc \p, 2
> +                neg	\c, \c
> +                vslidedown.vi	v8, \in, 4
> +                vwmacc.vx	\out, \c, v8
> +                neg	\c, \c
> +        .else
> +                neg	\c, \c
> +                vwmacc.vx	\out, \c, \in
> +                neg	\c, \c

The typical problem with complex nested macros like this is, you easily end up 
assembling very inefficient code.

For instance, this keeps vainly flipping the sign of the same value over and 
over only to allow this macro to exist.
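
To illustrate the point, here is a minimal scalar sketch in C, not the patch's code. `wmacc` models what `vwmacc.vx` computes per lane; `sub_flipping` mirrors the shape of the quoted macro's subtract path; `sub_prenegated` is a hypothetical alternative in which the caller negates the coefficient once and reuses it. All function names and the test data are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

/* Scalar model of vwmacc.vx: acc[i] += c * in[i], widened to 32 bits. */
static void wmacc(int32_t acc[LANES], int c, const int16_t in[LANES])
{
    for (int i = 0; i < LANES; i++)
        acc[i] += (int32_t)c * in[i];
}

/* Shape of the quoted macro's '-' path: negate, accumulate, negate back. */
static void sub_flipping(int32_t acc[LANES], int *c, const int16_t in[LANES])
{
    *c = -*c;
    wmacc(acc, *c, in);
    *c = -*c;
}

/* Hypothetical alternative: the caller negates once and passes neg_c. */
static void sub_prenegated(int32_t acc[LANES], int neg_c,
                           const int16_t in[LANES])
{
    wmacc(acc, neg_c, in);
}

/* Returns 1 if both variants yield identical accumulators after `rounds`
 * uses of the same coefficient, 0 otherwise. */
static int variants_agree(int c, int rounds)
{
    const int16_t in[LANES] = { 3, -7, 120, -255 };
    int32_t a[LANES] = { 0 }, b[LANES] = { 0 };
    const int neg_c = -c;               /* one negation for the whole run */

    assert(rounds >= 0);
    for (int r = 0; r < rounds; r++) {
        sub_flipping(a, &c, in);        /* two negations per use */
        sub_prenegated(b, neg_c, in);   /* no negation per use */
    }
    for (int i = 0; i < LANES; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Both variants accumulate the same values; the second one simply moves the negation out of the repeated path, at the cost of holding the negated coefficient in a second register.
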

> +        .endif
> +  .endif
> +.endm
> +
> +.macro add_member32 in, t0, index0, t1, index1, t2, index2, t3, index3, op0, op1, op2, op3, p
> +        vsetivli	t0, 1, e16, m1, tu, ma
> +        vslidedown.vi	v12, \t0, \index0
> +        vmv.x.s	s2, v12
> +        vslidedown.vi	v12, \t1, \index1
> +        vmv.x.s	s3, v12
> +        vslidedown.vi	v12, \t2, \index2
> +        vmv.x.s	s4, v12
> +        vslidedown.vi	v12, \t3, \index3
> +        vmv.x.s	s5, v12

This is a very inefficient way to extract 4 scalars out of a vector. I'm not 
familiar with the specific algorithm overall, but I would expect that this can 
be avoided. At least, I have never seen a need for such a construct to implement 
a DCT. And we already have quite a few DCTs in the FFmpeg RISC-V port.

Admittedly for smaller matrices than 32x32. But typically larger matrices are 
not as big a deal with RVV as they are on Arm or x86. RVV requires spilling 
the intermediate values from the first DCT dimension to memory, to transpose 
them before the second DCT dimension. That being the case, the penalty for not 
fitting the entire matrix in the vector register bank is comparatively much 
smaller.

And yes, please don't use slides for transposition. It's horribly complicated 
and almost certainly slower than spilling to stack and using strided loads/
stores, for any non-trivial matrix size.
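
As a rough scalar sketch in C of the data movement meant here (not RVV code, and not the patch's algorithm): each row produced by the first pass is written into a scratch buffer with a stride of one matrix row, which is what a strided store such as `vsse16.v` does, so the second pass can read transposed data with plain unit-stride loads. `pass_1d` is a hypothetical placeholder for the real 1-D transform, and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 32

/* Models a strided vector store: element i goes to dst[i * stride].
 * The stride is counted in elements here, not in bytes. */
static void store_strided(int16_t *dst, size_t stride,
                          const int16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i * stride] = src[i];
}

/* Hypothetical placeholder for the 1-D transform: doubles each value. */
static void pass_1d(int16_t out[N], const int16_t in[N])
{
    for (size_t i = 0; i < N; i++)
        out[i] = (int16_t)(2 * in[i]);
}

/* First dimension: transform row r, then store it as column r of scratch. */
static void first_pass_transposed(int16_t scratch[N * N],
                                  const int16_t coeffs[N * N])
{
    int16_t row[N];

    for (size_t r = 0; r < N; r++) {
        pass_1d(row, coeffs + r * N);
        store_strided(scratch + r, N, row, N);
    }
}

/* Returns 1 if scratch holds the transpose of the first-pass output. */
static int transposed_ok(void)
{
    static int16_t in[N * N], scratch[N * N];

    for (size_t i = 0; i < N * N; i++)
        in[i] = (int16_t)i;             /* at most 1023, so 2 * in[i] fits */
    first_pass_transposed(scratch, in);

    for (size_t r = 0; r < N; r++)
        for (size_t c = 0; c < N; c++)
            if (scratch[c * N + r] != 2 * in[r * N + c])
                return 0;
    assert(scratch[1 * N + 0] == 2);    /* in[0][1] == 1 lands at [1][0] */
    return 1;
}
```

The transposition falls out of the store addressing alone, with no slides and no in-register shuffling; the second pass would then run over the rows of `scratch`.
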

-- 
Rémi Denis-Courmont
Tapiolan uusi kaupunki, Uudenmaan entinen Suomen tasavalta