[FFmpeg-devel] [PATCH] libavcodec/riscv:add RVV optimized for idct_32x32_8:
Rémi Denis-Courmont
remi at remlab.net
Tue Apr 15 18:02:52 EEST 2025
Hi,
On Tuesday, 15 April 2025, 10:34:24 Eastern European Summer Time,
daichengrong at iscas.ac.cn wrote:
> From: daichengrong <daichengrong at iscas.ac.cn>
>
> riscv/hevcdsp_idct_rvv: Optimize idct_32x32_8
>
> On Banana PI F3:
>
> hevc_idct_32x32_8_c: 119579.3 ( 1.00x)
> hevc_idct_32x32_8_rvv_i64: 51254.4 ( 2.33x)
>
> Signed-off-by: daichengrong <daichengrong at iscas.ac.cn>
> ---
> libavcodec/riscv/Makefile | 1 +
> libavcodec/riscv/hevcdsp_idct_rvv.S | 1042 +++++++++++++++++++++++++++
> libavcodec/riscv/hevcdsp_init.c | 52 +-
> 3 files changed, 1075 insertions(+), 20 deletions(-)
> create mode 100644 libavcodec/riscv/hevcdsp_idct_rvv.S
>
> diff --git a/libavcodec/riscv/Makefile b/libavcodec/riscv/Makefile
> index a80d2fa2e7..dfc33afbee 100644
> --- a/libavcodec/riscv/Makefile
> +++ b/libavcodec/riscv/Makefile
> @@ -36,6 +36,7 @@ RVV-OBJS-$(CONFIG_H264DSP) += riscv/h264addpx_rvv.o riscv/h264dsp_rvv.o \
> OBJS-$(CONFIG_H264QPEL) += riscv/h264qpel_init.o
> RVV-OBJS-$(CONFIG_H264QPEL) += riscv/h264qpel_rvv.o
> OBJS-$(CONFIG_HEVC_DECODER) += riscv/hevcdsp_init.o
> +OBJS-$(CONFIG_HEVC_DECODER) += riscv/hevcdsp_idct_rvv.o
> RVV-OBJS-$(CONFIG_HEVC_DECODER) += riscv/h26x/h2656_inter_rvv.o
> OBJS-$(CONFIG_HUFFYUV_DECODER) += riscv/huffyuvdsp_init.o
> RVV-OBJS-$(CONFIG_HUFFYUV_DECODER) += riscv/huffyuvdsp_rvv.o
> diff --git a/libavcodec/riscv/hevcdsp_idct_rvv.S b/libavcodec/riscv/hevcdsp_idct_rvv.S
> new file mode 100644
> index 0000000000..f8dd2e5bf4
> --- /dev/null
> +++ b/libavcodec/riscv/hevcdsp_idct_rvv.S
> @@ -0,0 +1,1042 @@
> +/*
> + * Copyright (c) 2025 Institue of Software Chinese Academy of Sciences (ISCAS).
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/riscv/asm.S"
> +
> +const trans, align=4
> + .2byte 64, 83, 64, 36
> + .2byte 89, 75, 50, 18
> + .2byte 90, 87, 80, 70
> + .2byte 57, 43, 25, 9
> + .2byte 90, 90, 88, 85
> + .2byte 82, 78, 73, 67
> + .2byte 61, 54, 46, 38
> + .2byte 31, 22, 13, 4
> +endconst
> +
> +.macro sum_sub out, in, c, op, p
> + vsetivli t0, 4, e16, mf2, tu, ma
I think that you don't need t0 here? If the returned vector length is not
used, the destination can simply be zero. Ditto below.
> + .ifc \op, +
> + .ifc \p, 2
> + vslidedown.vi v8, \in, 4
> + vwmacc.vx \out, \c, v8
> + .else
> + vwmacc.vx \out, \c, \in
> + .endif
> + .else
> + .ifc \p, 2
> + neg \c, \c
> + vslidedown.vi v8, \in, 4
> + vwmacc.vx \out, \c, v8
> + neg \c, \c
> + .else
> + neg \c, \c
> + vwmacc.vx \out, \c, \in
> + neg \c, \c
The typical problem with complex nested macros like this is that you easily
end up assembling very inefficient code.
For instance, this keeps needlessly flipping the sign of the same value back
and forth, only so that this macro can exist.
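To make the cost concrete, here is a minimal scalar C model of the pattern.
It is only a sketch, not the patch's code: wmacc(), accumulate_toggle(),
accumulate_hoisted() and LANES are hypothetical names, and the plain loop
merely stands in for vwmacc.vx. It compares negating the coefficient around
every subtraction with negating it once up front.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };

/* One widening multiply-accumulate over LANES elements,
 * standing in for vwmacc.vx. */
static void wmacc(int32_t *acc, int32_t c, const int16_t *in)
{
    for (int i = 0; i < LANES; i++)
        acc[i] += c * in[i];
}

/* As in the quoted macro: negate the coefficient before and after each
 * subtraction. Returns the number of scalar negations executed. */
static int accumulate_toggle(int32_t *acc, int16_t coef, const int16_t *in,
                             const char *ops, size_t n)
{
    int32_t c = coef;
    int negations = 0;

    for (size_t k = 0; k < n; k++) {
        if (ops[k] == '-') { c = -c; negations++; }
        wmacc(acc, c, in + k * LANES);
        if (ops[k] == '-') { c = -c; negations++; }
    }
    return negations;
}

/* Alternative: negate once up front, then pick the right copy per step. */
static int accumulate_hoisted(int32_t *acc, int16_t coef, const int16_t *in,
                              const char *ops, size_t n)
{
    const int32_t pos = coef, neg = -coef;

    for (size_t k = 0; k < n; k++)
        wmacc(acc, ops[k] == '-' ? neg : pos, in + k * LANES);
    return 1;
}
```

Both functions produce the same accumulators. In assembly, the choice between
the two copies would be made at assembly time, at the cost of one extra scalar
register per coefficient.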
> + .endif
> + .endif
> +.endm
> +
> +.macro add_member32 in, t0, index0, t1, index1, t2, index2, t3, index3, op0, op1, op2, op3, p
> + vsetivli t0, 1, e16, m1, tu, ma
> + vslidedown.vi v12, \t0, \index0
> + vmv.x.s s2, v12
> + vslidedown.vi v12, \t1, \index1
> + vmv.x.s s3, v12
> + vslidedown.vi v12, \t2, \index2
> + vmv.x.s s4, v12
> + vslidedown.vi v12, \t3, \index3
> + vmv.x.s s5, v12
This is a very inefficient way to extract 4 scalars out of a vector. I'm not
familiar with the specific algorithm overall, but I would expect that this can
be avoided. At least, I have never seen the need for such a construct to
implement a DCT, and we already have quite a few DCTs in the FFmpeg RISC-V
port.
Admittedly, those are for smaller matrices than 32x32. But larger matrices are
typically not as big a deal with RVV as they are on Arm or x86: RVV requires
spilling the intermediate values from the first DCT dimension to memory, in
order to transpose them before the second DCT dimension. That being the case,
the penalty for not fitting the entire matrix in the vector register bank is
comparatively much smaller.
And yes, please don't use slides for transposition. It's horribly complicated
and almost certainly slower than spilling to stack and using strided loads/
stores, for any non-trivial matrix size.
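As a rough illustration of transposing through memory, here is a minimal
scalar C model. It is a sketch under assumptions, not FFmpeg code:
transform_columns(), store_strided() and transform_2d() are hypothetical
names, the 4x4 matrix is a small stand-in for the 32x32 case, and the
rounding and shifting between passes are left out. Each row of N contiguous
values plays the role of one vector register, and store_strided() models a
strided vector store such as vsse32.v.

```c
#include <stddef.h>
#include <stdint.h>

enum { N = 4 };

/* Stand-in 4-point transform matrix (illustration only). */
static const int16_t M[N][N] = {
    { 64,  64,  64,  64 },
    { 83,  36, -36, -83 },
    { 64, -64, -64,  64 },
    { 36, -83,  83, -36 },
};

/* 1-D transform along columns: out_row[i] = sum_j M[i][j] * in_row[j],
 * computed element-wise across whole rows. */
static void transform_columns(int32_t *out, const int32_t *in)
{
    for (int i = 0; i < N; i++)
        for (int x = 0; x < N; x++) {
            int32_t sum = 0;
            for (int j = 0; j < N; j++)
                sum += M[i][j] * in[j * N + x];
            out[i * N + x] = sum;
        }
}

/* Models a strided vector store: element x of the row lands at
 * dst[x * stride], so a row is written out as a column. */
static void store_strided(int32_t *dst, size_t stride, const int32_t *row)
{
    for (int x = 0; x < N; x++)
        dst[x * stride] = row[x];
}

/* 2-D transform, out = M * in * M^T, transposing through memory. */
static void transform_2d(int32_t *out, const int32_t *in)
{
    int32_t tmp[N * N], scratch[N * N];

    transform_columns(tmp, in);              /* first dimension */
    for (int i = 0; i < N; i++)              /* spill, transposed */
        store_strided(scratch + i, N, tmp + i * N);
    transform_columns(tmp, scratch);         /* second dimension */
    for (int i = 0; i < N; i++)              /* transpose back */
        store_strided(out + i, N, tmp + i * N);
}
```

The only data movement between the two passes is one strided store per row,
and the same 1-D routine serves both dimensions. No slides or scalar
extraction are involved.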
--
Rémi Denis-Courmont
Tapiolan uusi kaupunki, Uudenmaan entinen Suomen tasavalta