[FFmpeg-devel] [PATCH v2 3/9] lavc/vp9dsp: R-V V ipred hor

flow gg hlefthleft at gmail.com
Tue May 7 21:42:51 EEST 2024


> Do you gain much by unrolling all the way to 16x? Given that you have the
> counter value already in t0, it should not make much difference to just
unroll
> 2x or maybe 4x and then loop.

I chose this simple method because I think the effect is about the same..
Do I need to change it?

> It might also be faster to use lhu or lwu and shift to reduce scalar
loads, at
least if the vector is suitably aligned.

I just tested ff_h_16x16_rvv, lbu version is faster (lbu * 16 version:
80.2, lwu * 4 version: 117.2).


Rémi Denis-Courmont <remi at remlab.net> 于2024年5月8日周三 00:10写道:

> Le tiistaina 7. toukokuuta 2024, 10.36.07 EEST uk7b at foxmail.com a écrit :
> > From: sunyuechi <sunyuechi at iscas.ac.cn>
> >
> > C908:
> > vp9_hor_8x8_8bpp_c: 74.7
> > vp9_hor_8x8_8bpp_rvv_i32: 35.7
> > vp9_hor_16x16_8bpp_c: 175.5
> > vp9_hor_16x16_8bpp_rvv_i32: 80.2
> > vp9_hor_32x32_8bpp_c: 510.2
> > vp9_hor_32x32_8bpp_rvv_i32: 264.0
> > ---
> >  libavcodec/riscv/vp9_intra_rvv.S | 56 ++++++++++++++++++++++++++++++++
> >  libavcodec/riscv/vp9dsp.h        |  6 ++++
> >  libavcodec/riscv/vp9dsp_init.c   |  3 ++
> >  3 files changed, 65 insertions(+)
> >
> > diff --git a/libavcodec/riscv/vp9_intra_rvv.S
> > b/libavcodec/riscv/vp9_intra_rvv.S index db9774c263..dd9bc036e7 100644
> > --- a/libavcodec/riscv/vp9_intra_rvv.S
> > +++ b/libavcodec/riscv/vp9_intra_rvv.S
> > @@ -113,3 +113,59 @@ func_dc dc_left  8   left 3  0  zve64x
> >  func_dc dc_top   32  top  5  1  zve32x
> >  func_dc dc_top   16  top  4  1  zve32x
> >  func_dc dc_top   8   top  3  0  zve64x
> > +
> > +func ff_h_32x32_rvv, zve32x
> > +        li           t0, 32
> > +        addi         a2, a2, 31
> > +        vsetvli      zero, t0, e8, m2, ta, ma
> > +
> > +        .rept 2
> > +        .irp n 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30
> > +        lbu          t1, (a2)
> > +        addi         a2, a2, -1
> > +        vmv.v.x      v\n, t1
> > +        .endr
> > +        .irp n 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30
> > +        vse8.v       v\n, (a0)
> > +        add          a0, a0, a1
> > +        .endr
> > +        .endr
>
> Do you gain much by unrolling all the way to 16x? Given that you have the
> counter value already in t0, it should not make much difference to just
> unroll
> 2x or maybe 4x and then loop.
>
> It might also be faster to use lhu or lwu and shift to reduce scalar
> loads, at
> least if the vector is suitably aligned.
>
> > +
> > +        ret
> > +endfunc
> > +
> > +func ff_h_16x16_rvv, zve32x
> > +        addi         a2, a2, 15
> > +        vsetivli     zero, 16, e8, m1, ta, ma
> > +
> > +        .irp n 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
> 22, 23
> > +        lbu          t1, (a2)
> > +        addi         a2, a2, -1
> > +        vmv.v.x      v\n, t1
> > +        .endr
> > +        .irp n 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
> > +        vse8.v       v\n, (a0)
> > +        add          a0, a0, a1
> > +        .endr
> > +        vse8.v       v23, (a0)
> > +
> > +        ret
> > +endfunc
> > +
> > +func ff_h_8x8_rvv, zve32x
> > +        addi         a2, a2, 7
> > +        vsetivli     zero, 8, e8, mf2, ta, ma
> > +
> > +        .irp n 8, 9, 10, 11, 12, 13, 14, 15
> > +        lbu          t1, (a2)
> > +        addi         a2, a2, -1
> > +        vmv.v.x      v\n, t1
> > +        .endr
> > +        .irp n 8, 9, 10, 11, 12, 13, 14
> > +        vse8.v       v\n, (a0)
> > +        add          a0, a0, a1
> > +        .endr
> > +        vse8.v       v15, (a0)
> > +
> > +        ret
> > +endfunc
> > diff --git a/libavcodec/riscv/vp9dsp.h b/libavcodec/riscv/vp9dsp.h
> > index b8ff282f8a..0ad961c7e0 100644
> > --- a/libavcodec/riscv/vp9dsp.h
> > +++ b/libavcodec/riscv/vp9dsp.h
> > @@ -66,6 +66,12 @@ void ff_v_16x16_rvi(uint8_t *dst, ptrdiff_t stride,
> const
> > uint8_t *l, const uint8_t *a);
> >  void ff_v_8x8_rvi(uint8_t *dst, ptrdiff_t stride, const uint8_t *l,
> >                    const uint8_t *a);
> > +void ff_h_32x32_rvv(uint8_t *dst, ptrdiff_t stride, const uint8_t *l,
> > +                    const uint8_t *a);
> > +void ff_h_16x16_rvv(uint8_t *dst, ptrdiff_t stride, const uint8_t *l,
> > +                    const uint8_t *a);
> > +void ff_h_8x8_rvv(uint8_t *dst, ptrdiff_t stride, const uint8_t *l,
> > +                  const uint8_t *a);
> >
> >  #define VP9_8TAP_RISCV_RVV_FUNC(SIZE, type, type_idx)
>
> >   \ void ff_put_8tap_##type##_##SIZE##h_rvv(uint8_t *dst, ptrdiff_t
> > dststride,   \ diff --git a/libavcodec/riscv/vp9dsp_init.c
> > b/libavcodec/riscv/vp9dsp_init.c index c10f8bbe41..7816b13fe0 100644
> > --- a/libavcodec/riscv/vp9dsp_init.c
> > +++ b/libavcodec/riscv/vp9dsp_init.c
> > @@ -59,6 +59,9 @@ static av_cold void
> > vp9dsp_intrapred_init_riscv(VP9DSPContext *dsp, int bpp)
> > dsp->intra_pred[TX_16X16][DC_129_PRED] = ff_dc_129_16x16_rvv;
> > dsp->intra_pred[TX_32X32][TOP_DC_PRED] = ff_dc_top_32x32_rvv;
> > dsp->intra_pred[TX_16X16][TOP_DC_PRED] = ff_dc_top_16x16_rvv; +
>
> > dsp->intra_pred[TX_32X32][HOR_PRED] = ff_h_32x32_rvv; +
> > dsp->intra_pred[TX_16X16][HOR_PRED] = ff_h_16x16_rvv; +
> > dsp->intra_pred[TX_8X8][HOR_PRED] = ff_h_8x8_rvv;
> >          }
> >      #endif
> >      #endif
>
>
> --
> Rémi Denis-Courmont
> http://www.remlab.net/
>
>
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request at ffmpeg.org with subject "unsubscribe".
>


More information about the ffmpeg-devel mailing list