[FFmpeg-devel] [PATCH 3/3] lavc/svq1enc: R-V V ssd_int8_vs_int16
Rémi Denis-Courmont
remi at remlab.net
Sun Jan 7 10:03:00 EET 2024
On Sunday, 7 January 2024, 3:33:39 EET, flow gg wrote:
> I tested it, and indeed using vwsub is faster. Updated it in the reply.
>
> ---
>
> I have a question: if I tweak the load order a bit so that it uses one fewer
> vsetvli, it gets slower (the patch I submitted takes 13.2; with the
> following change, it takes 15.2).
> But I expected it to be faster.
I would guess that v0 is needed before v8 in the internal implementation of
vwsub. This kind of makes sense, as the elements still need to be sign-extended.
Thus vwsub ends up stalling the pipeline while it waits for vle8 to complete.
That's just a guess though, as I don't have internal cycle timing documentation.
> - vsetvli t0, a2, e8, m2, tu, ma
> - vle8.v v0, (a0)
> - sub a2, a2, t0
> - vsetvli zero, t0, e16, m4, tu, ma
> - vle16.v v8, (a1)
> - vsetvli zero, t0, e8, m2, tu, ma
> - vwsub.wv v16, v8, v0
>
> + vsetvli t0, a2, e16, m4, tu, ma
> + vle16.v v8, (a1)
> + sub a2, a2, t0
> + vsetvli zero, t0, e8, m2, tu, ma
> + vle8.v v0, (a0)
> + vwsub.wv v16, v8, v0
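
For readers less familiar with the widening subtract, here is a rough scalar
model in C of what the quoted loop computes per element: vwsub.wv takes the
wide (16-bit) operand and subtracts the narrow (8-bit) operand after
sign-extending it. The function names, the max_vl parameter, and the
square-and-accumulate step are assumptions made for illustration; they are not
taken from the patch, and the model assumes each difference fits in 16 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one vwsub.wv lane: wide operand minus the narrow operand,
 * which is sign-extended to 16 bits first. Hypothetical helper name. */
static int16_t vwsub_wv_lane(int16_t wide, int8_t narrow)
{
    return (int16_t)(wide - (int16_t)narrow);
}

/* Hypothetical scalar reference for a sum of squared differences between an
 * int8 array and an int16 array. The outer loop mimics strip-mining:
 * max_vl (assumed > 0) stands in for the vector length granted by vsetvli. */
static int ssd_int8_vs_int16_model(const int8_t *pix1, const int16_t *pix2,
                                   size_t size, size_t max_vl)
{
    int score = 0;

    while (size > 0) {
        /* like "vsetvli t0, a2, ...": take at most max_vl elements */
        size_t vl = size < max_vl ? size : max_vl;

        for (size_t i = 0; i < vl; i++) {
            /* like "vwsub.wv v16, v8, v0": int16 minus sign-extended int8 */
            int d = vwsub_wv_lane(pix2[i], pix1[i]);
            score += d * d; /* assumed follow-up step, not shown in the diff */
        }

        pix1 += vl;
        pix2 += vl;
        size -= vl; /* like "sub a2, a2, t0" */
    }
    return score;
}
```

The result does not depend on max_vl, which is the point of strip-mining: the
chunk size only affects how many iterations the outer loop takes. Nothing in
this model says anything about the timing difference discussed above.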
--
雷米‧德尼-库尔蒙
http://www.remlab.net/