[FFmpeg-devel] [PATCH v2 3/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1
Ramiro Polla
ramiro.polla at gmail.com
Thu Aug 22 14:29:57 EEST 2024
On Wed, Aug 21, 2024 at 9:44 PM Martin Storsjö <martin at martin.st> wrote:
> On Wed, 21 Aug 2024, Ramiro Polla wrote:
> >> BTW, this instruction is kinda exotic and the docs aren't super clear, so
> >> it'd be good to test manually that it really does what we want, for
> >> negative numbers and numbers close to the ends of the value range; I
> >> didn't do that manually yet.
> >
> > I prefer just sticking to sxtw + lsl then. When we move to ptrdiff_t
> > the sxtw will be gone anyway.
>
> This sounds like a very reasonable choice indeed, especially if it's
> somewhat plausible that we'll get rid of it at some point in the future.
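For readers less familiar with the AArch64 side of this: a 32-bit `int line_size` arrives in w1, and AAPCS64 does not guarantee the contents of the upper 32 bits of x1 for a 32-bit argument. The stride therefore has to be widened before it is used as the post-increment in `ld1 {v1.16b}, [x0], x1`. The C sketch below models `sxtw` next to a plain zero-extension to show why the signed form matters for negative strides. The helper names are hypothetical and only for illustration. Once the prototype takes a `ptrdiff_t`, the caller already passes a full 64-bit value, which is why the sxtw can then be dropped.

```c
#include <stdint.h>

/* Hypothetical helpers: model how a 32-bit "int line_size" held in w1
 * becomes the 64-bit stride in x1 that post-indexed loads add to x0. */

/* Models "sxtw x1, w1": bit 31 is replicated into the upper 32 bits.
 * (Converting an out-of-range value to int32_t is implementation-defined
 * before C23, but wraps as expected on GCC, Clang and MSVC.) */
static int64_t stride_sxtw(uint32_t w1)
{
    return (int64_t)(int32_t)w1;
}

/* Models a zero-extension (uxtw): the upper 32 bits are cleared,
 * so a negative stride turns into a huge positive one. */
static int64_t stride_uxtw(uint32_t w1)
{
    return (int64_t)w1;
}
```

With `line_size = -16`, `stride_sxtw` yields -16 while `stride_uxtw` yields 4294967280, i.e. the pointer would jump about 4 GiB forward instead of one row back.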
>
> >>> +        movi            v0.16b, #0
> >>> +        mov             w3, #16
> >>> +
> >>> +1:
> >>> +        ld1             {v1.16b}, [x0], x1
> >>> +        ld1             {v2.16b}, [x2], x1
> >>> +        subs            w3, w3, #2
> >>> +        uadalp          v0.8h, v1.16b
> >>> +        uadalp          v0.8h, v2.16b
> >>> +        b.ne            1b
> >>> +
> >>> +        uaddlv          s0, v0.8h
> >>> +        fmov            w0, s0
> >>> +
> >>> +        ret
> >>> +endfunc
> >>> +
> >>> +function ff_pix_norm1_neon, export=1
> >>> +// x0  const uint8_t *pix
> >>> +// x1  int line_size
> >>> +
> >>> +        sxtw            x1, w1
> >>> +        movi            v4.16b, #0
> >>> +        movi            v5.16b, #0
> >>> +        mov             w2, #16
> >>> +
> >>> +1:
> >>> +        ld1             {v1.16b}, [x0], x1
> >>> +        subs            w2, w2, #1
> >>> +        umull           v2.8h, v1.8b, v1.8b
> >>> +        umull2          v3.8h, v1.16b, v1.16b
> >>> +        uadalp          v4.4s, v2.8h
> >>> +        uadalp          v5.4s, v3.8h
> >>
> >> From my earlier testing on A53, it seemed (surprisingly) to be equally
> >> fast to accumulate into the same register for both instructions - but I
> >> only tested that on A53. So we could change that here, getting rid of the
> >> add at the end (and one movi). Or if it does help on some other core,
> >> perhaps we should do the same for the function above too?
> >
> > Indeed, it is equally fast to accumulate into the same register on the
> > A55 and A76 as well.
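A rough C model of the lane arithmetic can help check the accumulate-into-one-register change for overflow. It is a sketch, not FFmpeg code: the function names are hypothetical, the scalar references simply restate what pix_sum and pix_norm1 compute over a 16x16 block, and the pix_norm1 model assumes the single-accumulator variant discussed above (both uadalp into one 4x32-bit register). Worst case per lane: for pix_sum, 32 bytes × 255 = 8160 fits in the 16-bit lanes; for pix_norm1 with one accumulator, 64 squares × 65025 = 4161600 fits comfortably in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar references: sum and sum of squares over a 16x16 block. */
static int pix_sum_ref(const uint8_t *pix, ptrdiff_t stride)
{
    int s = 0;
    for (int y = 0; y < 16; y++) {
        const uint8_t *row = pix + y * stride;
        for (int x = 0; x < 16; x++)
            s += row[x];
    }
    return s;
}

static int pix_norm1_ref(const uint8_t *pix, ptrdiff_t stride)
{
    int s = 0;
    for (int y = 0; y < 16; y++) {
        const uint8_t *row = pix + y * stride;
        for (int x = 0; x < 16; x++)
            s += row[x] * row[x];
    }
    return s;
}

/* Lane model of pix_sum: each "uadalp v0.8h, vN.16b" adds adjacent byte
 * pairs into eight 16-bit lanes, wrapping like uint16_t would. */
static int pix_sum_lanes(const uint8_t *pix, ptrdiff_t stride)
{
    uint16_t acc[8] = {0};
    for (int y = 0; y < 16; y++) {
        const uint8_t *row = pix + y * stride;
        for (int l = 0; l < 8; l++)
            acc[l] = (uint16_t)(acc[l] + row[2 * l] + row[2 * l + 1]);
    }
    uint32_t total = 0;            /* "uaddlv s0, v0.8h": widen and add */
    for (int l = 0; l < 8; l++)
        total += acc[l];
    return (int)total;
}

/* Lane model of pix_norm1 with a single accumulator: umull/umull2 square
 * the low/high 8 bytes into 16-bit lanes, then both uadalp target the
 * same four 32-bit lanes. */
static int pix_norm1_lanes(const uint8_t *pix, ptrdiff_t stride)
{
    uint32_t acc[4] = {0};
    for (int y = 0; y < 16; y++) {
        const uint8_t *row = pix + y * stride;
        uint16_t lo[8], hi[8];
        for (int i = 0; i < 8; i++) {
            lo[i] = (uint16_t)(row[i] * row[i]);          /* umull  */
            hi[i] = (uint16_t)(row[8 + i] * row[8 + i]);  /* umull2 */
        }
        for (int l = 0; l < 4; l++) {
            acc[l] += (uint32_t)lo[2 * l] + lo[2 * l + 1]; /* uadalp, low  */
            acc[l] += (uint32_t)hi[2 * l] + hi[2 * l + 1]; /* uadalp, high */
        }
    }
    uint32_t total = 0;  /* final horizontal add (not in the quoted snippet) */
    for (int l = 0; l < 4; l++)
        total += acc[l];
    return (int)total;
}
```

Filling the block with 255 exercises the worst case: the lane models then return 65280 and 16646400, matching the scalar references.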
> >
> > New patches attached (patch 3/7 has functional changes, but patch 4/7
> > only changes the commit message to reflect the new test run).
>
> LGTM very much now, thanks! And thanks for your patience through all the
> iterations on such trivial patches as these.
And thank you for your patience through the reviews :). I'm slowly
getting up to speed with aarch64 and neon.
I'll apply the pix_sum and pix_norm1 patches, and I'll wait a few days
for any comments on the draw_edges patches.