[FFmpeg-devel] [PATCH v2 3/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1

Thu Aug 22 14:29:57 EEST 2024

On Wed, Aug 21, 2024 at 9:44 PM Martin Storsjö <martin at martin.st> wrote:
> On Wed, 21 Aug 2024, Ramiro Polla wrote:
> >> BTW, this instruction is kinda exotic and the docs aren't super clear, so
> >> it'd be good to test manually that it really does what we want, for
> >> negative numbers and numbers close to the ends of the value range; I
> >> didn't do that manually yet.
> >
> > I prefer just sticking to sxtw + lsl then. When we move to ptrdiff_t
> > the sxtw will be gone anyway.
>
> This sounds like a very reasonable choice indeed, especially if it's
> somewhat plausible that we'll get rid of it at some point in the future.
>
> >>> +        movi            v0.16b, #0
> >>> +        mov             w3, #16
> >>> +
> >>> +1:
> >>> +        ld1             {v1.16b}, [x0], x1
> >>> +        ld1             {v2.16b}, [x2], x1
> >>> +        subs            w3, w3, #2
> >>> +        uadalp          v0.8h, v1.16b
> >>> +        uadalp          v0.8h, v2.16b
> >>> +        b.ne            1b
> >>> +
> >>> +        uaddlv          s0, v0.8h
> >>> +        fmov            w0, s0
> >>> +
> >>> +        ret
> >>> +endfunc
> >>> +
> >>> +function ff_pix_norm1_neon, export=1
> >>> +// x0  const uint8_t *pix
> >>> +// x1  int line_size
> >>> +
> >>> +        sxtw            x1, w1
> >>> +        movi            v4.16b, #0
> >>> +        movi            v5.16b, #0
> >>> +        mov             w2, #16
> >>> +
> >>> +1:
> >>> +        ld1             {v1.16b}, [x0], x1
> >>> +        subs            w2, w2, #1
> >>> +        umull           v2.8h, v1.8b,  v1.8b
> >>> +        umull2          v3.8h, v1.16b, v1.16b
> >>> +        uadalp          v4.4s, v2.8h
> >>> +        uadalp          v5.4s, v3.8h
> >>
> >> From my earlier testing on A53, it seemed (surprisingly) to be equally
> >> fast to accumulate into the same register for both instructions - but I
> >> only tested that on A53. So we could change that here, getting rid of the
> >> add at the end (and one movi). Or if it does help on some other core,
> >> perhaps we should do the same for the function above too?
> >
> > Indeed, it is equally fast to accumulate into the same register on the
> > A55 and A76 as well.
> >
> > New patches attached (patch 3/7 has functional changes, but patch 4/7
> > only changes the commit message to reflect the new test run).
>
> LGTM very much now, thanks! And thanks for your patience through all the
> iterations on such trivial patches as these.

And thank you for your patience through the reviews :). I'm slowly
getting up to speed with aarch64 and neon.

I'll apply the pix_sum and pix_norm1 patches, and I'll wait a few days
for any comments on the draw_edges patches.
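
For anyone following the thread without the C side at hand, here is a
minimal scalar sketch of what the two NEON functions above are expected
to compute over a 16x16 block. The names pix_sum_ref and pix_norm1_ref
are hypothetical, and the ptrdiff_t stride is an assumption that
anticipates the ptrdiff_t move mentioned above (the current prototype
takes an int line_size). This is not the code from the tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: sum of all 256 pixels of a 16x16 block.
 * The largest possible result is 256 * 255 = 65280. */
static int pix_sum_ref(const uint8_t *pix, ptrdiff_t line_size)
{
    int sum = 0;

    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += pix[x];
        pix += line_size;   /* advance one row; stride may exceed 16 */
    }
    return sum;
}

/* Hypothetical reference: sum of squares of all 256 pixels.
 * The largest possible result is 256 * 255 * 255 = 16646400. */
static int pix_norm1_ref(const uint8_t *pix, ptrdiff_t line_size)
{
    int sum = 0;

    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += pix[x] * pix[x];
        pix += line_size;
    }
    return sum;
}
```

The bounds in the comments are also why the accumulator widths in the
quoted assembly look safe to me: in pix_sum each 16-bit lane receives
two bytes per row, so at most 32 * 255 = 8160 over the block, and in
pix_norm1 with a single accumulator each 32-bit lane receives four
products per row, so at most 64 * 65025 = 4161600.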