[FFmpeg-devel] [PATCH v2 3/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1

Wed Aug 21 22:44:41 EEST 2024

On Wed, 21 Aug 2024, Ramiro Polla wrote:

>> BTW, this instruction is kinda exotic and the docs aren't super clear, so
>> it'd be good to test manually that it really does what we want, for
>> negative numbers and numbers close to the ends of the value range; I
>> didn't do that manually yet.
>
> I prefer just sticking to sxtw + lsl then. When we move to ptrdiff_t
> the sxtw will be gone anyway.

This sounds like a very reasonable choice indeed, especially if it's 
somewhat plausible that we'll get rid of it at some point in the future.
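
For anyone reading along, here is a rough scalar sketch of what the sxtw +
lsl pair is for. It assumes the setup elided from the quote above widens the
32-bit line_size to 64 bits and doubles it, so that each iteration steps two
rows. pix_sum_sketch is a made-up name for illustration, not a symbol from
the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a two-rows-per-iteration pix_sum over a
 * 16x16 block. Not the code from the patch. */
static int pix_sum_sketch(const uint8_t *pix, int line_size)
{
    /* sxtw: widen the 32-bit int to pointer width, keeping its sign, so a
     * negative stride still walks backwards through memory. */
    ptrdiff_t stride = line_size;
    unsigned sum = 0;

    /* y advances by 2, so the row pointers effectively move by 2 * stride
     * per iteration (the lsl #1 in the assumed setup). */
    for (int y = 0; y < 16; y += 2) {
        const uint8_t *row0 = pix + y * stride;
        const uint8_t *row1 = row0 + stride;

        for (int x = 0; x < 16; x++)
            sum += row0[x] + row1[x];
    }
    return (int)sum;
}
```

With line_size declared as ptrdiff_t the widening step disappears, since the
argument already arrives at full register width.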

>>> +        movi            v0.16b, #0
>>> +        mov             w3, #16
>>> +
>>> +1:
>>> +        ld1             {v1.16b}, [x0], x1
>>> +        ld1             {v2.16b}, [x2], x1
>>> +        subs            w3, w3, #2
>>> +        uadalp          v0.8h, v1.16b
>>> +        uadalp          v0.8h, v2.16b
>>> +        b.ne            1b
>>> +
>>> +        uaddlv          s0, v0.8h
>>> +        fmov            w0, s0
>>> +
>>> +        ret
>>> +endfunc
>>> +
>>> +function ff_pix_norm1_neon, export=1
>>> +// x0  const uint8_t *pix
>>> +// x1  int line_size
>>> +
>>> +        sxtw            x1, w1
>>> +        movi            v4.16b, #0
>>> +        movi            v5.16b, #0
>>> +        mov             w2, #16
>>> +
>>> +1:
>>> +        ld1             {v1.16b}, [x0], x1
>>> +        subs            w2, w2, #1
>>> +        umull           v2.8h, v1.8b,  v1.8b
>>> +        umull2          v3.8h, v1.16b, v1.16b
>>> +        uadalp          v4.4s, v2.8h
>>> +        uadalp          v5.4s, v3.8h
>>
>> From my earlier testing on A53, it seemed (surprisingly) to be equally
>> fast to accumulate into the same register for both instructions - but I
>> only tested that on A53. So we could change that here, getting rid of the
>> add at the end (and one movi). Or if it does help on some other core,
>> perhaps we should do the same for the function above too?
>
> Indeed, it is equally fast to accumulate into the same register on the
> A55 and A76 as well.
>
> New patches attached (patch 3/7 has functional changes, but patch 4/7
> only changes the commit message to reflect the new test run).
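
As a sanity check on merging the accumulators, here is a minimal scalar
model (hypothetical helper names, plain C rather than NEON) showing that one
accumulator gives the same sum of squares as two that are added at the end.
With a single register, each 32-bit lane takes at most 4 squared bytes per
row, so 64 * 255 * 255 = 4161600 over 16 rows, well within range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: low and high halves of each row go into separate
 * accumulators, which are added together at the end. */
static int pix_norm1_two_acc(const uint8_t *pix, ptrdiff_t line_size)
{
    uint32_t lo = 0, hi = 0;

    for (int y = 0; y < 16; y++) {
        const uint8_t *row = pix + y * line_size;

        for (int x = 0; x < 8; x++) {
            lo += (uint32_t)row[x]     * row[x];      /* umull  half */
            hi += (uint32_t)row[x + 8] * row[x + 8];  /* umull2 half */
        }
    }
    return (int)(lo + hi);
}

/* Hypothetical model: both halves accumulate into the same place, so no
 * final add (and no second zeroing) is needed. */
static int pix_norm1_one_acc(const uint8_t *pix, ptrdiff_t line_size)
{
    uint32_t acc = 0;

    for (int y = 0; y < 16; y++) {
        const uint8_t *row = pix + y * line_size;

        for (int x = 0; x < 16; x++)
            acc += (uint32_t)row[x] * row[x];
    }
    return (int)acc;
}
```

This only models the arithmetic; whether the single-register form is as fast
is a per-core question, which is what the A53, A55 and A76 runs above answer.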

LGTM very much now, thanks! And thanks for your patience through all the 
iterations on such trivial patches as these.

// Martin