[FFmpeg-devel] [PATCH 2/7] avcodec/aarch64/mpegvideoencdsp: add neon implementations for pix_sum and pix_norm1

Ramiro Polla ramiro.polla at gmail.com
Mon Aug 19 15:00:25 EEST 2024


On Mon, Aug 19, 2024 at 11:46 AM Martin Storsjö <martin at martin.st> wrote:
> On Sun, 18 Aug 2024, Ramiro Polla wrote:
> > I had tested the real world case on the A76, but not on the A53. I
> > spent a couple of hours with perf trying to find the source of the
> > discrepancy but I couldn't find anything conclusive. I need to learn
> > more about how to test cache misses.
>
> Nah, I guess that's a bit overkill...
>
> > I just tested again with the following command:
> > $ taskset -c 2 ./ffmpeg_g -benchmark -f lavfi -i
> > "testsrc2=size=1920x1080" -vcodec mpeg4 -q 31 -vframes 100 -f rawvideo
> > -y /dev/null
> >
> > The entire test was about 1% faster unrolled on A53, but about 1%
> > slower unrolled on A76 (I had the Raspberry Pi 5 in mind for these
> > optimizations, so I preferred choosing the version that was faster on
> > the A76).
>
> > I wonder if there is any way we could check at runtime.
>
> There are indeed often cases where functions could be tuned differently
> for older/newer or in-order/out-of-order cores. In most cases, trying to
> specialize things is a bit of a waste and overkill though; I'd usually
> just suggest going with a compromise.
>
> (Sometimes, different kinds of tunings can be applied if you use e.g. the
> flag dotprod to differentiate between older and newer cores. But it's
> seldom worth the extra effort to do that.)
>
>
> Right, so looking at your unrolled case, you've done a full unroll. That's
> probably also a bit overkill.
>
> The in-order cores really hate tight loops where almost everything has a
> sequential dependency on the previous instruction - so the general rule of
> thumb is that you'll want to unroll by a factor of 2, unless the algorithm
> itself has enough complexity that there's two separate dependency chains
> interlinked.
>
> Also, from your unrolled version, there's a slight bug in it:
>
> > +        add             x2, x0, w1, sxtw
> > +        lsl             w1, w1, #1
>
> If the stride is a negative number, the first sxtw does the right thing,
> but the "lsl w1, w1, #1" will zero out the upper half of the register.

I'll start adding negative stride tests to checkasm to spot these bugs.

> So for that, you'd still need to keep the "sxtw x1, w1" instruction, and
> do the lsl on x1 instead. It is actually possible to merge it into one
> instruction though, with "sbfiz x1, x1, #1, #32", if I read the docs
> right. But that's a much more uncommon instruction...
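To make the register-width issue concrete, here is a small C model of the three instruction sequences discussed above. It is only a sketch: the function names are made up for illustration, and the only assumption about the hardware is the architectural rule that writing a W register zeroes bits 63:32 of the corresponding X register.

```c
#include <stdint.h>

/* Hypothetical model of "lsl w1, w1, #1": the shift is done on the low
 * 32 bits, and writing the W register zeroes the upper 32 bits of x1. */
static uint64_t model_lsl_w(uint64_t x1)
{
    uint32_t w = (uint32_t)x1;
    return (uint64_t)(uint32_t)(w << 1);
}

/* Hypothetical model of "sxtw x1, w1" followed by "lsl x1, x1, #1". */
static uint64_t model_sxtw_lsl_x(uint64_t x1)
{
    uint64_t v = x1 & UINT64_C(0xffffffff);
    if (v & UINT64_C(0x80000000))           /* sign bit of the 32-bit value */
        v |= UINT64_C(0xffffffff00000000);  /* sign-extend to 64 bits */
    return v << 1;
}

/* Hypothetical model of "sbfiz x1, x1, #1, #32": take the low 32 bits,
 * insert them at bit 1, zero the bits below the field and fill the bits
 * above it with copies of the field's top bit. */
static uint64_t model_sbfiz_1_32(uint64_t x1)
{
    uint64_t field  = x1 & UINT64_C(0xffffffff);
    uint64_t result = field << 1;
    if (field & UINT64_C(0x80000000))
        result |= ~UINT64_C(0) << 33;       /* replicate sign into bits 63:33 */
    return result;
}
```

With a stride of -1920 in w1 (0xfffff880), model_lsl_w() gives 0x00000000fffff100, a large positive offset, while the other two both give 0xfffffffffffff100, i.e. -3840. For positive strides all three agree, which is why such a bug goes unnoticed without negative stride tests.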
>
> As for optimal performance here - I tried something like this:
>
>          movi            v0.16b, #0
>          add             x2, x0, w1, sxtw
>          sbfiz           x1, x1, #1, #32
>          mov             w3, #16
>
> 1:
>          ld1             {v1.16b}, [x0], x1
>          ld1             {v2.16b}, [x2], x1
>          subs            w3, w3, #2
>          uadalp          v0.8h, v1.16b
>          uadalp          v0.8h, v2.16b
>          b.ne            1b
>
>          uaddlv          s0, v0.8h
>          fmov            w0, s0
>
>          ret
>
> With this, I'm down from your 120 cycles on the A53 originally, to 78.7
> cycles now. Your fully unrolled version seemed to run in 72 cycles on the
> A53, so that's obviously even faster, but I think this kind of tradeoff
> might be the sweet spot. What does such a version give you in terms of
> real world speed?
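For reference, the loop structure of the version quoted above can be sketched in scalar C roughly as follows. This is not the FFmpeg C implementation. The function name and the int stride parameter are assumptions for illustration, and it only mirrors the shape of the assembly: 16 rows of 16 bytes, two rows per iteration, with the stride sign-extended once before it is used.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model: sum of a 16x16 block of 8-bit pixels,
 * handling two rows per loop iteration like the NEON loop. */
static int pix_sum_unroll2_model(const uint8_t *pix, int line_size)
{
    ptrdiff_t stride = line_size;   /* widen once, as sxtw/sbfiz would */
    unsigned sum = 0;

    for (int y = 0; y < 16; y += 2) {
        const uint8_t *row0 = pix + y * stride;  /* plays the role of x0 */
        const uint8_t *row1 = row0 + stride;     /* plays the role of x2 */

        for (int x = 0; x < 16; x++)
            sum += row0[x];                      /* first uadalp */
        for (int x = 0; x < 16; x++)
            sum += row1[x];                      /* second uadalp */
    }
    return (int)sum;
}
```

With a negative line_size, pix has to point at the row that is processed first (the last row in memory), and the result matches the positive stride case. In the assembly each 16-bit lane of v0 accumulates at most 64 bytes (64 * 255 = 16320), so the .8h accumulator cannot overflow for a 16x16 block.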

This version is around 0.5% slower overall on the A76. Very roughly,
these are the total times taken by pix_sum and pix_norm1 with the
different implementations on the A76:
C:              ~5%
fully unrolled: ~3%
unroll 2:       ~2.5%
tight loop:     ~2%

