[FFmpeg-devel] [aarch64] yuv2planeX - unroll outer loop by 4 to increase performance by 6.3%

Martin Storsjö martin at martin.st
Mon Sep 14 23:11:55 EEST 2020


Hi,

On Tue, 18 Aug 2020, Sebastian Pop wrote:

> Unrolling the outer loop in yuv2planeX by 4 reduces the number of cache
> accesses by 7.5%.
> The values loaded for the filter are reused across the 4 unrolled iterations,
> which avoids reloading the same values 3 times.
> The performance was measured on an Arm64 Neoverse-N1 Graviton2 c6g.metal
> instance with the following command:
> $ perf stat -e cache-references ./ffmpeg_g -nostats -f lavfi -i
> testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
>
> before: 1551591469      cache-references
> after:  1436140431      cache-references
>
> before: [bench @ 0xaaaac62b7d30] t:0.013226 avg:0.013219 max:0.013537
> min:0.012975
> after:  [bench @ 0xaaaad84f3d30] t:0.012355 avg:0.012381 max:0.013164
> min:0.012158

This function (like most in swscale) lacks a checkasm test; it'd be great 
to have one, to allow proper microbenchmarking of different tunings of it. 
The existing tests in tests/checkasm/sw*.c serve as decent examples of how 
to write one. This isn't a blocker, but checkasm is a great tool for 
testing asm, and I personally try to add tests for anything I'm writing 
asm for, if there isn't one already.

When targeting Linux, checkasm ends up using the perf cycle counter API by 
default. That is a bit inexact for fine-tuning asm, so I personally prefer 
configuring with --disable-linux-perf and loading a kernel module that 
allows usermode access to the raw cycle counter register - e.g. the module 
from https://code.videolan.org/janne/arm64-cycle-cnt should work fine.
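
Roughly, from memory (so double-check the exact flags and targets against 
your tree), and once a test for this function exists:

$ ./configure --disable-linux-perf ...   # plus your usual configure options
$ make checkasm
$ tests/checkasm/checkasm --bench        # with the cycle counter module loaded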


> diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
> index af71de6050..2cba810e16 100644
> --- a/libswscale/aarch64/output.S
> +++ b/libswscale/aarch64/output.S
> @@ -26,10 +26,13 @@ function ff_yuv2planeX_8_neon, export=1
>          ext                 v0.8B, v0.8B, v0.8B, #3         // honor offsetting which can be 0 or 3 only
>  1:      uxtl                v0.8H, v0.8B                    // extend dither to 16-bit
>          ushll               v1.4S, v0.4H, #12               // extend dither to 32-bit with left shift by 12 (part 1)
> -        ushll2              v2.4S, v0.8H, #12               // extend dither to 32-bit with left shift by 12 (part 2)
> +        ushll2              v0.4S, v0.8H, #12               // extend dither to 32-bit with left shift by 12 (part 2)
> +        tst                 w4, #0x1f                       // check dstW % 32
>          mov                 x7, #0                          // i = 0
> +        b.eq                4f                              // jump to outer loop unrolled by 4
> +
>  2:      mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
> -        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
> +        mov                 v4.16B, v0.16B                  // initialize accumulator part 2 with dithering value
>          mov                 w8, w1                          // tmpfilterSize = filterSize
>          mov                 x9, x2                          // srcp    = src
>          mov                 x10, x0                         // filterp = filter
> @@ -55,4 +58,73 @@ function ff_yuv2planeX_8_neon, export=1
>          add                 x7, x7, #8                      // i += 8
>          b.gt                2b                              // loop until width consumed
>          ret
> +
> +// Outer loop unrolled by 4 to maximize reuse of filter values.
> +4:      mov                 v2.16b, v0.16b                  // initialize accumulator with dithering value
> +        mov                 v3.16b, v0.16b                  // initialize accumulator with dithering value
> +        mov                 v4.16b, v0.16b                  // initialize accumulator with dithering value
> +        mov                 v5.16b, v0.16b                  // initialize accumulator with dithering value
> +        mov                 v6.16b, v1.16b                  // initialize accumulator with dithering value
> +        mov                 v7.16b, v1.16b                  // initialize accumulator with dithering value
> +        mov                 v16.16b, v1.16b                 // initialize accumulator with dithering value
> +        mov                 v17.16b, v1.16b                 // initialize accumulator with dithering value

For asm like this, I personally prefer to align the columns vertically. In 
this file the existing code isn't aligned already, so there's not much of 
an example to follow, but I normally align columns that contain vector 
references like v16.16b according to the max width of those, and GPR 
references (and references to the vector registers without the layout 
appended, like q19) according to the max width for those (that is, 3 
chars). Columns with other constructs (memory operands etc.) are aligned 
more ad hoc.

It makes the asm a bit more readable IMO.
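
For illustration only, the start of this block would look roughly like 
this with the columns aligned (whitespace changes only):

4:      mov                 v2.16b,  v0.16b         // initialize accumulator with dithering value
        mov                 v3.16b,  v0.16b         // initialize accumulator with dithering value
        ...
        mov                 v16.16b, v1.16b         // initialize accumulator with dithering value
        mov                 v17.16b, v1.16b         // initialize accumulator with dithering value
        mov                 w8,  w1                 // tmpfilterSize = filterSize
        mov                 x9,  x2                 // srcp    = src
        mov                 x10, x0                 // filterp = filter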

> +        mov                 w8, w1                          // tmpfilterSize = filterSize
> +        mov                 x9, x2                          // srcp    = src
> +        mov                 x10, x0                         // filterp = filter
> +
> +5:      ldp                 x15, x14, [x9], #16             // get 2 pointers: src[j] and src[j+1]
> +        ld1r                {v18.8h}, [x10], #2             // read 1x16-bit coeff X at filter[j  ] and duplicate across lanes
> +        ldrh                w12, [x10], #2                  // read 1x16-bit coeff Y at filter[j+1]

Here there's a data dependency between the first load, which updates x10, 
and the second one. Moving the first load from x10 to before the "ldp x15, 
x14" would help on in-order cores.

> +        dup                 v27.4h, w12                     // and duplicate across low lanes
> +        dup                 v27.8h, w12                     // and duplicate across high lanes

Umm, what? A dup to v27.8h will fill the lower 4 elements as well. I don't 
see why this needs to be handled differently from the ld1r used for the 
filter[j] coefficient.
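
I.e. it could simply be a second ld1r (untested sketch):

        ld1r                {v27.8h}, [x10], #2     // load filter[j+1] and duplicate across all 8 lanes

(And if the ldrh is kept for other reasons, a single "dup v27.8h, w12" 
already fills all 8 lanes, so the dup to v27.4h is redundant either way.)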

> +
> +        add                 x15, x15, x7, lsl #1            // &src[j  ][i]
> +        ldp                 q19, q20, [x15]                 // read 16x16-bit @ src[j  ][i + {0..15}]
> +        ldp                 q23, q24, [x15, #32]            // read 16x16-bit @ src[j  ][i + {16..32}]
> +
> +        add                 x14, x14, x7, lsl #1            // &src[j+1][i]
> +        ldp                 q21, q22, [x14]                 // read 16x16-bit @ src[j+1][i + {0..15}]
> +        ldp                 q25, q26, [x14, #32]            // read 16x16-bit @ src[j+1][i + {16..32}

Are the loads from x14/x15 guaranteed to be aligned here? (IIRC ldp does 
require the addresses to be aligned.) If this were written using ld1 (like 
the existing original code), alignment wouldn't matter. That disallows 
using the immediate offset form though, but an ld1 can load 64 bytes in 
one go anyway. And doing both adds first, followed by two ld1s, avoids the 
data dependencies a bit.
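
Roughly like this (untested sketch; note that ld1 with four registers 
needs consecutively numbered registers, so the register numbering would 
have to be reshuffled a bit):

        add                 x15, x15, x7, lsl #1    // &src[j  ][i]
        add                 x14, x14, x7, lsl #1    // &src[j+1][i]
        ld1                 {v19.8h, v20.8h, v21.8h, v22.8h}, [x15] // 32x16-bit from src[j  ][i + {0..31}], no alignment requirement
        ld1                 {v23.8h, v24.8h, v25.8h, v26.8h}, [x14] // 32x16-bit from src[j+1][i + {0..31}]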

> +
> +        smlal               v17.4s, v19.4h, v18.4h          // val0 += {A,B,C,D} * X
> +        smlal               v16.4s, v20.4h, v18.4h          // ... (see comments in non unrolled code)
> +        smlal               v7.4s, v23.4h, v18.4h
> +        smlal               v6.4s, v24.4h, v18.4h
> +
> +        smlal2              v2.4s, v19.8h, v18.8h
> +        smlal2              v3.4s, v20.8h, v18.8h
> +        smlal2              v4.4s, v23.8h, v18.8h
> +        smlal2              v5.4s, v24.8h, v18.8h
> +
> +        smlal               v17.4s, v21.4h, v27.4h
> +        smlal               v16.4s, v22.4h, v27.4h
> +        smlal               v7.4s, v25.4h, v27.4h
> +        smlal               v6.4s, v26.4h, v27.4h
> +
> +        smlal2              v2.4s, v21.8h, v27.8h
> +        smlal2              v3.4s, v22.8h, v27.8h
> +        smlal2              v4.4s, v25.8h, v27.8h
> +        smlal2              v5.4s, v26.8h, v27.8h
> +
> +        subs                w8, w8, #2                      // tmpfilterSize -= 2

I'd suggest moving the subs earlier, to avoid the dependency between the 
subs and the b.gt below.
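
E.g. something like this (untested):

        smlal2              v4.4s, v25.8h, v27.8h
        subs                w8, w8, #2              // tmpfilterSize -= 2, well before the b.gt needs the flags
        smlal2              v5.4s, v26.8h, v27.8h
        b.gt                5b                      // loop until filterSize consumed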

> +        b.gt                5b                              // loop until filterSize consumed
> +
> +        sqshrun             v17.4h, v17.4s, #16             // clip16(val0>>16)
> +        sqshrun             v16.4h, v16.4s, #16             // ... (see comments in non unrolled code)
> +        sqshrun             v7.4h, v7.4s, #16
> +        sqshrun             v6.4h, v6.4s, #16
> +        sqshrun2            v17.8h, v2.4s, #16
> +        sqshrun2            v16.8h, v3.4s, #16
> +        sqshrun2            v7.8h, v4.4s, #16
> +        sqshrun2            v6.8h, v5.4s, #16
> +        uqshrn              v2.8b, v17.8h, #3
> +        uqshrn              v3.8b, v16.8h, #3
> +        uqshrn              v4.8b, v7.8h, #3
> +        uqshrn              v5.8b, v6.8h, #3
> +        stp                 d2, d3, [x3], #16               // write to destination
> +        stp                 d4, d5, [x3], #16               // write to destination
> +        subs                w4, w4, #32                     // dstW -= 32

Maybe move the subs even further back here, between the uqshrn and the 
stp?

And the two stps could be converted into one single st1 here as well.
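
I.e. roughly (untested sketch):

        uqshrn              v4.8b, v7.8h, #3
        uqshrn              v5.8b, v6.8h, #3
        subs                w4, w4, #32             // dstW -= 32, scheduled ahead of the store
        st1                 {v2.8b, v3.8b, v4.8b, v5.8b}, [x3], #32 // write 32 bytes to destination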


// Martin


