[FFmpeg-devel] [aarch64] yuv2planeX - unroll outer loop by 4 to increase performance by 6.3%
Martin Storsjö
martin at martin.st
Mon Sep 14 23:11:55 EEST 2020
Hi,
On Tue, 18 Aug 2020, Sebastian Pop wrote:
> Unrolling the outer loop in yuv2planeX by 4 reduces the number of cache
> accesses by 7.5%.
> The values loaded for the filter are reused across the 4 unrolled iterations,
> which avoids reloading the same values 3 times.
> The performance was measured on an Arm64 Neoverse-N1 Graviton2 c6g.metal
> instance with the following command:
> $ perf stat -e cache-references ./ffmpeg_g -nostats -f lavfi -i
> testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
>
> before: 1551591469 cache-references
> after: 1436140431 cache-references
>
> before: [bench @ 0xaaaac62b7d30] t:0.013226 avg:0.013219 max:0.013537
> min:0.012975
> after: [bench @ 0xaaaad84f3d30] t:0.012355 avg:0.012381 max:0.013164
> min:0.012158
This function (like most in swscale) lacks a checkasm test; it'd be great
to have one, to allow for proper microbenchmarks of different tunings on
it. The existing tests in tests/checkasm/sw*.c serve as decent examples of
how to write one. This isn't a blocker, but checkasm is a
great tool for testing asm, and I personally try to add tests for anything
I'm writing asm for if there isn't already one.
When targeting Linux, checkasm ends up using the perf cycle counter API by
default. This is a bit inexact for fine-tuning asm, so personally I prefer
configuring with --disable-linux-perf and loading a kernel
module to allow usermode access to the raw cycle counter register - e.g.
the module from https://code.videolan.org/janne/arm64-cycle-cnt should
work fine.
> diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
> index af71de6050..2cba810e16 100644
> --- a/libswscale/aarch64/output.S
> +++ b/libswscale/aarch64/output.S
> @@ -26,10 +26,13 @@ function ff_yuv2planeX_8_neon, export=1
> ext v0.8B, v0.8B, v0.8B, #3 // honor offsetting which can be 0 or 3 only
> 1: uxtl v0.8H, v0.8B // extend dither to 16-bit
> ushll v1.4S, v0.4H, #12 // extend dither to 32-bit with left shift by 12 (part 1)
> - ushll2 v2.4S, v0.8H, #12 // extend dither to 32-bit with left shift by 12 (part 2)
> + ushll2 v0.4S, v0.8H, #12 // extend dither to 32-bit with left shift by 12 (part 2)
> + tst w4, #0x1f // check dstW % 32
> mov x7, #0 // i = 0
> + b.eq 4f // jump to outer loop unrolled by 4
> +
> 2: mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value
> - mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value
> + mov v4.16B, v0.16B // initialize accumulator part 2 with dithering value
> mov w8, w1 // tmpfilterSize = filterSize
> mov x9, x2 // srcp = src
> mov x10, x0 // filterp = filter
> @@ -55,4 +58,73 @@ function ff_yuv2planeX_8_neon, export=1
> add x7, x7, #8 // i += 8
> b.gt 2b // loop until width consumed
> ret
> +
> +// Outer loop unrolled by 4 to maximize reuse of filter values.
> +4: mov v2.16b, v0.16b // initialize accumulator with dithering value
> + mov v3.16b, v0.16b // initialize accumulator with dithering value
> + mov v4.16b, v0.16b // initialize accumulator with dithering value
> + mov v5.16b, v0.16b // initialize accumulator with dithering value
> + mov v6.16b, v1.16b // initialize accumulator with dithering value
> + mov v7.16b, v1.16b // initialize accumulator with dithering value
> + mov v16.16b, v1.16b // initialize accumulator with dithering value
> + mov v17.16b, v1.16b // initialize accumulator with dithering value
For asm like this, I personally prefer to align the columns vertically. In
this file, the existing code isn't already aligned, so there's not much of
an example to follow, but I normally align columns that contain vector
references like v16.16b according to the max width of those, and GPR
references (and references to the vector registers without the layout
appended, like q19) against the max width for those (that is, 3 chars).
Columns with other constructs (memory operands etc) are aligned more
ad hoc.
It makes the asm a bit more readable IMO.
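Purely as an illustration (not something I insist on), the mov block above
could then look something like this:

4:      mov             v2.16b,  v0.16b         // initialize accumulators with the dithering value
        mov             v3.16b,  v0.16b
        mov             v4.16b,  v0.16b
        mov             v5.16b,  v0.16b
        mov             v6.16b,  v1.16b
        mov             v7.16b,  v1.16b
        mov             v16.16b, v1.16b
        mov             v17.16b, v1.16b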
> + mov w8, w1 // tmpfilterSize = filterSize
> + mov x9, x2 // srcp = src
> + mov x10, x0 // filterp = filter
> +
> +5: ldp x15, x14, [x9], #16 // get 2 pointers: src[j] and src[j+1]
> + ld1r {v18.8h}, [x10], #2 // read 1x16-bit coeff X at filter[j ] and duplicate across lanes
> + ldrh w12, [x10], #2 // read 1x16-bit coeff Y at filter[j+1]
Here there's a data dependency between the first load, which updates x10,
and the second one. Moving the first load from x10 to before the "ldp x15,
x14" would help on in-order cores.
> + dup v27.4h, w12 // and duplicate across low lanes
> + dup v27.8h, w12 // and duplicate across high lanes
Umm, what? A dup for v27.8h will fill the lower 4 elements as well. I
don't see why this needs to be different from the ld1r used for the
filter[j] coefficient.
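I.e. I'd expect a single ld1r to do the same job (untested):

        ld1r            {v27.8h}, [x10], #2     // filter[j+1], duplicated across all 8 lanes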
> +
> + add x15, x15, x7, lsl #1 // &src[j ][i]
> + ldp q19, q20, [x15] // read 16x16-bit @ src[j ][i + {0..15}]
> + ldp q23, q24, [x15, #32] // read 16x16-bit @ src[j ][i + {16..31}]
> +
> + add x14, x14, x7, lsl #1 // &src[j+1][i]
> + ldp q21, q22, [x14] // read 16x16-bit @ src[j+1][i + {0..15}]
> + ldp q25, q26, [x14, #32] // read 16x16-bit @ src[j+1][i + {16..31}]
Are the loads from x14/x15 guaranteed to be aligned here? (IIRC ldp does
require the addresses to be aligned.) If this were written using ld1
(like the existing original code), alignment wouldn't matter. That
disallows using the immediate offset form though, but an ld1 can load 64
bytes in one go anyway. And doing both adds first, followed by two ld1's,
avoids the data dependencies a bit.
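Roughly like this (untested - and as ld1 requires a consecutive register
list, the registers and the smlal operands below would need to be
renumbered to match):

        add             x15, x15, x7,  lsl #1   // &src[j  ][i]
        add             x14, x14, x7,  lsl #1   // &src[j+1][i]
        ld1             {v19.8h, v20.8h, v21.8h, v22.8h}, [x15] // src[j  ][i + {0..31}]
        ld1             {v23.8h, v24.8h, v25.8h, v26.8h}, [x14] // src[j+1][i + {0..31}]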
> +
> + smlal v17.4s, v19.4h, v18.4h // val0 += {A,B,C,D} * X
> + smlal v16.4s, v20.4h, v18.4h // ... (see comments in non unrolled code)
> + smlal v7.4s, v23.4h, v18.4h
> + smlal v6.4s, v24.4h, v18.4h
> +
> + smlal2 v2.4s, v19.8h, v18.8h
> + smlal2 v3.4s, v20.8h, v18.8h
> + smlal2 v4.4s, v23.8h, v18.8h
> + smlal2 v5.4s, v24.8h, v18.8h
> +
> + smlal v17.4s, v21.4h, v27.4h
> + smlal v16.4s, v22.4h, v27.4h
> + smlal v7.4s, v25.4h, v27.4h
> + smlal v6.4s, v26.4h, v27.4h
> +
> + smlal2 v2.4s, v21.8h, v27.8h
> + smlal2 v3.4s, v22.8h, v27.8h
> + smlal2 v4.4s, v25.8h, v27.8h
> + smlal2 v5.4s, v26.8h, v27.8h
> +
> + subs w8, w8, #2 // tmpfilterSize -= 2
I'd suggest moving the subs earlier, to avoid the dependency between the
subs and the b.gt below.
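E.g. something like this (untested), so that the flags are ready well
before the branch:

        subs            w8,  w8,  #2            // tmpfilterSize -= 2
        smlal2          v2.4s,  v21.8h, v27.8h
        smlal2          v3.4s,  v22.8h, v27.8h
        smlal2          v4.4s,  v25.8h, v27.8h
        smlal2          v5.4s,  v26.8h, v27.8h
        b.gt            5b                      // loop until filterSize consumed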
> + b.gt 5b // loop until filterSize consumed
> +
> + sqshrun v17.4h, v17.4s, #16 // clip16(val0>>16)
> + sqshrun v16.4h, v16.4s, #16 // ... (see comments in non unrolled code)
> + sqshrun v7.4h, v7.4s, #16
> + sqshrun v6.4h, v6.4s, #16
> + sqshrun2 v17.8h, v2.4s, #16
> + sqshrun2 v16.8h, v3.4s, #16
> + sqshrun2 v7.8h, v4.4s, #16
> + sqshrun2 v6.8h, v5.4s, #16
> + uqshrn v2.8b, v17.8h, #3
> + uqshrn v3.8b, v16.8h, #3
> + uqshrn v4.8b, v7.8h, #3
> + uqshrn v5.8b, v6.8h, #3
> + stp d2, d3, [x3], #16 // write to destination
> + stp d4, d5, [x3], #16 // write to destination
> + subs w4, w4, #32 // dstW -= 32
Maybe move the subs even further back here, between the uqshrn and the
stp?
And the stp could be converted into a single st1 here as well.
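I.e. something like this (untested):

        uqshrn          v2.8b,  v17.8h, #3
        uqshrn          v3.8b,  v16.8h, #3
        uqshrn          v4.8b,  v7.8h,  #3
        uqshrn          v5.8b,  v6.8h,  #3
        subs            w4,  w4,  #32           // dstW -= 32, ahead of the store
        st1             {v2.8b, v3.8b, v4.8b, v5.8b}, [x3], #32 // write 32 bytes to destination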
// Martin