[FFmpeg-devel] [PATCH 08/10] avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths

Thu Mar 31 19:47:41 EEST 2022

On 30/03/2022 15:14, Martin Storsjö wrote:
> On Fri, 25 Mar 2022, Ben Avison wrote:
>> +// Clamp 16-bit signed block coefficients to signed 8-bit (biased by 128)
>> +// On entry:
>> +//   x0 -> array of 64x 16-bit coefficients
>> +//   x1 -> 8-bit results
>> +//   x2 = row stride for results, bytes
>> +function ff_put_signed_pixels_clamped_neon, export=1
>> +        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
>> +        movi            v4.8b, #128
>> +        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0]
>> +        sqxtn           v0.8b, v0.8h
>> +        sqxtn           v1.8b, v1.8h
>> +        sqxtn           v2.8b, v2.8h
>> +        sqxtn           v3.8b, v3.8h
>> +        sqxtn           v5.8b, v16.8h
>> +        add             v0.8b, v0.8b, v4.8b
> 
> Here you could save 4 add instructions with sqxtn2 and adding .16b 
> vectors, but I'm not sure if it's worthwhile. (It reduces the checkasm 
> numbers by 0.7 for Cortex A72, by 0.3 for A73, but increases the runtime 
> by 1.0 on A53.) Strangely enough, I get much smaller numbers on my A72 
> than you got.

That's weird. As you say, it should be independent of clock-frequency. 
FWIW, I'm benchmarking on a Raspberry Pi 4; I'd assume all its board 
variants' Cortex-A72 cores are of identical revision.

Now that I've run it again, I'm getting these figures:

idctdsp.add_pixels_clamped_c: 313.3
idctdsp.add_pixels_clamped_neon: 24.3
idctdsp.put_pixels_clamped_c: 220.3
idctdsp.put_pixels_clamped_neon: 15.5
idctdsp.put_signed_pixels_clamped_c: 210.5
idctdsp.put_signed_pixels_clamped_neon: 19.5

which is more in line with what you see! I am getting a lot of 
variability between runs though - from a small sample, I'm seeing 
add_pixels_clamped_neon coming out as anything from 21 to 30, which is 
well above the sort of differences you're seeing between alternate 
implementations.

This sort of case is always going to be difficult to schedule optimally 
for multiple cores - factors like how much dual-issuing is possible, 
the latency before values can be used, load speed and the granularity of 
scoreboarding parts of vectors all vary widely.

In the case of the Cortex-A72, the critical path goes
ld1 of first 16 bytes -> sqxtn:  5 cycles
sqxtn -> add:                    4 cycles
add -> st1 of first 8 bytes:     3 cycles

It then bangs out one store per cycle, a total of 8. Everything else can 
largely be fitted in around this - so for example, other than I-cache 
usage, there shouldn't be a disadvantage to the adds being non-Q-form as 
they should dual-issue with the sqxtns and st1s - you'll notice I have 
them alternating.
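
As a back-of-envelope check, the latencies listed above can simply be added 
up. The C below is only a sketch of that arithmetic: the helper names are 
made up for illustration, the first ld1 is counted as issuing at cycle 0, 
and it assumes nothing else stalls the chain, which real hardware won't 
guarantee.

```c
#include <assert.h>

/* Latencies quoted above for the Cortex-A72 critical path, in cycles. */
enum {
    LD1_TO_SQXTN = 5,   /* ld1 of first 16 bytes -> sqxtn */
    SQXTN_TO_ADD = 4,   /* sqxtn -> add */
    ADD_TO_ST1   = 3,   /* add -> st1 of first 8 bytes */
    NUM_STORES   = 8    /* one st1 per 8-byte row */
};

/* Cycle at which the first st1 can issue (hypothetical helper),
 * taking the first ld1 to issue at cycle 0. */
static int first_store_issue_cycle(void)
{
    return LD1_TO_SQXTN + SQXTN_TO_ADD + ADD_TO_ST1;
}

/* Cycle at which the last store issues, given one store per cycle. */
static int last_store_issue_cycle(int first_store, int num_stores)
{
    assert(num_stores > 0);
    return first_store + num_stores - 1;
}
```

With the figures above, the first store issues at cycle 12 and the eighth 
at cycle 19, provided the other instructions really do fit into the gaps.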

I'd have expected anything interfering with this (such as by updating 
half the vector input required by any Q-form add) to slow things down.
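
For anyone following along without the patch to hand, here is a scalar C 
sketch of what the routine under discussion computes, going by the comment 
block quoted at the top: saturate each 16-bit coefficient to signed 8-bit, 
then bias by 128. The function names are hypothetical and this is not meant 
to be FFmpeg's own C implementation. It also shows why the D-form versus 
Q-form add question is purely one of scheduling: the bias is applied per 
byte, so the result doesn't depend on how many lanes are added at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper: saturate to -128..127, then add 128
 * so that the result lands in 0..255. */
static uint8_t clamp_and_bias(int16_t coeff)
{
    int v = coeff;
    if (v < -128) v = -128;   /* the saturation sqxtn performs per lane */
    if (v >  127) v =  127;
    return (uint8_t)(v + 128);
}

/* block:  64 coefficients, stored as 8 rows of 8
 * pixels: destination, 8 bytes written per row
 * stride: distance between destination rows, in bytes */
static void put_signed_pixels_clamped_ref(const int16_t *block,
                                          uint8_t *pixels,
                                          ptrdiff_t stride)
{
    assert(block != NULL && pixels != NULL);
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++)
            pixels[col] = clamp_and_bias(block[row * 8 + col]);
        pixels += stride;     /* step to the next output row */
    }
}
```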

Ben