[FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths

Thu Mar 31 18:15:26 EEST 2022

On 30/03/2022 13:35, Martin Storsjö wrote:
> Overall, the code looks sensible to me.
> 
> Would it make sense to share the core of the filter between the 
> horizontal/vertical cases with e.g. a macro? (I didn't check in detail 
> if there's much differences in the core of the filter. At most some 
> differences in condition registers for partial writeout in the 
> horizontal forms?)

Well, looking at the comments at the right-hand side of the source, 
which give the logical meaning of the results of each instruction, I 
admit there's a resemblance in the middle of the 8-pixel-pair function. 
However, the physical register assignments are quite different, and 
attempting to reassign the registers in one to match the other isn't a 
trivial task. It's hard enough when you start register assignment from 
the top of a function and work your way down, as I have done here.

In the 16-pixel-pair case, the fact that the input values arrive in a 
different order as the result of them, in one case, being loaded in 
regularly-increasing address order, and in the other, falling out of a 
matrix transposition, has resulted in even the logical order of 
instructions being quite different in the two cases.

In the 4-pixel-pair case, the values are packed differently into 
registers in the two cases, because in the v case, we're loading 4 
pixels between row-strides, which means it's easy to place each row in 
its own vector, whereas in the h case we load 4 rows of 8 pixels each 
and transpose, which leaves the values in 4 vectors rather than 8. Some 
of the filtering steps can be performed with the data packed in this way 
(calculating a1 and a2) while waiting for it to be restructured in order 
to calculate the other metrics, but it's not worth packing the data 
together in this way in the v case given that it starts off already 
separated. So the two implementations end up quite different in the 
operations they perform, not just the scheduling of instructions and in 
register assignment terms.

Some background: as you may have guessed, I didn't start out writing 
these functions as they currently appear. Prototype versions didn't care 
much for scheduling or keeping to a small number of registers. They were 
primarily for checking the correctness of the mathematics, and they'd 
use all available vectors, sometimes shuffling values between registers 
or to the stack to make room. Once I'd verified correctness, I then 
reworked them to keep to a minimal number of registers and to minimise 
stalls as far as possible.

I'm targeting the Cortex-A72, since that's what the Raspberry Pi 4 uses 
and it's on the cusp of having enough power to decode VC-1 BluRay 
streams, so I deliberately didn't take too much consideration of the 
requirements of earlier cores. Yes, it's an out-of-order core, but I 
reckoned there are probably limits to how wisely it can select 
instructions to execute (there have got to be limits to instruction 
queue lengths, for example). So based on the pipeline structure 
documented in Arm's Cortex-A72 software opimization guide, I arranged 
the instructions to best keep all pipelines busy as much as possible, 
then assigned registers to keep the instructions in this order.

For the most part, I was able to keep the number of vectors used low 
enough that no callee-saving was required - or failing that, at least 
avoiding having to spill values to the stack mid-function. But it came 
pretty close at times - witness for example the peculiar order in which 
vectors had to be loaded in the AArch32 version of 
ff_vc1_h_loop_filter16_neon. There's reason behind that!

In short, I'd really rather not tamper with these larger assembly 
functions any more unless I really have to.

Ben