[FFmpeg-devel] [PATCH] aarch64: h264pred: Optimize the inner loop of existing 8 bit functions

Mon Apr 12 16:58:15 EEST 2021

Apr 12, 2021, 10:07 by martin at martin.st:

> Move the loop counter decrement further from the branch instruction,
> this hides the latency of the decrement.
>
> In loops that first load, then store (the horizontal prediction cases),
> do the decrement after the load (where the next instruction would
> stall a bit anyway, waiting for the result of the load).
>
> In loops that store twice using the same destination register,
> also do the decrement between the two stores (as the second store
> would need to wait for the updated destination register from the
> first instruction).
>
> In loops that store twice to two different destination registers,
> do the decrement before both stores, to place it as far ahead of the
> branch as possible.
>
> This gives minor (1-2 cycle) speedups in most cases (modulo measurement
> noise), but the horizontal prediction functions get a rather notable
> speedup on the Cortex-A53.
>

LGTM
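
For readers skimming the archive, the load-then-store case described above can be sketched roughly as follows. This is an illustration only, not the code from the patch: the label name is hypothetical, and it assumes x0 holds the destination pointer, x1 the stride, and a 16-row block.

```
        // Hypothetical sketch, not the actual patch.
        // Assumes x0 = dst, x1 = stride, 16 rows of 16 pixels.
        .text
        .global pred16x16_hor_sketch
pred16x16_hor_sketch:
        sub     x2, x0, #1              // x2 points at the pixel left of the row
        mov     w3, #16                 // row counter
1:      ld1r    {v0.16b}, [x2], x1      // load the left pixel, replicate to 16 lanes
        subs    w3, w3, #1              // decrement while the load result is pending
        st1     {v0.16b}, [x0], x1      // store the row; st1 leaves the flags alone
        b.ne    1b                      // flags were set two instructions earlier
        ret
```

With the decrement placed directly before the branch instead, the branch would depend on an instruction issued immediately ahead of it, which is the latency the commit message says it hides on in-order cores.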