[FFmpeg-devel] [PATCH 1/2] swscale/aarch64: add hscale specializations

Wed Apr 20 11:44:24 EEST 2022

On Sun, 17 Apr 2022, Martin Storsjö wrote:

> On Fri, 15 Apr 2022, Swinney, Jonathan wrote:
>
>> This patch adds specializations for hscale for filterSize == 4 and 8 and
>> converts the existing implementation for the X8 version. For the old code,
>> now used for the X8 version, it improves the efficiency of the final
>> summations by reducing 11 instructions to 7.
>> 
>> ff_hscale8to15_8_neon is mostly unchanged from the original except for a
>> few changes.
>> - The loads for the filter data were consolidated into a single 64 byte ld1
>>   instruction.
>
> Couldn't you do this optimization on the existing function too?

Sorry, I now realize why this optimization can only be done when you
operate on a specific, known filter width.
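
To spell out the reasoning, here is a rough sketch in C. The helper names
are hypothetical. It assumes int16_t coefficients stored row-major (one row
of filterSize taps per output pixel) and four output pixels handled per
iteration. With filterSize == 8, the coefficients for four consecutive
outputs form one contiguous 64-byte run, which a single load can cover. With
a larger filterSize processed 8 taps at a time, the four chunks are
filterSize * 2 bytes apart and need separate loads.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of tap j of output pixel i in a
 * row-major int16_t filter table (filterSize taps per output pixel). */
static size_t tap_offset(int i, int j, int filterSize)
{
    return ((size_t)i * (size_t)filterSize + (size_t)j) * sizeof(int16_t);
}

/* Returns 1 if the first 8 taps of output pixels i..i+3 lie back to back
 * in memory, i.e. form one contiguous 4 * 8 * 2 = 64-byte run.
 * The "4 outputs per iteration" figure is an assumption of this sketch. */
static int four_rows_contiguous(int i, int filterSize)
{
    const size_t chunk = 8 * sizeof(int16_t);   /* 16 bytes per 8-tap chunk */
    const size_t start = tap_offset(i, 0, filterSize);

    for (int k = 0; k < 4; k++) {
        if (tap_offset(i + k, 0, filterSize) != start + (size_t)k * chunk)
            return 0;
    }
    return 1;
}
```

four_rows_contiguous(0, 8) holds, while four_rows_contiguous(0, 16) does
not, because the rows are then 32 bytes apart.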

>> - The final summations were improved.
>> - The inner loop on filterSize was completely removed
>
> I presume that this is the only differing factor which affects whether it's 
> worthwhile to keep a separate width=8 function or not. At least from the 
> checkasm benchmark numbers, the difference is notable but not huge (in the 
> range of 4-10%, while the summation improvements gain even more).
>
> Given a fully optimized function that has an inner loop (which is only taken 
> once for the width=8 case), is the separate function without an inner loop 
> really necessary?

With the ideal version of the final summation in both functions, the 
separate filterSize == 8 function is 11-19% faster than the generic 
multiple-of-8 function (on Cortex A53 and A72; on A73 both versions 
are essentially equally fast), so there's probably good reason to go with 
the separate version.
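
For illustration only, the structural difference between the two variants
can be sketched in scalar C. The function names are hypothetical and this is
not the NEON code from the patch. The shift by 7 and the clamp to 15 bits are
assumed to follow the scalar C reference in swscale. The generic version
keeps an inner loop over 8-tap chunks, which runs exactly once when
filterSize == 8. The specialized version drops that loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift down by 7 bits and saturate to 15 bits (assumed to match the
 * scalar reference behaviour). */
static int16_t clip15(int val)
{
    const int max15 = (1 << 15) - 1;
    val >>= 7;
    return (int16_t)(val < max15 ? val : max15);
}

/* Generic sketch: filterSize must be a multiple of 8. */
static void hscale_x8_sketch(int16_t *dst, int dstW, const uint8_t *src,
                             const int16_t *filter, const int32_t *filterPos,
                             int filterSize)
{
    for (int i = 0; i < dstW; i++) {
        const uint8_t *s = src + filterPos[i];
        const int16_t *f = filter + (size_t)i * (size_t)filterSize;
        int val = 0;

        /* Inner loop over 8-tap chunks; taken once when filterSize == 8. */
        for (int j = 0; j < filterSize; j += 8)
            for (int k = 0; k < 8; k++)
                val += s[j + k] * f[j + k];

        dst[i] = clip15(val);
    }
}

/* Specialized sketch for filterSize == 8: no loop over filterSize. */
static void hscale_8_sketch(int16_t *dst, int dstW, const uint8_t *src,
                            const int16_t *filter, const int32_t *filterPos)
{
    for (int i = 0; i < dstW; i++) {
        const uint8_t *s = src + filterPos[i];
        const int16_t *f = filter + (size_t)i * 8;

        int val = s[0] * f[0] + s[1] * f[1] + s[2] * f[2] + s[3] * f[3]
                + s[4] * f[4] + s[5] * f[5] + s[6] * f[6] + s[7] * f[7];

        dst[i] = clip15(val);
    }
}
```

Both sketches produce the same output for filterSize == 8. The specialized
one only saves the loop bookkeeping, which is the part the 11-19% figure
above would be attributed to in the real assembly.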

Thus, disregard the review comments above.

// Martin