[FFmpeg-devel] [PATCH 1/2] swscale/aarch64: add hscale specializations
Martin Storsjö
martin at martin.st
Wed Apr 20 11:44:24 EEST 2022
On Sun, 17 Apr 2022, Martin Storsjö wrote:
> On Fri, 15 Apr 2022, Swinney, Jonathan wrote:
>
>> This patch adds specializations for hscale for filterSize == 4 and 8 and
>> converts the existing implementation for the X8 version. For the old code,
>> now used for the X8 version, it improves the efficiency of the final
>> summations by reducing 11 instructions to 7.
>>
>> ff_hscale8to15_8_neon is mostly unchanged from the original except for a
>> few changes.
>> - The loads for the filter data were consolidated into a single 64 byte ld1
>> instruction.
>
> Couldn't you do this optimization on the existing function too?
Sorry, I now realize why this optimization can only be done when
operating on a specific, known filter width.
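One plausible way to see this, as a sketch rather than a description of the actual assembly: assume the int16_t coefficients are laid out row-major as filter[filterSize * i + j], and assume the function handles four output pixels per iteration. The helper names below are hypothetical. The first eight coefficients of four consecutive outputs then form one contiguous 64-byte block only when filterSize is exactly 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset of the first coefficient for output pixel i, assuming the
 * row-major layout filter[filterSize * i + j] of int16_t coefficients.
 * (Hypothetical helper, for illustration only.) */
static size_t coeff_row_offset(int i, int filterSize)
{
    return (size_t)i * (size_t)filterSize * sizeof(int16_t);
}

/* Returns 1 if the first 8 coefficients of the 4 consecutive output pixels
 * i..i+3 form one contiguous 64-byte block, else 0. */
static int rows_are_contiguous_64(int i, int filterSize)
{
    const size_t chunk = 8 * sizeof(int16_t); /* 16 bytes per 8-tap chunk */

    for (int k = 1; k < 4; k++) {
        if (coeff_row_offset(i + k, filterSize) !=
            coeff_row_offset(i, filterSize) + (size_t)k * chunk)
            return 0;
    }
    return 1;
}
```

With filterSize == 8 the check succeeds for any i; with filterSize == 16 each row is 32 bytes apart, so a single 64-byte load would pick up taps 8..15 of the first pixel instead of the next pixel's taps 0..7.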
>> - The final summations were improved.
>> - The inner loop on filterSize was completely removed
>
> I presume that this is the only differing factor which affects whether it's
> worthwhile to keep a separate width=8 function or not. At least from the
> checkasm benchmark numbers, the difference is notable but not huge (in the
> range of 4-10%, while the summation improvements gain even more).
>
> Given a fully optimized function that has an inner loop (which is only taken
> once for the width=8 case), is the separate function without an inner loop
> really necessary?
With the ideal version of the final summation in both functions, the
separate filterSize=8 function is 11-19% faster than the generic
multiple-of-8 function on Cortex A53 and A72 (on A73 both versions
are essentially equally fast), so there's probably good reason to go with
the separate version.
Thus, disregard the review comments above.
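For readers following along, the structural difference between the two variants can be sketched in scalar C. This is only a model under assumptions, not the NEON code from the patch: the function names are hypothetical, and the arithmetic (accumulate, shift right by 7, clamp to 15 bits) follows the usual C reference for 8-bit to 15-bit horizontal scaling.

```c
#include <stdint.h>

/* Hypothetical scalar model of the generic function: an inner loop runs
 * over filterSize taps (a multiple of 8 in the X8 variant). */
static void hscale8to15_generic(int16_t *dst, int dstW, const uint8_t *src,
                                const int16_t *filter,
                                const int32_t *filterPos, int filterSize)
{
    for (int i = 0; i < dstW; i++) {
        int val = 0;
        for (int j = 0; j < filterSize; j++)
            val += (int)src[filterPos[i] + j] * filter[filterSize * i + j];
        val >>= 7;
        dst[i] = (int16_t)(val > 32767 ? 32767 : val); /* clamp to 15 bits */
    }
}

/* Hypothetical scalar model of the filterSize == 8 specialization: the
 * inner loop is gone, and the coefficient row stride is the constant 8. */
static void hscale8to15_fs8(int16_t *dst, int dstW, const uint8_t *src,
                            const int16_t *filter, const int32_t *filterPos)
{
    for (int i = 0; i < dstW; i++) {
        const uint8_t *s = src + filterPos[i];
        const int16_t *f = filter + 8 * i;
        int val = s[0] * f[0] + s[1] * f[1] + s[2] * f[2] + s[3] * f[3] +
                  s[4] * f[4] + s[5] * f[5] + s[6] * f[6] + s[7] * f[7];
        val >>= 7;
        dst[i] = (int16_t)(val > 32767 ? 32767 : val); /* clamp to 15 bits */
    }
}
```

Both functions compute the same result when filterSize is 8; the specialized one only drops the loop counter and branch, which is the part measured at 11-19% on A53 and A72 above.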
// Martin