[FFmpeg-devel] [PATCH 3/3] lavc/aarch64: hevc_add_res add 12bit variants

Tue Aug 9 14:21:58 EEST 2022

On Tue, 9 Aug 2022, Martin Storsjö wrote:

> On Thu, 23 Jun 2022, J. Dekker wrote:
>
>> hevc_add_res_4x4_12_c: 46.0
>> hevc_add_res_4x4_12_neon: 18.7
>> hevc_add_res_8x8_12_c: 194.7
>> hevc_add_res_8x8_12_neon: 25.2
>> hevc_add_res_16x16_12_c: 716.0
>> hevc_add_res_16x16_12_neon: 69.7
>> hevc_add_res_32x32_12_c: 3820.7
>> hevc_add_res_32x32_12_neon: 261.0
>> 
>> Signed-off-by: J. Dekker <jdek at itanimul.li>
>> ---
>> libavcodec/aarch64/hevcdsp_idct_neon.S    | 148 ++++++++++++----------
>> libavcodec/aarch64/hevcdsp_init_aarch64.c |  34 ++---
>> 2 files changed, 97 insertions(+), 85 deletions(-)
>
> LGTM. The patch is a bit hard to inspect thoroughly (to see exactly how 
> little has changed) due to the functions being moved around at the same time 
> as they're modified, but I checked and the changes do look fine.
>
> By splitting things up in individual macros for each function, (e.g. 
> add_res_4x4, add_res_8x8 etc, then add_res setting the mask and calling the 
> others) you could keep the code in place and make the diff even easier to 
> read, but it's not strictly necessary.

Actually, I do want you to make a change here.

The only thing that differs between the 10 and 12 bit versions is what 
the mask register is initialized to. It's a waste of space to produce two 
near-identical versions of everything.

Instead I'd suggest making just two frontend functions, which set the 
mask register and then call the (non-exported) 16 bit generic function. 
Also, have a look at e.g. vp9mc_16bpp_neon.S, where we have something 
similar:

.macro do_8tap_v_func type, filter, offset, size, bpp
function ff_vp9_\type\()_\filter\()\size\()_v_\bpp\()_neon, export=1
         uxtw            x4,  w4
         mvni            v1.8h, #((0xff << (\bpp - 8)) & 0xff), lsl #8
         movrel          x5,  X(ff_vp9_subpel_filters), 256*\offset
         add             x6,  x5,  w6, uxtw #4
         mov             x5,  #\size
.if \size >= 8
         b               \type\()_8tap_8v
...

For your case, you don't need anything other than the mvni instruction 
and then a branch to the actual implementation.
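
To illustrate the idea, here is a rough C model of it (a sketch only, not 
the actual NEON code). It assumes the mask register holds the maximum pixel 
value used for clipping, which is what the mvni expression above evaluates 
to per 16 bit lane. The names mvni_h_lsl8, pixel_max_mask, add_res_16 and 
add_res_4x4_10/12 are made up for this example, and the stride is counted 
in elements rather than bytes to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-lane result of `mvni vN.8h, #imm8, lsl #8`:
 * the 8-bit immediate is shifted left by 8, then bitwise inverted. */
static uint16_t mvni_h_lsl8(uint8_t imm8)
{
    return (uint16_t)~((unsigned)imm8 << 8);
}

/* Same immediate expression as in the vp9 macro:
 * (0xff << (bpp - 8)) & 0xff, which yields (1 << bpp) - 1 after mvni. */
static uint16_t pixel_max_mask(int bpp)
{
    assert(bpp >= 8 && bpp <= 16);
    return mvni_h_lsl8((uint8_t)((0xffu << (bpp - 8)) & 0xffu));
}

/* Hypothetical shared (non-exported) worker: add the residual to the
 * destination and clip to [0, max]. Only `max` depends on the bit depth. */
static void add_res_16(uint16_t *dst, const int16_t *res,
                       ptrdiff_t stride, int size, uint16_t max)
{
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++) {
            int v = dst[x] + res[x];
            dst[x] = (uint16_t)(v < 0 ? 0 : v > max ? max : v);
        }
        dst += stride; /* stride in elements, an assumption of this sketch */
        res += size;
    }
}

/* Thin frontends: set the mask, then hand over to the shared worker. */
static void add_res_4x4_10(uint16_t *dst, const int16_t *res, ptrdiff_t stride)
{
    add_res_16(dst, res, stride, 4, pixel_max_mask(10));
}

static void add_res_4x4_12(uint16_t *dst, const int16_t *res, ptrdiff_t stride)
{
    add_res_16(dst, res, stride, 4, pixel_max_mask(12));
}
```

With bpp = 10 the immediate is 0xfc, so each lane becomes ~0xfc00 = 0x03ff; 
with bpp = 12 it is 0xf0, giving ~0xf000 = 0x0fff. In the assembly, each 
frontend would then be just the mvni plus a branch, so the bulk of the code 
exists only once.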

// Martin