[FFmpeg-devel] [PATCH] libavcodec/hevcdsp: port SIMD idct functions from 32-bit.

Sat Jan 16 22:27:37 EET 2021

On Sat, 16 Jan 2021, Reimar Döffinger wrote:

>> On 15 Jan 2021, at 23:55, Martin Storsjö <martin at martin.st> wrote:
>> 
>> On Tue, 12 Jan 2021, Reimar.Doeffinger at gmx.de wrote:
>> 
>>> create mode 100644 libavcodec/aarch64/hevcdsp_idct_neon.S
>>> create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c
>> 
>> This patch fails checkasm
>
> Fixed, one mis-translated coefficient index...

Ok, good.

>>> 
>>> +.macro fixsqrshrn d, dt, n, m
>>> +  .ifc \dt, .8H
>>> +        sqrshrn2        \d\dt, \n\().4S, \m
>>> +  .else
>>> +        sqrshrn         \n\().4H, \n\().4S, \m
>>> +        mov             \d\().D[0], \n\().D[0]
>>> +  .endif
>>> +.endm
>> 
>> Is this to set the lower half of the dest register, without wiping the upper part? It looks a bit clumsy (and stall-prone on in-order cores), but I guess it's not easy to do things differently without rewriting the higher level structure?
>
> Did not have a good idea, but I was also aiming for keeping the structure of the 32-bit and 64-bit code similar.
> In particular since I did not know about checkasm and expected some further painful debug.
>
>> 
>>> +.macro transpose_8x8 r0, r1, r2, r3, r4, r5, r6, r7
>>> +        transpose8_4x4  \r0, \r1, \r2, \r3
>>> +        transpose8_4x4  \r4, \r5, \r6, \r7
>>> +.endm
>>> +
>> 
>> There's a bunch of existing transposes in libavcodec/aarch64/neon.S - can they be used? Not that they're rocket science, though... And I see that the existing arm version also has got its own transpose macros.
>> 
>> If it's inconvenient to use shared macros, this is fine.
>
> They are different and seem to not be documented, so it would take some
> time to figure out how to replace them.
> There’s also a bit of a question if I’d want to give up alignment
> with the 32-bit code just yet.

Sure - one doesn't have to give up on matching the overall structure, but 
it doesn't necessarily need to stay 1:1 on the individual instruction 
level, if things are done more efficiently in one way or another on the 
other architecture. Not saying that these are more or less efficient; a 
transpose usually is the same anyway, but the aarch64 macros take a number 
of more temp regs as arguments.

>> 
>>> +        // Transpose each 4x4 block, and swap how d4-d7 and d8-d11 are used.
>>> +        // Layout before:
>>> +        // d0  d1
>>> +        // d2  d3
>>> +        // d4  d5
>>> +        // d6  d7
>>> +        // d8  d9
>>> +        // d10 d11
>>> +        // d12 d13
>>> +        // d14 d15
>> 
>> These layouts don't look like they're up to date for the aarch64 version?
>
> Removed in new version as it seems not that useful.
>
>>> 
>>> +        vst1.s32        {\in7}, [r3, :128]
>>> +.endm
>> 
>> This is left behind untranslated (and thus unused)
>
> Removed. I am not sure if that means it’s also unused in the 32-bit version.

Possibly - a patch for removing that, if it is, would be good too.

>>> +
>>> +        movrel          x1, trans + 16
>> 
>> The movrel macro has got a separate third optional argument for the offset, so write this "movrel x1, trans, 16". (Older versions of llvm are picky with the symbol offset syntax and break with this, as the macro right now adds its own implicit +(0) here. If you pass the offset in the macro parameter, all the offsets get passed within the parentheses.
>
> Changed.
>
>>> +.macro idct_16x16 bitdepth
>>> +function ff_hevc_idct_16x16_\bitdepth\()_neon, export=1
>>> +//r0 - coeffs
>>> +        mov             x15, lr
>>> +
>> 
>> Binutils doesn't recognize "lr" as alias for x30
>
> It didn’t have an issue in the Debian unstable VM?
> That seems like the kind of workaround where it would be
> better to leave a comment with more info, if you know
> what exactly is affected.

Binutils 2.28 doesn't recognize "lr" while 2.30 does, it seems.

FWIW, all the existing aarch64 assembly just uses "x30" to refer to this 
register, none of it uses "lr".

// Martin