[FFmpeg-devel] [PATCH 2/2] swscale/aarch64: Add bgra/rgba to yuv

Rémi Denis-Courmont remi at remlab.net
Thu Jun 20 19:25:53 EEST 2024



On 20 June 2024 18:02:31 GMT+02:00, Zhao Zhili <quinkblack at foxmail.com> wrote:
>
>
>> On Jun 20, 2024, at 20:49, Martin Storsjö <martin at martin.st> wrote:
>> 
>> On Thu, 20 Jun 2024, Zhao Zhili wrote:
>> 
>>>> On Jun 19, 2024, at 20:05, Rémi Denis-Courmont <remi at remlab.net> wrote:
>>>> On 19 June 2024 11:24:28 GMT+02:00, Zhao Zhili <quinkblack at foxmail.com> wrote:
>>>>>> On Jun 19, 2024, at 15:07, Rémi Denis-Courmont <remi at remlab.net> wrote:
>>>>>> On 15 June 2024 11:57:18 GMT+02:00, Zhao Zhili <quinkblack at foxmail.com> wrote:
>>>>>>> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
>>>>>>> index 2b956fe5c2..37f1158504 100644
>>>>>>> --- a/libswscale/aarch64/input.S
>>>>>>> +++ b/libswscale/aarch64/input.S
>>>>>>> @@ -20,8 +20,12 @@
>>>>>>> #include "libavutil/aarch64/asm.S"
>>>>>>> -.macro rgb_to_yuv_load_rgb src
>>>>>>> +.macro rgb_to_yuv_load_rgb src, element=3
>>>>>>> +    .if \element == 3
>>>>>>>      ld3             { v16.16b, v17.16b, v18.16b }, [\src]
>>>>>>> +    .else
>>>>>>> +        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [\src]
>>>>>>> +    .endif
>>>>>>>      uxtl            v19.8h, v16.8b             // v19: r
>>>>>>>      uxtl            v20.8h, v17.8b             // v20: g
>>>>>>>      uxtl            v21.8h, v18.8b             // v21: b
>>>>>>> @@ -43,7 +47,7 @@
>>>>>>>      sqshrn2         \dst\().8h, \dst2\().4s, \right_shift   // dst_higher_half = dst2 >> right_shift
>>>>>>> .endm
>>>>>>> -.macro rgbToY bgr
>>>>>>> +.macro rgbToY bgr, element=3
>>>>>> AFAICT, you don't need a macro parameter for component order. Just swap the red and blue coefficients in the prologue and then run the bit-exact same loops for bgr/rgb, rgba/bgra and argb/abgr. This adds one branch in the prologue, but that is mostly negligible compared to the loop.
>>>>> I’m not sure where to add the branch. Could you elaborate? Do you mean loading the coefficients first, as in the following?
>>>>> function ff_bgr24ToUV_half_neon, export=1
>>>>>      ldr             w12, [x6, #12]
>>>>>      ldr             w11, [x6, #16]
>>>>>      ldr             w10, [x6, #20]
>>>>>      ldr             w15, [x6, #24]
>>>>>      ldr             w14, [x6, #28]
>>>>>      ldr             w13, [x6, #32]
>>>>>      rgbToUV_half
>>>>> endfunc
>>>> Hmm, no. You need to jump past the loading of red and blue coefficients. It might help to load green coefficients last.
>>>> By the way, I think you can use LDP instead of LDR.
>>> 
>>> Patch v2 replaces LDR with LDP, so the "jump past the loading of red and blue coefficients" suggestion no longer applies.
>> 
>> Rémi's point is that you don't need to duplicate the whole function, when the only thing you're changing is a couple of instructions in the prologue of the function. By reusing the actual bulk of the function, you save on binary size.
>
>Thank you for the detailed examples. I missed that the key point here is to save binary size.
>
>I have seen a similar example of fall-through in riscv/input_rvv.S. Is it well defined to jump to a local label in another function?

Falling through is well defined so long as we don't use function-sections. Jumping to a label inside another function is well defined, as the assembler has no notion of what a function is.

`func` and `endfunc` are just FFmpeg macros for defining symbols.
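
To illustrate the point, here is a minimal sketch, not the code from the patch. The symbol names, the register assignment and the `{ red, green, blue }` coefficient layout are all hypothetical, the loop is scalar rather than NEON, and it uses plain GNU as directives instead of FFmpeg's macros. One entry point branches to a numeric local label that sits after the other entry point's symbol; the other entry point simply runs on into the shared loop.

```asm
        .text

        // Hypothetical arguments for both entry points:
        //   x0 = dst (int32_t *), x1 = src (3 bytes per pixel),
        //   w2 = width in pixels (assumed > 0),
        //   x3 = int32_t coeffs[3] = { red, green, blue }

        .globl  example_rgb_to_y
        .type   example_rgb_to_y, %function
example_rgb_to_y:
        ldr     w10, [x3]               // first byte is red
        ldr     w12, [x3, #8]           // third byte is blue
        b       1f                      // target lies after the next symbol

        .globl  example_bgr_to_y
        .type   example_bgr_to_y, %function
example_bgr_to_y:
        ldr     w10, [x3, #8]           // first byte is blue
        ldr     w12, [x3]               // third byte is red
        // no branch: execution continues into the shared code
1:
        ldr     w11, [x3, #4]           // green, loaded last, same for both
2:
        ldrb    w4, [x1]                // first component
        ldrb    w5, [x1, #1]            // second component (green)
        ldrb    w6, [x1, #2]            // third component
        add     x1, x1, #3
        mul     w4, w4, w10
        madd    w4, w5, w11, w4
        madd    w4, w6, w12, w4
        str     w4, [x0], #4            // raw weighted sum, no rounding or shift
        subs    w2, w2, #1
        b.gt    2b
        ret
```

Both symbols are just addresses in the same section, so the assembler resolves `1f` without caring which "function" it belongs to. If each symbol were placed in its own section, the linker would be free to reorder or discard them, and this layout would no longer be safe.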