[FFmpeg-devel] [PATCH 2/2] swscale/aarch64: Add bgra/rgba to yuv
Rémi Denis-Courmont
remi at remlab.net
Thu Jun 20 19:25:53 EEST 2024
On 20 June 2024 18:02:31 GMT+02:00, Zhao Zhili <quinkblack at foxmail.com> wrote:
>
>
>> On Jun 20, 2024, at 20:49, Martin Storsjö <martin at martin.st> wrote:
>>
>> On Thu, 20 Jun 2024, Zhao Zhili wrote:
>>
>>>> On Jun 19, 2024, at 20:05, Rémi Denis-Courmont <remi at remlab.net> wrote:
>>>> On 19 June 2024 11:24:28 GMT+02:00, Zhao Zhili <quinkblack at foxmail.com> wrote:
>>>>>> On Jun 19, 2024, at 15:07, Rémi Denis-Courmont <remi at remlab.net> wrote:
>>>>>> On 15 June 2024 11:57:18 GMT+02:00, Zhao Zhili <quinkblack at foxmail.com> wrote:
>>>>>>> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
>>>>>>> index 2b956fe5c2..37f1158504 100644
>>>>>>> --- a/libswscale/aarch64/input.S
>>>>>>> +++ b/libswscale/aarch64/input.S
>>>>>>> @@ -20,8 +20,12 @@
>>>>>>> #include "libavutil/aarch64/asm.S"
>>>>>>> -.macro rgb_to_yuv_load_rgb src
>>>>>>> +.macro rgb_to_yuv_load_rgb src, element=3
>>>>>>> + .if \element == 3
>>>>>>> ld3 { v16.16b, v17.16b, v18.16b }, [\src]
>>>>>>> + .else
>>>>>>> + ld4 { v16.16b, v17.16b, v18.16b, v19.16b }, [\src]
>>>>>>> + .endif
>>>>>>> uxtl v19.8h, v16.8b // v19: r
>>>>>>> uxtl v20.8h, v17.8b // v20: g
>>>>>>> uxtl v21.8h, v18.8b // v21: b
>>>>>>> @@ -43,7 +47,7 @@
>>>>>>> sqshrn2 \dst\().8h, \dst2\().4s, \right_shift // dst_higher_half = dst2 >> right_shift
>>>>>>> .endm
>>>>>>> -.macro rgbToY bgr
>>>>>>> +.macro rgbToY bgr, element=3
>>>>>> AFAICT, you don't need a macro parameter for component order. Just swap the red and blue coefficients in the prologue and then run the bit-exact same loops for bgr/rgb, rgba/bgra and argb/abgr. This adds one branch in the prologue, but that's mostly negligible compared to the loop.
>>>>> I’m not sure where to add the branch. Could you elaborate? Do you mean loading the coefficients first, like the following?
>>>>> function ff_bgr24ToUV_half_neon, export=1
>>>>> ldr w12, [x6, #12]
>>>>> ldr w11, [x6, #16]
>>>>> ldr w10, [x6, #20]
>>>>> ldr w15, [x6, #24]
>>>>> ldr w14, [x6, #28]
>>>>> ldr w13, [x6, #32]
>>>>> rgbToUV_half
>>>>> endfunc
>>>> Hmm, no. You need to jump past the loading of red and blue coefficients. It might help to load green coefficients last.
>>>> By the way, I think you can use LDP instead of LDR.
>>>
>>> Patch v2 replaces LDR with LDP, so the "jump past the loading of red and blue coefficients" suggestion no longer applies.
>>
>> Rémi's point is that you don't need to duplicate the whole function, when the only thing you're changing is a couple of instructions in the prologue of the function. By reusing the actual bulk of the function, you save on binary size.
>
>Thank you for the detailed examples. I missed that the key point here is to save binary size.
>
>I have seen a similar example of fall-through in riscv/input_rvv.S. Is it well defined to jump to a local label in another function?
Falling through is well defined so long as we don't use function-sections. Jumping to a label inside another function is well defined, as the assembler has no notion of what a function is.
`func` and `endfunc` are just FFmpeg macros for defining symbols.
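
For illustration only, here is a rough C sketch of the coefficient swap discussed earlier in the thread. It is not the patch and not libswscale's code: `DemoCoeffs`, `demo_to_y`, `demo_rgb_to_y` and `demo_bgr_to_y` are hypothetical names, and the 8-bit fixed-point rounding is an assumption made to keep the example small. The point is only that the loop weights bytes 0, 1 and 2 of each pixel without knowing which one is red, so the entry points differ by nothing more than the order in which they hand over the coefficients.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical coefficient layout for this sketch: red, green, blue. */
typedef struct {
    int32_t r, g, b;
} DemoCoeffs;

/* Shared loop: weights bytes 0, 1 and 2 of every pixel.
 * It does not know whether byte 0 is red or blue.
 * stride is 3 for packed 24-bit pixels and 4 when a trailing
 * alpha byte is present (that byte is simply skipped). */
static void demo_to_y(uint8_t *dst, const uint8_t *src, size_t width,
                      size_t stride, int32_t c0, int32_t c1, int32_t c2)
{
    for (size_t i = 0; i < width; i++) {
        const uint8_t *p = src + i * stride;
        /* Assumed 8-bit fixed point with rounding; coefficients sum to 256. */
        dst[i] = (uint8_t)((c0 * p[0] + c1 * p[1] + c2 * p[2] + 128) >> 8);
    }
}

/* "Prologue" for red-first layouts: pass the coefficients in order. */
void demo_rgb_to_y(uint8_t *dst, const uint8_t *src, size_t width,
                   size_t stride, const DemoCoeffs *k)
{
    demo_to_y(dst, src, width, stride, k->r, k->g, k->b);
}

/* "Prologue" for blue-first layouts: swap red and blue, same loop. */
void demo_bgr_to_y(uint8_t *dst, const uint8_t *src, size_t width,
                   size_t stride, const DemoCoeffs *k)
{
    demo_to_y(dst, src, width, stride, k->b, k->g, k->r);
}
```

In assembly the shared loop would not be a call: the second entry point would fall through into it, or branch to a local label inside it, which is what keeps the binary small.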