[FFmpeg-devel] [PATCH 0/2] swscale: rgb_to_yuv neon optimizations

Sat May 31 12:11:43 EEST 2025

This patch set version applies the last comment on the review by Martin
StorsjÃ¶ to avoid changing the register used for asr, and also for the
context historically appliesd comments:

> This is an unrelated change
Fixed and resolved

> The patch adds trailing whitespace here and in many other places; make sure you don't do that. (It is visible by doing e.g. "git show".)
Verified and made sure that there are NO trailing whitespace even though
it would be much more pleasant to work with this thing if there would be
some autofrmatting tooling.

> Also, as a general rule, indent instructions within macros in the same way as elsewhere.
Code in macro was reindented

> If you with "non-performant mobile" mean small in-order cores, most of them can handle repeated accumulation like these even faster, if you sequence these so that all accumulations to one register is sequentially. E.g. first all "smlal \u_dst1\().4s", followed by all "smlal \u_dst2\().4s", followed by \v_dst1, followed by \v_dst2. It's worth benchmarking if you do have access to such cores (e.g. Cortex-A53/A55; perhaps that's also the case on the Cortex-R you mentioned in the commit message).

I mean generally mobile first CPUs like Arm Cortex-R and even A-53/73. But also I 
just verified even on macbook pro interleaving instruction per the component does not enable IPL (tested by disabling all the possilbe features on the SCTLR and didn't seem much difference on the code) 
but having a "hot-register" for coefficints being multipled several times in parallel per component
actually makes a difference a difference. Here is checask results from macbook w/ my and
interleaved by r/g/b component version

current version

bgra_to_uv_128_neon:                                    28.9 ( 1.14x)
bgra_to_uv_1080_c:                                     250.0 ( 1.00x)
bgra_to_uv_1080_neon:                                  180.8 ( 1.38x)
bgra_to_uv_1920_c:                                     445.7 ( 1.00x)
bgra_to_uv_1920_neon:                                  330.4 ( 1.35x)
bgra_to_uv_half_8_c:                                     7.9 ( 1.00x)
bgra_to_uv_half_8_neon:                                  4.9 ( 1.62x)
bgra_to_uv_half_128_c:                                  26.3 ( 1.00x)
bgra_to_uv_half_128_neon:                               15.8 ( 1.67x)
bgra_to_uv_half_1080_c:                                192.6 ( 1.00x)
bgra_to_uv_half_1080_neon:                             105.9 ( 1.82x)
bgra_to_uv_half_1920_c:                                335.3 ( 1.00x)
bgra_to_uv_half_1920_neon:                             209.2 ( 1.60x)

interleaved by r/g/b component version

bgra_to_uv_128_neon:                                    25.0 ( 1.30x)
bgra_to_uv_1080_c:                                     255.6 ( 1.00x)
bgra_to_uv_1080_neon:                                  182.1 ( 1.40x)
bgra_to_uv_1920_c:                                     448.6 ( 1.00x)
bgra_to_uv_1920_neon:                                  327.0 ( 1.37x)
bgra_to_uv_half_8_c:                                     8.1 ( 1.00x)
bgra_to_uv_half_8_neon:                                  5.4 ( 1.49x)
bgra_to_uv_half_128_c:                                  26.5 ( 1.00x)
bgra_to_uv_half_128_neon:                               15.6 ( 1.70x)
bgra_to_uv_half_1080_c:                                197.8 ( 1.00x)
bgra_to_uv_half_1080_neon:                             111.1 ( 1.78x)
bgra_to_uv_half_1920_c:                                354.3 ( 1.00x)
bgra_to_uv_half_1920_neon:                             233.6 ( 1.52x)

> The code here, and below, is exactly the same as before, except for the postincrement on the load (plus the prefetch). Can we add the postincrement to the macro rather than unmacroing the code?

done

> Instead of adding unusual indentation of the instructions here, you could use 2 instead of 4 spaces for the .if directives, to keep the vertical alignment of the instructions.

Obselete becuase per the previous comment I revert my changes and added
post increment logic into the macro

> Does this make any practical difference, as we're just storing the lower 32 bits anyway?

Not really but I found it quite confusing at first becuase it looks like
this instruction will imply narrowing, but looking into the w13 / w13 is
much more clear what is going on.

**Prefetching**

I removed all the prefetched and factored it out into a separate patch.

> Also, with that fixed, this fails to properly back up and restore registers v8-v15; checkasm doesn't notice this on macOS, but on Linux and windows, checkasm has a call wrapper which does detect such issues.

I managed to rewrite the function to avoid using any callee saved
registers. The only register I keep using is v7 which is not callee saved.

Dmitriy Kovalenko (2):
  swscale: rgb_to_yuv neon optimizations
  swscale: Neon rgb_to_yuv_half process 32 pixels at a time

 libswscale/aarch64/input.S | 195 +++++++++++++++++++++++++++----------
 1 file changed, 143 insertions(+), 52 deletions(-)

-- 
2.49.0