[FFmpeg-devel] [PATCH] swscale/x86: add avx2 {lum, chr}ConvertRange

Ramiro Polla ramiro.polla at gmail.com
Fri Jun 7 20:43:10 EEST 2024


On Fri, Jun 7, 2024 at 7:38 PM Ramiro Polla <ramiro.polla at gmail.com> wrote:
>
> chrRangeFromJpeg_8_c: 49.4
> chrRangeFromJpeg_8_sse4: 15.9
> chrRangeFromJpeg_8_avx2: 22.6
> chrRangeFromJpeg_24_c: 129.4
> chrRangeFromJpeg_24_sse4: 32.1
> chrRangeFromJpeg_24_avx2: 35.1
> chrRangeFromJpeg_128_c: 534.6
> chrRangeFromJpeg_128_sse4: 165.6
> chrRangeFromJpeg_128_avx2: 100.4
> chrRangeFromJpeg_144_c: 735.6
> chrRangeFromJpeg_144_sse4: 185.1
> chrRangeFromJpeg_144_avx2: 109.4
> chrRangeFromJpeg_256_c: 634.6
> chrRangeFromJpeg_256_sse4: 323.6
> chrRangeFromJpeg_256_avx2: 192.6
> chrRangeFromJpeg_512_c: 1242.4
> chrRangeFromJpeg_512_sse4: 662.1
> chrRangeFromJpeg_512_avx2: 409.1
> chrRangeToJpeg_8_c: 39.6
> chrRangeToJpeg_8_sse4: 15.9
> chrRangeToJpeg_8_avx2: 25.4
> chrRangeToJpeg_24_c: 118.9
> chrRangeToJpeg_24_sse4: 32.9
> chrRangeToJpeg_24_avx2: 30.1
> chrRangeToJpeg_128_c: 636.9
> chrRangeToJpeg_128_sse4: 164.4
> chrRangeToJpeg_128_avx2: 96.6
> chrRangeToJpeg_144_c: 716.4
> chrRangeToJpeg_144_sse4: 187.1
> chrRangeToJpeg_144_avx2: 109.4
> chrRangeToJpeg_256_c: 1258.6
> chrRangeToJpeg_256_sse4: 326.1
> chrRangeToJpeg_256_avx2: 193.9
> chrRangeToJpeg_512_c: 2489.4
> chrRangeToJpeg_512_sse4: 662.1
> chrRangeToJpeg_512_avx2: 382.4
> lumRangeFromJpeg_8_c: 13.6
> lumRangeFromJpeg_8_sse4: 14.4
> lumRangeFromJpeg_8_avx2: 19.6
> lumRangeFromJpeg_24_c: 38.9
> lumRangeFromJpeg_24_sse4: 18.9
> lumRangeFromJpeg_24_avx2: 23.9
> lumRangeFromJpeg_128_c: 239.4
> lumRangeFromJpeg_128_sse4: 81.9
> lumRangeFromJpeg_128_avx2: 51.6
> lumRangeFromJpeg_144_c: 285.1
> lumRangeFromJpeg_144_sse4: 92.1
> lumRangeFromJpeg_144_avx2: 59.6
> lumRangeFromJpeg_256_c: 857.1
> lumRangeFromJpeg_256_sse4: 164.4
> lumRangeFromJpeg_256_avx2: 101.9
> lumRangeFromJpeg_512_c: 1028.6
> lumRangeFromJpeg_512_sse4: 335.6
> lumRangeFromJpeg_512_avx2: 201.4
> lumRangeToJpeg_8_c: 20.4
> lumRangeToJpeg_8_sse4: 14.4
> lumRangeToJpeg_8_avx2: 20.4
> lumRangeToJpeg_24_c: 58.1
> lumRangeToJpeg_24_sse4: 18.9
> lumRangeToJpeg_24_avx2: 22.6
> lumRangeToJpeg_128_c: 327.4
> lumRangeToJpeg_128_sse4: 83.4
> lumRangeToJpeg_128_avx2: 53.6
> lumRangeToJpeg_144_c: 375.6
> lumRangeToJpeg_144_sse4: 93.9
> lumRangeToJpeg_144_avx2: 58.9
> lumRangeToJpeg_256_c: 649.6
> lumRangeToJpeg_256_sse4: 162.1
> lumRangeToJpeg_256_avx2: 101.9
> lumRangeToJpeg_512_c: 1289.1
> lumRangeToJpeg_512_sse4: 335.6
> lumRangeToJpeg_512_avx2: 201.4
> ---
>  libswscale/x86/range_convert.asm | 46 ++++++++++++++++++++++++++------
>  libswscale/x86/swscale.c         |  5 +++-
>  2 files changed, 42 insertions(+), 9 deletions(-)
>
> diff --git a/libswscale/x86/range_convert.asm b/libswscale/x86/range_convert.asm
> index 13983a386b..54c2f64769 100644
> --- a/libswscale/x86/range_convert.asm
> +++ b/libswscale/x86/range_convert.asm
[...]
> @@ -66,10 +66,19 @@ cglobal %1, 2, 3, 3, dst, width, x
>      paddd            m1, m5
>      psrad            m0, %4
>      psrad            m1, %4
> +%if mmsize == 16
>      packssdw         m0, m0
>      packssdw         m1, m1
>      movq    [dstq+xq*2], m0
>      movq    [dstq+xq*2+mmsize/2], m1
> +%else
> +    vextracti128    xm7, ym0, 1
> +    packssdw        xm0, xm7
> +    vextracti128    xm7, ym1, 1
> +    packssdw        xm1, xm7
> +    movdqu  [dstq+xq*2], xm0
> +    movdqu  [dstq+xq*2+mmsize/2], xm1
> +%endif
>      add              xq, mmsize / 2
>      cmp              xd, widthd
>      jl .loop

Is there a cleaner way to do this packing in AVX2 (or a macro that
lets the AVX2 and non-AVX2 paths share the same code)? Also, is there
a cleaner way to store half of a register to memory, instead of
having to branch on mmsize with %if?
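
For reference, here is a scalar C model of the lane behaviour behind
the question. It is only a sketch and not part of the patch. It
assumes that m0 and m1 hold consecutive groups of eight 32-bit
results, as the two adjacent stores in the diff suggest. The helper
names (sat16, packssdw128, pack_extract, pack_permute) are made up
for illustration. pack_extract models the vextracti128 + packssdw
sequence from the diff. pack_permute models a possible alternative:
a 256-bit packssdw of m0 and m1, which packs each 128-bit lane
separately, followed by vpermq with immediate 0xD8 to put the
quadwords back in order.

```c
#include <stdint.h>
#include <string.h>

/* Saturate a signed 32-bit value to signed 16 bits, as packssdw does per element. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Model of 128-bit "packssdw dst, src": low half from a, high half from b. */
static void packssdw128(int16_t dst[8], const int32_t a[4], const int32_t b[4])
{
    for (int i = 0; i < 4; i++) {
        dst[i]     = sat16(a[i]);
        dst[i + 4] = sat16(b[i]);
    }
}

/* Approach in the diff: vextracti128 xm7, ym0, 1 ; packssdw xm0, xm7 */
static void pack_extract(int16_t dst[8], const int32_t m0[8])
{
    packssdw128(dst, m0, m0 + 4);
}

/* Hypothetical alternative: packssdw ym0, ym1 ; vpermq ym0, ym0, 0xD8 */
static void pack_permute(int16_t dst[16], const int32_t m0[8], const int32_t m1[8])
{
    /* vpermq with imm8 = 0xD8 selects source quadwords 0, 2, 1, 3 */
    static const int order[4] = { 0, 2, 1, 3 };
    int16_t tmp[16];

    /* 256-bit packssdw operates independently on each 128-bit lane */
    packssdw128(tmp,     m0,     m1);     /* low lane:  m0[0..3], m1[0..3] */
    packssdw128(tmp + 8, m0 + 4, m1 + 4); /* high lane: m0[4..7], m1[4..7] */

    for (int q = 0; q < 4; q++)
        memcpy(dst + 4 * q, tmp + 4 * order[q], 4 * sizeof(int16_t));
}
```

In this model, calling pack_extract on m0 and then on m1 into adjacent
halves of a buffer gives the same 16 values as one pack_permute call.
If that carries over to the real code, the asm might reduce to
packssdw m0, m1, a vpermq under %if mmsize == 32, and one full-width
store for both register sizes, but I have not tested that against the
patch.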

