[FFmpeg-devel] [PATCH] SSE2 and SSSE3 versions of h264 biweight prediction code (biweight_h264_pixels_tab)

Ronald S. Bultje rsbultje
Thu Jul 29 18:23:01 CEST 2010


Hi,

On Thu, Jul 29, 2010 at 12:32 AM, Eli Friedman <eli.friedman at gmail.com> wrote:
> Patch attached.  Loosely based off of the MMX2 version.  Around 1%
> faster overall on a test file on my Mobile Core i5.
[..]
> +cglobal h264_biweight_8x8_ssse3, 7, 7, 8
> +    BIWEIGHT_SSSE3_SETUP
> +    mov        r3, 4
> +
> +.nextrow
> +    BIWEIGHT_SSSE3_OP r2
> +    movh       [r0], m0
> +    movhps     [r0+r2], m0
> +    lea        r0, [r0+r2*2]
> +    lea        r1, [r1+r2*2]
> +    dec        r3
> +    jnz .nextrow
> +    REP_RET

You have several unused r%d regs here, so maybe you want to do lea r4,
[r2*2] once before the loop and then add r0, r4 / add r1, r4 instead of
the two leas; that should result in slightly smaller code. Same for
h264_biweight_8x8_sse2.
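For reference, the computation that the 8x8 loop performs can be sketched in scalar C. This is a hedged model, not FFmpeg code: the function name, the argument order (mirroring r0..r6 as dst, src, stride, log2_denom, weightd, weights, offset) and the rounding formula (assumed to follow FFmpeg's C reference of that era) are assumptions. The precomputed stride2 plays the role of the suggested lea r4, [r2*2] hoisted out of the loop.

```c
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit pixel. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar model of h264_biweight_8x8. The offset/rounding
 * formula is an assumption modeled on FFmpeg's C reference. */
static void biweight_8x8_ref(uint8_t *dst, const uint8_t *src, int stride,
                             int log2_denom, int weightd, int weights,
                             int offset)
{
    const int stride2 = stride * 2;       /* like "lea r4, [r2*2]", done once */

    offset = ((offset + 1) | 1) << log2_denom;
    for (int n = 4; n > 0; n--) {         /* like "mov r3, 4" ... "dec r3" */
        for (int row = 0; row < 2; row++) {   /* two rows per iteration */
            uint8_t *d = dst + row * stride;
            const uint8_t *s = src + row * stride;
            for (int x = 0; x < 8; x++)
                d[x] = clip_uint8((s[x] * weights + d[x] * weightd + offset)
                                  >> (log2_denom + 1));
        }
        dst += stride2;                   /* "add r0, r4" instead of lea */
        src += stride2;                   /* "add r1, r4" instead of lea */
    }
}
```

With equal weights (32/32, log2_denom 5) this reduces to a rounded average of the two predictions, which makes it a convenient check against the SIMD output.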

> +%macro BIWEIGHT_SSSE3_OP 1
> +    movh       m0, [r0]
> +    movh       m1, [r1]
> +    movh       m2, [r0+%1]
> +    movh       m3, [r1+%1]
> +    punpcklbw  m0, m1
> +    punpcklbw  m2, m3

If you don't use m1/m3 afterwards, you can IIRC just do punpcklbw m0,
[r1] directly, and likewise punpcklbw m2, [r1+%1] for the line below.
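The interleave order matters because, presumably, the unquoted rest of the macro feeds these bytes to SSSE3 pmaddubsw against the interleaved (weightd, weights) pair built in m4; that part is an assumption, since it is not quoted here. A scalar model of the two instructions (helper names hypothetical) shows that each 16-bit lane then holds dst*weightd + src*weights:

```c
#include <stdint.h>

/* Scalar model of punpcklbw on the low 8 bytes: a0 b0 a1 b1 ... a7 b7. */
static void punpcklbw8(const uint8_t a[8], const uint8_t b[8], uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Scalar model of pmaddubsw: unsigned bytes of u times signed bytes of s,
 * adjacent products summed with signed 16-bit saturation. */
static void pmaddubsw16(const uint8_t u[16], const int8_t s[16], int16_t out[8])
{
    for (int i = 0; i < 8; i++) {
        int sum = u[2 * i] * s[2 * i] + u[2 * i + 1] * s[2 * i + 1];
        out[i] = (int16_t)(sum > 32767 ? 32767 : sum < -32768 ? -32768 : sum);
    }
}
```

Interleaving dst first and src second is what lets a single multiply-add produce the weighted sum per pixel, as long as the weight pair in m4 uses the same order.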

> +%macro BIWEIGHT_SSSE3_SETUP 0
[..]
> +    movd       m4, r4
> +    movd       m0, r5
> +    movd       m5, r6
> +    movd       m6, r3
> +    pslld      m5, m6
> +    psrld      m5, 1
> +    punpcklbw  m4, m0
> +    pshuflw    m4, m4, 0
> +    pshuflw    m5, m5, 0
> +    punpcklqdq m4, m4
> +    punpcklqdq m5, m5
> +%endmacro
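A scalar model of what the quoted setup lines leave in m4 and m5 may help when reviewing this part. The register roles are assumptions: r3 is the shift count (presumably log2_denom + 1 after the elided lines, given the psrld by 1), r4/r5 are weightd/weights, and r6 is the offset as prepared by the elided lines; the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical scalar model of the visible part of BIWEIGHT_SSSE3_SETUP. */
static void biweight_setup_model(unsigned r3, int r4, int r5, int r6,
                                 uint8_t m4[16], uint16_t m5[8])
{
    /* movd m5, r6 / movd m6, r3 / pslld m5, m6 / psrld m5, 1.
     * pslld yields 0 once the shift count exceeds 31. */
    uint32_t d = (uint32_t)r6;
    d = (r3 > 31) ? 0 : d << r3;
    d >>= 1;

    for (int i = 0; i < 8; i++) {
        /* movd m4, r4 / movd m0, r5 / punpcklbw m4, m0 put the byte pair
         * (weightd, weights) in word 0; pshuflw m4, m4, 0 plus
         * punpcklqdq m4, m4 copy that word into all 8 word lanes. */
        m4[2 * i]     = (uint8_t)r4;
        m4[2 * i + 1] = (uint8_t)r5;
        /* pshuflw m5, m5, 0 + punpcklqdq m5, m5: the low 16 bits of the
         * shifted offset end up in every word lane. */
        m5[i] = (uint16_t)d;
    }
}
```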

I wonder if pshufb plus some magic shuffle constant would help here
(haven't thought about this a lot, but something like pshufb m4, m4,
[pw_1] or so should do the same as pshuflw+punpcklqdq?). Not sure if
it's faster...
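To pin down which constant would work, here is a scalar model of pshufb (helper names hypothetical). Broadcasting word 0 needs the mask bytes 0,1 repeated, i.e. the little-endian word 0x0100 (256) in every lane; a word of 0x0001 in every lane selects bytes 1,0 and so swaps weightd and weights. Since the function is SSSE3-only anyway, pshufb is always available there; whether one pshufb with a memory operand beats pshuflw+punpcklqdq would still need benchmarking.

```c
#include <stdint.h>

/* Scalar model of pshufb on a 16-byte register: output byte i is
 * src[mask[i] & 15], or 0 if the top bit of mask[i] is set. */
static void pshufb16(uint8_t dst[16], const uint8_t src[16], const uint8_t mask[16])
{
    uint8_t tmp[16];                   /* allows dst == src, like pshufb m4, ... */
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0f];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Mask that broadcasts word 0: bytes 0,1,0,1,... (word 0x0100 per lane). */
static void broadcast_word0_mask(uint8_t mask[16])
{
    for (int i = 0; i < 16; i++)
        mask[i] = (uint8_t)(i & 1);
}
```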

Nice work!

Ronald


