[FFmpeg-devel] [PATCH] SSE2 and SSSE3 versions of h264 biweight prediction code (biweight_h264_pixels_tab)
Ronald S. Bultje
rsbultje
Thu Jul 29 18:23:01 CEST 2010
Hi,
On Thu, Jul 29, 2010 at 12:32 AM, Eli Friedman <eli.friedman at gmail.com> wrote:
> Patch attached.  Loosely based off of the MMX2 version.  Around 1%
> faster overall on a test file on my Mobile Core i5.
[..]
> +cglobal h264_biweight_8x8_ssse3, 7, 7, 8
> + BIWEIGHT_SSSE3_SETUP
> + mov r3, 4
> +
> +.nextrow
> + BIWEIGHT_SSSE3_OP r2
> + movh [r0], m0
> + movhps [r0+r2], m0
> + lea r0, [r0+r2*2]
> + lea r1, [r1+r2*2]
> + dec r3
> + jnz .nextrow
> + REP_RET
You have several unused r%d regs here; maybe you want to do lea r4,
[r2*2] once before the loop and then use add r0, r4 / add r1, r4 instead
of the two leas. That should result in slightly smaller code. Same for
h264_biweight_8x8_sse2.
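For reference, here is a minimal scalar C sketch of what the 8x8 loop computes, laid out the way the suggestion above would shape the asm: the doubled stride is computed once (the lea r4, [r2*2] step) and both pointers are advanced by it after each two-row pass (the add r0/r1, r4 step). The per-pixel formula is my reading of FFmpeg's C biweight reference, so treat it as an assumption; biweight_8x8_ref and clip_uint8 are hypothetical names, not FFmpeg API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clamp to the 0..255 pixel range. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar model of the 8x8 biweight loop.
 * Assumed semantics (C reference as I understand it):
 *   offset = ((offset + 1) | 1) << log2_denom
 *   dst    = clip((dst*weightd + src*weights + offset) >> (log2_denom + 1)) */
static void biweight_8x8_ref(uint8_t *dst, const uint8_t *src, int stride,
                             int log2_denom, int weightd, int weights,
                             int offset)
{
    const int stride2 = stride * 2;      /* computed once, like lea r4, [r2*2] */
    assert(log2_denom >= 0 && log2_denom <= 7);
    offset = ((offset + 1) | 1) << log2_denom;

    for (int pass = 0; pass < 4; pass++) {      /* mov r3, 4 ... dec r3 / jnz */
        for (int row = 0; row < 2; row++) {     /* rows at +0 and +stride */
            uint8_t *d = dst + row * stride;
            const uint8_t *s = src + row * stride;
            for (int x = 0; x < 8; x++)
                d[x] = clip_uint8((d[x] * weightd + s[x] * weights + offset)
                                  >> (log2_denom + 1));
        }
        dst += stride2;                         /* add r0, r4 */
        src += stride2;                         /* add r1, r4 */
    }
}
```

With weightd = weights = 32 and log2_denom = 5, a dst of 100 and a src of 200 come out as 150, i.e. a plain average, and columns past the 8-pixel width are left untouched.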
> +%macro BIWEIGHT_SSSE3_OP 1
> + movh m0, [r0]
> + movh m1, [r1]
> + movh m2, [r0+%1]
> + movh m3, [r1+%1]
> + punpcklbw m0, m1
> + punpcklbw m2, m3
If you don't use m1/m3 afterwards, you can IIRC just use the memory
operand directly (punpcklbw m0, [r1], and likewise punpcklbw m2,
[r1+%1] for the line below).
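To see why the interleave is the right shape: after punpcklbw m0, m1 each 16-bit lane holds one dst byte followed by the matching src byte, and punpcklbw m4, m0 in the setup macro puts weightd/weights in the same order. An SSSE3 pmaddubsw (unsigned bytes times signed bytes, adjacent pairs summed with saturation) then presumably yields dst*weightd + src*weights per pixel; that instruction is not in the quoted lines, so this is an assumption about the elided rest of the macro. The scalar C emulation below uses hypothetical helper names and plain arrays in place of registers.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates punpcklbw on 16-byte registers: interleave the low 8 bytes
 * of a and b as a0 b0 a1 b1 ... a7 b7. */
static void punpcklbw_emu(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Emulates pmaddubsw: unsigned bytes of a times signed bytes of b,
 * adjacent products summed with signed saturation into 16-bit lanes. */
static void pmaddubsw_emu(int16_t out[8], const uint8_t a[16], const int8_t b[16])
{
    for (int i = 0; i < 8; i++) {
        int sum = a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1];
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        out[i] = (int16_t)sum;
    }
}

/* Hypothetical model of one 8-pixel row: interleave dst/src (m0 = dst,
 * m1 = src), then multiply-add with the interleaved weight pairs (m4). */
static void weighted_row(int16_t out[8], const uint8_t dst[8], const uint8_t src[8],
                         int8_t weightd, int8_t weights)
{
    uint8_t a[16] = {0}, b[16] = {0}, pix[16];
    int8_t w[16];
    for (int i = 0; i < 8; i++) { a[i] = dst[i]; b[i] = src[i]; }
    punpcklbw_emu(pix, a, b);            /* m0 = dst0 src0 dst1 src1 ... */
    for (int i = 0; i < 8; i++) {        /* m4 = wd ws wd ws ... */
        w[2 * i]     = weightd;
        w[2 * i + 1] = weights;
    }
    pmaddubsw_emu(out, pix, w);
}
```

The pairing only works if the pixel bytes and weight bytes are interleaved in the same order, which is also why the memory operand that replaces m1 has to come from the src pointer rather than from the next dst row.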
> +%macro BIWEIGHT_SSSE3_SETUP 0
[..]
> + movd m4, r4
> + movd m0, r5
> + movd m5, r6
> + movd m6, r3
> + pslld m5, m6
> + psrld m5, 1
> + punpcklbw m4, m0
> + pshuflw m4, m4, 0
> + pshuflw m5, m5, 0
> + punpcklqdq m4, m4
> + punpcklqdq m5, m5
> +%endmacro
I wonder if pshufb plus some magic third value would help here
(haven't thought about this a lot, but something like pshufb m4, m4,
[pw_1] or so should do the same as pshuflw+punpcklqdq?). Not sure if
it's faster...
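A quick way to check which constant the pshufb idea would need is the scalar emulation below (plain C, hypothetical helper names), which compares pshufb against the current pshuflw 0 + punpcklqdq pair. Assuming the goal is to broadcast word 0 into all eight lanes, each mask word has to select bytes 0 and 1 in that order, i.e. the 16-bit value 0x0100 (256) per lane; a mask whose words equal 1 selects byte 1 and then byte 0, swapping the halves. This says nothing about speed, which only benchmarking can settle.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Emulates pshufb: mask byte with bit 7 set -> 0, else src[mask & 15]. */
static void pshufb_emu(uint8_t out[16], const uint8_t src[16], const uint8_t mask[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    memcpy(out, tmp, 16);
}

/* Emulates pshuflw x, x, 0 then punpcklqdq x, x: word 0 is copied into
 * words 0-3, then the low qword is duplicated into the high qword. */
static void pshuflw0_punpcklqdq_emu(uint8_t out[16], const uint8_t src[16])
{
    uint8_t tmp[16];
    memcpy(tmp, src, 16);
    for (int w = 0; w < 4; w++) {        /* pshuflw imm 0 */
        tmp[2 * w]     = src[0];
        tmp[2 * w + 1] = src[1];
    }
    memcpy(tmp + 8, tmp, 8);             /* punpcklqdq: high = low */
    memcpy(out, tmp, 16);
}

/* Fills a 16-byte mask with one little-endian 16-bit constant per word. */
static void fill_words(uint8_t mask[16], uint16_t w)
{
    for (int i = 0; i < 8; i++) {
        mask[2 * i]     = (uint8_t)(w & 0xFF);
        mask[2 * i + 1] = (uint8_t)(w >> 8);
    }
}

/* Returns 1 if pshufb with mask words equal to w matches the
 * pshuflw + punpcklqdq broadcast for the given source bytes. */
static int mask_matches(const uint8_t src[16], uint16_t w)
{
    uint8_t mask[16], a[16], b[16];
    fill_words(mask, w);
    pshufb_emu(b, src, mask);
    pshuflw0_punpcklqdq_emu(a, src);
    return memcmp(a, b, 16) == 0;
}
```

Since the same broadcast is applied to both m4 and m5, one such constant could in principle serve both shuffles, at the cost of a memory load per pshufb.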
Nice work!
Ronald