[FFmpeg-devel] [PATCH] VP8 MMX optimizations (MC and IDCT dc_add)

Wed Jun 23 00:29:45 CEST 2010

On Tue, Jun 22, 2010 at 03:35:40PM -0400, Ronald S. Bultje wrote:
> Hi,
> 
> as per $subj.
> 
> Speed gain:
> - dc_add goes from 1800 to 1350 cycles (where 1150 is overhead,
> measured as empty asm func), so about 3-3.5x faster.
> - The MC functions are each about 4-5x faster (I only measured the 4x4
> ones, the rest I assume are similarly faster but not measured).
> - Total time spent on a shell-script that decodes the whole testsuite
> (vp8-test-vectors-r1, file 001-017) including shell overhead and
> everything goes from 2.3 to 2.1 seconds with these applied.
> 
> Results are bit-identical, and this is my first MMX/etc. ever! Thanks
> to Jason for teaching me. ;-).
> 
> Ronald

[...]
> +; 4x4 block, H-only 4-tap filter
> +cglobal put_vp8_epel4_h4_mmxext, 5, 5
> +    sub        r0, r1
> +    movd      mm4, [fourtap_filter+r4*4-4] ; set up 4tap filter in words
> +    movd      mm5, [fourtap_filter+r4*4]
> +    movq      mm7, [ff_pw_64]
> +    pxor      mm6, mm6

> +    punpckldq mm4, mm4
> +    punpckldq mm5, mm5

you could avoid these by doing them to the table
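
A rough C model of that suggestion, assuming the two "punpckldq reg, reg"
instructions do nothing more than copy the low dword of each register into
its high dword. The helper names and the tap values used with them are made
up for illustration and are not taken from the patch. The point is only that
a table stored pre-duplicated yields the same 64-bit value from a single
8-byte load:

```c
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit taps into one dword, first tap in the low word
 * (hypothetical helper, not from the patch). */
static uint32_t pack_taps(int16_t t0, int16_t t1)
{
    return (uint32_t)(uint16_t)t0 | ((uint32_t)(uint16_t)t1 << 16);
}

/* Model of "movd reg, [mem]" followed by "punpckldq reg, reg":
 * the low dword ends up in both halves of the 64-bit register. */
static uint64_t movd_then_punpckldq(uint32_t lo)
{
    return (uint64_t)lo | ((uint64_t)lo << 32);
}

/* Build the table once with every entry already duplicated, so the
 * filter function needs only one 8-byte load per entry. */
static void build_dup_table(uint64_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = movd_then_punpckldq(src[i]);
}
```

The trade-off is a table twice the size in exchange for two fewer
instructions per call.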

[...]

> +; 4x4 block, V-only 4-tap filter
> +cglobal put_vp8_epel4_v4_mmxext, 4, 5
> +    mov        r4, r5m                     ; my - FIXME prevent this on X86_64
> +    sub        r0, r1
> +    movq      mm7, [fourtap_filter+r4*4-4] ; load 4-tap filter coeffs
> +    pxor      mm6, mm6
> +    movq      mm5, [ff_pw_64]
> +
> +    ; read 3 lines
> +    sub        r1, r2
> +    movd      mm0, [r1]
> +    movd      mm1, [r1+  r2]
> +    movd      mm2, [r1+2*r2]
> +    add        r1, r2
> +    punpcklbw mm0, mm6
> +    punpcklbw mm1, mm6
> +    punpcklbw mm2, mm6
> +
> +.nextrow
> +    ; first tap
> +    pshufw    mm3, mm7, 0x0                ; splat first coeff

are you sure all these pshufw are faster than reading them from a table?

> +    pmullw    mm3, mm0
> +

> +    ; update cache for second/third already
> +    movq      mm0, mm1
> +    movq      mm1, mm2

these could be avoided by unrolling the loop but i guess that makes it
too bloated?
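
To illustrate the unrolling idea, here is a scalar C sketch for a single
column. It assumes 7-bit coefficients with +64 rounding and a shift by 7
(inferred from the use of ff_pw_64); the function names are hypothetical.
The rolled version shifts the cached rows down after every output row,
while the version unrolled by four lets the variables swap roles so that
no copies are needed. A C compiler may well do this renaming on its own,
so this is only a model of what the hand-written asm would do:

```c
#include <stdint.h>

/* One output sample of a 4-tap filter: round with +64, shift by 7,
 * clip to 0..255 (rounding and shift are assumptions, see above). */
static uint8_t filter4(const int16_t f[4], int a, int b, int c, int d)
{
    int s = f[0] * a + f[1] * b + f[2] * c + f[3] * d + 64;
    if (s < 0)
        return 0;
    s >>= 7;
    return s > 255 ? 255 : (uint8_t)s;
}

/* Rolled loop: src[0] is the row above the first output row.
 * Needs h + 3 input samples. */
static void vfilter_rolled(uint8_t *dst, const uint8_t *src, int h,
                           const int16_t f[4])
{
    int a = src[0], b = src[1], c = src[2];
    for (int y = 0; y < h; y++) {
        int d = src[y + 3];
        dst[y] = filter4(f, a, b, c, d);
        a = b;          /* these copies model the movq pairs */
        b = c;
        c = d;
    }
}

/* Unrolled by four (h must be a multiple of 4): each step loads the new
 * row into the variable that just went out of use, so nothing is copied. */
static void vfilter_unrolled(uint8_t *dst, const uint8_t *src, int h,
                             const int16_t f[4])
{
    int a = src[0], b = src[1], c = src[2], d;
    for (int y = 0; y < h; y += 4) {
        d = src[y + 3]; dst[y]     = filter4(f, a, b, c, d);
        a = src[y + 4]; dst[y + 1] = filter4(f, b, c, d, a);
        b = src[y + 5]; dst[y + 2] = filter4(f, c, d, a, b);
        c = src[y + 6]; dst[y + 3] = filter4(f, d, a, b, c);
    }
}
```

For a 4-row block this amounts to unrolling the loop completely, which is
where the code-size concern comes from.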

[...]
> +cglobal vp8_idct_dc_add_mmx, 3, 3
> +    ; load data
> +    movd       mm0, [r1]
> +    pxor       mm2, mm2
> +    mov         r1, 4
> +
> +    ; calculate DC
> +    paddw      mm0, [ff_pw_4]
> +    punpcklwd  mm0, mm0
> +    punpckldq  mm0, mm0
> +    psraw      mm0, 3
> +
> +.nextblock
> +    ; add DC
> +    movd       mm1, [r0]
> +    punpcklbw  mm1, mm2
> +    paddw      mm1, mm0
> +
> +    ; write out
> +    packuswb   mm1, mm2
> +    movd      [r0], mm1

movq    mm0, [r0]       ; load 8 pixels
paddusb mm0, mm1
psubusb mm0, mm2
movq    [r0], mm0       ; store 8 pixels

can be used to do this with 8 samples at once, aka 2 4x4 blocks
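
A scalar C sketch of why this works, under one reading of the snippet:
mm1 is assumed to hold max(dc, 0) and mm2 to hold max(-dc, 0), each clipped
to 255 and replicated into all 8 bytes. The function names are hypothetical.
An unsigned saturating add followed by an unsigned saturating subtract then
gives the same result as widening to words, adding the signed DC and
packing back with unsigned saturation:

```c
#include <stdint.h>

/* Per-byte behaviour of paddusb: unsigned add, saturating at 255. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

/* Per-byte behaviour of psubusb: unsigned subtract, saturating at 0. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Reference: widen, add the signed DC, clip (unpack/paddw/packuswb). */
static void dc_add_ref(uint8_t *dst, int n, int dc)
{
    for (int i = 0; i < n; i++)
        dst[i] = clip_u8(dst[i] + dc);
}

/* Byte-wise variant: split DC into a positive and a negative part once,
 * then use only unsigned saturating byte operations per pixel. */
static void dc_add_sat(uint8_t *dst, int n, int dc)
{
    uint8_t pos = clip_u8(dc);    /* max(dc, 0), clipped to 255 */
    uint8_t neg = clip_u8(-dc);   /* max(-dc, 0), clipped to 255 */
    for (int i = 0; i < n; i++)
        dst[i] = sat_sub_u8(sat_add_u8(dst[i], pos), neg);
}
```

Since one of the two parts is always zero, only one of the two saturating
operations changes the pixels for any given DC value.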

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

He who knows, does not speak. He who speaks, does not know. -- Lao Tsu