[FFmpeg-devel] [PATCH] VP8 MMX optimizations (MC and IDCT dc_add)
Michael Niedermayer
michaelni
Wed Jun 23 00:29:45 CEST 2010
On Tue, Jun 22, 2010 at 03:35:40PM -0400, Ronald S. Bultje wrote:
> Hi,
>
> as per $subj.
>
> Speed gain:
> - dc_add goes from 1800 to 1350 cycles (where 1150 is overhead,
> measured as empty asm func), so about 3-3.5x faster.
> - The MC functions are each about 4-5x faster (I only measured the 4x4
> ones, the rest I assume are similarly faster but not measured).
> - Total time spent on a shell-script that decodes the whole testsuite
> (vp8-test-vectors-r1, file 001-017) including shell overhead and
> everything goes from 2.3 to 2.1 seconds with these applied.
>
> Results are bit-identical, and this is my first MMX/etc. ever! Thanks
> to Jason for teaching me. ;-).
>
> Ronald
[...]
> +; 4x4 block, H-only 4-tap filter
> +cglobal put_vp8_epel4_h4_mmxext, 5, 5
> + sub r0, r1
> + movd mm4, [fourtap_filter+r4*4-4] ; set up 4tap filter in words
> + movd mm5, [fourtap_filter+r4*4]
> + movq mm7, [ff_pw_64]
> + pxor mm6, mm6
> + punpckldq mm4, mm4
> + punpckldq mm5, mm5
you could avoid these by doing them to the table
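
A rough sketch in C of what pre-expanding the table could look like. It assumes each compact entry holds four 16-bit coefficients and that the horizontal code wants every word pair repeated, which is what movd followed by punpckldq produces. The name expand_pairs and the layout are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed (hypothetical) compact layout: 4 word coefficients per filter,
 * {c0, c1, c2, c3}. */
#define N_TAPS 4

/* Expand {c0,c1,c2,c3} into {c0,c1,c0,c1, c2,c3,c2,c3}, so that one
 * 64-bit load returns a pair that is already duplicated and no
 * punpckldq is needed at run time. */
static void expand_pairs(const int16_t *compact, int16_t *expanded,
                         size_t n_filters)
{
    for (size_t f = 0; f < n_filters; f++) {
        const int16_t *c = compact + f * N_TAPS;
        int16_t *e = expanded + f * 2 * N_TAPS;
        for (int pair = 0; pair < 2; pair++) {
            for (int rep = 0; rep < 2; rep++) {
                e[pair * 4 + rep * 2 + 0] = c[pair * 2 + 0];
                e[pair * 4 + rep * 2 + 1] = c[pair * 2 + 1];
            }
        }
    }
}
```

The cost is a table twice the size in exchange for two fewer instructions per call; whether that is a win depends on cache behaviour and would need measuring.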
[...]
> +; 4x4 block, V-only 4-tap filter
> +cglobal put_vp8_epel4_v4_mmxext, 4, 5
> + mov r4, r5m ; my - FIXME prevent this on X86_64
> + sub r0, r1
> + movq mm7, [fourtap_filter+r4*4-4] ; load 4-tap filter coeffs
> + pxor mm6, mm6
> + movq mm5, [ff_pw_64]
> +
> + ; read 3 lines
> + sub r1, r2
> + movd mm0, [r1]
> + movd mm1, [r1+ r2]
> + movd mm2, [r1+2*r2]
> + add r1, r2
> + punpcklbw mm0, mm6
> + punpcklbw mm1, mm6
> + punpcklbw mm2, mm6
> +
> +.nextrow
> + ; first tap
> + pshufw mm3, mm7, 0x0 ; splat first coeff
are you sure all these pshufw are faster than reading them from a table?
> + pmullw mm3, mm0
> +
> + ; update cache for second/third already
> + movq mm0, mm1
> + movq mm1, mm2
these could be avoided by unrolling the loop but i guess that makes it
too bloated?
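
To make the unrolling idea concrete, here is a scalar C model of one column of a vertical 4-tap filter, once with the copies and once unrolled over the four rows so that the variables simply change roles. It assumes the usual (sum + 64) >> 7 rounding and plain int arithmetic, so it is not a bit-exact model of the MMX code; the function names and the coefficients used to try it out are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* One output sample from four vertically adjacent input samples.
 * Assumes >> on a negative int is an arithmetic shift. */
static uint8_t tap4(const int16_t *f, int a, int b, int c, int d)
{
    return clip_u8((f[0] * a + f[1] * b + f[2] * c + f[3] * d + 64) >> 7);
}

/* Loop form: three rows are cached and shifted down after every row,
 * which corresponds to the movq reg, reg copies in the asm. */
static void filter_col_loop(uint8_t *dst, ptrdiff_t dst_stride,
                            const uint8_t *src, ptrdiff_t src_stride,
                            const int16_t *f)
{
    int a = src[-src_stride], b = src[0], c = src[src_stride];
    for (int y = 0; y < 4; y++) {
        int d = src[(y + 2) * src_stride];
        dst[y * dst_stride] = tap4(f, a, b, c, d);
        a = b; b = c; c = d;
    }
}

/* Unrolled form: each new row overwrites the oldest variable and the
 * argument order rotates instead, so no copies are needed. */
static void filter_col_unrolled(uint8_t *dst, ptrdiff_t dst_stride,
                                const uint8_t *src, ptrdiff_t src_stride,
                                const int16_t *f)
{
    int a = src[-src_stride], b = src[0], c = src[src_stride], d;
    d = src[2 * src_stride]; dst[0 * dst_stride] = tap4(f, a, b, c, d);
    a = src[3 * src_stride]; dst[1 * dst_stride] = tap4(f, b, c, d, a);
    b = src[4 * src_stride]; dst[2 * dst_stride] = tap4(f, c, d, a, b);
    c = src[5 * src_stride]; dst[3 * dst_stride] = tap4(f, d, a, b, c);
}
```

For a 4x4 block the four rows happen to be exactly one full rotation, so the unrolled body is four copies of the tap code and nothing else. For taller blocks the code size grows accordingly, which is the bloat concern.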
[...]
> +cglobal vp8_idct_dc_add_mmx, 3, 3
> + ; load data
> + movd mm0, [r1]
> + pxor mm2, mm2
> + mov r1, 4
> +
> + ; calculate DC
> + paddw mm0, [ff_pw_4]
> + punpcklwd mm0, mm0
> + punpckldq mm0, mm0
> + psraw mm0, 3
> +
> +.nextblock
> + ; add DC
> + movd mm1, [r0]
> + punpcklbw mm1, mm2
> + paddw mm1, mm0
> +
> + ; write out
> + packuswb mm1, mm2
> + movd [r0], mm1
movq mm0, [r0]
paddusb mm0, mm1
psubusb mm0, mm2
movq [r0], mm0
can be used to do this with 8 samples at once, aka 2 4x4 blocks
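
A scalar C model of why the byte-only version gives the same result as unpack, add, pack. It assumes the DC is split beforehand into a positive part and a negative part, each clamped to 0..255 (one of the two is always zero), which is how I read mm1 and mm2 above. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Per-byte behaviour of paddusb and psubusb. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    int s = a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Reference: widen to int, add the DC, clamp back to a byte. */
static uint8_t dc_add_ref(uint8_t p, int dc)
{
    return clip_u8(p + dc);
}

/* Byte-only form: saturating add of the positive part, then
 * saturating subtract of the negative part. */
static uint8_t dc_add_bytes(uint8_t p, int dc)
{
    uint8_t pos = clip_u8(dc);   /* dc if dc > 0, else 0 */
    uint8_t neg = clip_u8(-dc);  /* -dc if dc < 0, else 0 */
    return sat_sub_u8(sat_add_u8(p, pos), neg);
}

/* Compare both forms over every pixel value and every DC that
 * (x + 4) >> 3 can produce from a 16-bit coefficient. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int dc = -4096; dc <= 4095; dc++)
        for (int p = 0; p <= 255; p++)
            bad += dc_add_ref((uint8_t)p, dc) != dc_add_bytes((uint8_t)p, dc);
    return bad;
}
```

Since nothing is widened to words, eight pixels fit in one MMX register, hence two 4x4 blocks per iteration.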
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
He who knows, does not speak. He who speaks, does not know. -- Lao Tsu