[FFmpeg-devel] [PATCH] VP8 luma(16) inner-MB H/V loopfilter MMX/SSE2

Sun Jul 11 17:53:15 CEST 2010

Hi,

as per $subj. All tested to be identical to C reference. If wanted, I
can try to share parts of the filter code with the simple loopfilter,
but I'm a little scared that it'll turn into massive spaghetti so I
didn't do it yet.

All speeds measured on a CoreDuo 2GHz, OS X 10.6, gcc 4.2.1-5646 (Apple build).

V loopfilter (this is the simple case, just loopfilter + store):
C: 4156 cycles
mmx: 640 cycles
mmxext: 638 cycles
sse2: 612 cycles
I.e. SIMD is 6.5-6.8x faster than C. You'll also notice that sse2 isn't
much faster than mmx/mmxext, as was the case in previous optimization
attempts for other functions. On real SSE2 CPUs, this should be a lot
faster than on my shitty CPU.

H loopfilter (MMX/SSE2: 8x8/8x16 transpose, V loopfilter, 4x4 transpose+store):
C: 3457 cycles
mmx: 844 cycles
mmxext: 830 cycles
sse2: 948 cycles
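
To illustrate the H loopfilter structure described above (transpose, run
the V loopfilter, transpose back), here is a minimal scalar C sketch. It
is not the code from the patch: toy_filter_pair is a made-up stand-in for
the real VP8 filter, the block is a toy 8x8, and all names are
hypothetical. For simplicity it transposes the whole block back instead
of doing a smaller transpose+store. It only shows that filtering across a
vertical edge gives the same result as transposing, filtering across a
horizontal edge, and transposing back.

```c
#include <stdint.h>
#include <string.h>

#define N 8  /* toy block size, chosen for illustration only */

/* Toy edge filter (NOT the VP8 filter): pulls the two pixels next to
 * an edge towards each other by a quarter of their difference. */
static void toy_filter_pair(uint8_t *p0, uint8_t *q0)
{
    int d = (*q0 - *p0) / 4;
    *p0 = (uint8_t)(*p0 + d);
    *q0 = (uint8_t)(*q0 - d);
}

/* "V" filter: the edge is horizontal (between rows N/2-1 and N/2), so
 * the pixels along the edge are contiguous in memory. */
static void toy_v_filter(uint8_t blk[N][N])
{
    for (int x = 0; x < N; x++)
        toy_filter_pair(&blk[N / 2 - 1][x], &blk[N / 2][x]);
}

/* "H" filter done directly: the edge is vertical, so each pixel pair
 * lives in a different row. */
static void toy_h_filter_direct(uint8_t blk[N][N])
{
    for (int y = 0; y < N; y++)
        toy_filter_pair(&blk[y][N / 2 - 1], &blk[y][N / 2]);
}

static void transpose(uint8_t dst[N][N], uint8_t src[N][N])
{
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            dst[x][y] = src[y][x];
}

/* "H" filter via transpose: turn the vertical edge into a horizontal
 * one, reuse the V filter, then transpose the result back. */
static void toy_h_filter_transposed(uint8_t blk[N][N])
{
    uint8_t tmp[N][N];
    transpose(tmp, blk);
    toy_v_filter(tmp);
    transpose(blk, tmp);
}

/* Returns 1 if both H filter paths produce identical output. */
static int h_filter_paths_match(void)
{
    uint8_t a[N][N], b[N][N];
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            a[y][x] = b[y][x] = (uint8_t)(y * 37 + x * 11);
    toy_h_filter_direct(a);
    toy_h_filter_transposed(b);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

In a SIMD version the point of the transpose is that the V filter can
then process a whole row of edge pixels per register; the scalar code
above only demonstrates that the two paths are equivalent.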

You'll notice that sse2 is significantly slower than mmx/mmxext for the
H loopfilter. My rough guess is that this is down to two things: my
shitty CPU pretty much emulates xmm-ops through mmx-ops, so sse2 doesn't
add much benefit other than not having to set up the loop for the second
8 pixels; and the 8x16 transpose before the actual filter adds
complexity. I'm betting that on an actual sse2-supporting CPU (Jason?)
this would still be faster, but we might want to put this under a
FF_MM_SSE2_NOT_SHITTY flag or something along those lines. If you think
my code is shitty, comments are welcome also. ;-)
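
The flag idea could look roughly like the dispatch sketch below. Every
name in it (CPU_SSE2_SLOW, the Impl enum, pick_h_loop_filter) is
hypothetical and is not an existing FFmpeg flag or function; it only
illustrates preferring mmxext over sse2 on CPUs where sse2 is known to
be slow.

```c
/* Hypothetical CPU capability bits, for illustration only. */
enum {
    CPU_MMX       = 1 << 0,
    CPU_MMXEXT    = 1 << 1,
    CPU_SSE2      = 1 << 2,
    CPU_SSE2_SLOW = 1 << 3  /* SSE2 is available but not worth using */
};

/* Which implementation of the H loopfilter to install. */
typedef enum { IMPL_C, IMPL_MMX, IMPL_MMXEXT, IMPL_SSE2 } Impl;

/* Pick the best implementation, skipping sse2 when it is flagged slow. */
static Impl pick_h_loop_filter(unsigned cpu_flags)
{
    Impl impl = IMPL_C;

    if (cpu_flags & CPU_MMX)
        impl = IMPL_MMX;
    if (cpu_flags & CPU_MMXEXT)
        impl = IMPL_MMXEXT;
    if ((cpu_flags & CPU_SSE2) && !(cpu_flags & CPU_SSE2_SLOW))
        impl = IMPL_SSE2;

    return impl;
}
```

With this, a CPU reporting CPU_MMX | CPU_MMXEXT | CPU_SSE2 | CPU_SSE2_SLOW
would get the mmxext version, while one without the slow bit would get
sse2. The V loopfilter could ignore the slow bit, since sse2 was not
slower there in the numbers above.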

I'll be working on the MBedge luma loopfilter next, and will leave the
chroma ones for last since I might change function prototypes a
little...

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vp8_inner_loop_filter16.patch
Type: application/octet-stream
Size: 16254 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100711/41c4db83/attachment.obj>