[FFmpeg-devel] [PATCH] VP8 MBedge H/V loopfilter MMX/MMX2/SSE2

Ronald S. Bultje rsbultje
Mon Jul 19 18:44:57 CEST 2010


Hi,

as per $subj again. All were tested to be bitexact w.r.t. C.

C:
13376 dezicycles in h mbedge 16, 512 runs, 0 skips
13727 dezicycles in h mbedge 8u/v, 512 runs, 0 skips
16878 dezicycles in v mbedge 16, 512 runs, 0 skips
17732 dezicycles in v mbedge 8u/v, 512 runs, 0 skips
13295 dezicycles in h mbedge 16, 1024 runs, 0 skips
13516 dezicycles in h mbedge 8u/v, 1023 runs, 1 skips
16758 dezicycles in v mbedge 16, 1024 runs, 0 skips
17658 dezicycles in v mbedge 8u/v, 1024 runs, 0 skips
13261 dezicycles in h mbedge 16, 2048 runs, 0 skips
13440 dezicycles in h mbedge 8u/v, 2046 runs, 2 skips
16652 dezicycles in v mbedge 16, 2047 runs, 1 skips
17432 dezicycles in v mbedge 8u/v, 2047 runs, 1 skips

MMX:
5778 dezicycles in h mbedge 16, 512 runs, 0 skips
4774 dezicycles in h mbedge 8u/v, 512 runs, 0 skips
3457 dezicycles in v mbedge 16, 512 runs, 0 skips
3469 dezicycles in v mbedge 8u/v, 512 runs, 0 skips
5565 dezicycles in h mbedge 16, 1024 runs, 0 skips
4714 dezicycles in h mbedge 8u/v, 1024 runs, 0 skips
3423 dezicycles in v mbedge 16, 1024 runs, 0 skips
3442 dezicycles in v mbedge 8u/v, 1024 runs, 0 skips
5516 dezicycles in h mbedge 16, 2048 runs, 0 skips
4713 dezicycles in h mbedge 8u/v, 2048 runs, 0 skips
3420 dezicycles in v mbedge 16, 2048 runs, 0 skips
3446 dezicycles in v mbedge 8u/v, 2047 runs, 1 skips

MMX vs. C:
h16y: 2.3x faster
h8u/v: 2.8x faster
v16y: 4.8x faster
v8u/v: 5.1x faster
(as expected, v is significantly faster than h because it doesn't need
a transpose before the filter)
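
To make the transpose remark concrete, here is a minimal C sketch (not code from the patch; the helper names, TAPS and LANES are made up for illustration, and it assumes 4 pixels on each side of the edge are read). The filter wants "tap t for several edge positions" side by side in one register. For the v filter that is already how the rows are laid out in memory; for the h filter the block has to be transposed first.

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS  8 /* pixels straddling the edge in this sketch (4 per side) */
#define LANES 8 /* edge positions handled at once */

/* v filter (horizontal edge): tap t for all lanes is one contiguous run
 * of bytes in row (t - 4) relative to the edge, so each tap row can be
 * loaded directly. */
static void load_taps_v(uint8_t taps[TAPS][LANES],
                        const uint8_t *edge, ptrdiff_t stride)
{
    for (int t = 0; t < TAPS; t++)
        for (int x = 0; x < LANES; x++)
            taps[t][x] = edge[(t - TAPS / 2) * stride + x];
}

/* h filter (vertical edge): here the taps of ONE lane are contiguous,
 * so the block is transposed while loading to get the same layout. */
static void load_taps_h(uint8_t taps[TAPS][LANES],
                        const uint8_t *edge, ptrdiff_t stride)
{
    for (int y = 0; y < LANES; y++)
        for (int t = 0; t < TAPS; t++)
            taps[t][y] = edge[y * stride + (t - TAPS / 2)];
}
```

In SIMD code the inner loop of load_taps_v collapses into one register load per tap, while load_taps_h needs a series of unpack instructions on the way in and again on the way out, which is where the extra h cycles go.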

MMX2:
5487 dezicycles in h mbedge 16, 512 runs, 0 skips
4826 dezicycles in h mbedge 8u/v, 512 runs, 0 skips
3251 dezicycles in v mbedge 16, 512 runs, 0 skips
3257 dezicycles in v mbedge 8u/v, 512 runs, 0 skips
5343 dezicycles in h mbedge 16, 1024 runs, 0 skips
4669 dezicycles in h mbedge 8u/v, 1024 runs, 0 skips
3227 dezicycles in v mbedge 16, 1024 runs, 0 skips
3240 dezicycles in v mbedge 8u/v, 1024 runs, 0 skips
5336 dezicycles in h mbedge 16, 2047 runs, 1 skips
4625 dezicycles in h mbedge 8u/v, 2048 runs, 0 skips
3231 dezicycles in v mbedge 16, 2048 runs, 0 skips
3245 dezicycles in v mbedge 8u/v, 2048 runs, 0 skips

A few tens of cycles faster than MMX, as expected; not much difference otherwise.
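
As a sanity check on the arithmetic, a small C sketch that turns the timer output into cycles and ratios. It assumes, as the name suggests, that one dezicycle is a tenth of a CPU cycle; the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: one dezicycle is a tenth of a CPU cycle. */
static double dezicycles_to_cycles(uint64_t dezicycles)
{
    return dezicycles / 10.0;
}

/* Cycles saved per call when going from the slower to the faster version. */
static double cycles_saved(uint64_t slow_dezi, uint64_t fast_dezi)
{
    return dezicycles_to_cycles(slow_dezi) - dezicycles_to_cycles(fast_dezi);
}

/* Speedup factor of an optimized version over a reference version. */
static double speedup(uint64_t ref_dezi, uint64_t opt_dezi)
{
    return (double)ref_dezi / (double)opt_dezi;
}
```

With the 2048-run figures above, cycles_saved(5516, 5336) is about 18 cycles for h mbedge 16 and cycles_saved(3420, 3231) is about 19 cycles for v mbedge 16. Likewise, speedup(13376, 5778) is roughly 2.3, matching the h16y figure quoted for MMX over C.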

SSE2:

5699 dezicycles in h mbedge 16, 512 runs, 0 skips
4982 dezicycles in h mbedge 8u/v, 512 runs, 0 skips
3256 dezicycles in v mbedge 16, 511 runs, 1 skips
3284 dezicycles in v mbedge 8u/v, 512 runs, 0 skips
5442 dezicycles in h mbedge 16, 1024 runs, 0 skips
4744 dezicycles in h mbedge 8u/v, 1023 runs, 1 skips
3228 dezicycles in v mbedge 16, 1023 runs, 1 skips
3238 dezicycles in v mbedge 8u/v, 1024 runs, 0 skips
5365 dezicycles in h mbedge 16, 2048 runs, 0 skips
4646 dezicycles in h mbedge 8u/v, 2047 runs, 1 skips
3224 dezicycles in v mbedge 16, 2047 runs, 1 skips
3225 dezicycles in v mbedge 8u/v, 2048 runs, 0 skips

The H is significantly slower, as in all previous cases, so it's
disabled for "slow" SSE2 CPUs. The V is just a tad faster, so I left
it enabled even for "slow" SSE2 CPUs, just like for the other
loopfilters.
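
The selection logic described above could look roughly like the following C sketch. All names here (the CPU_* flags, LoopFilterTable, the stand-in functions) are hypothetical and are not FFmpeg's real constants or the patch's code. The sketch also assumes that a "slow" SSE2 CPU reports CPU_SSE2_SLOW instead of CPU_SSE2.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU flags for this sketch; not FFmpeg's real constants.
 * Assumption: slow-SSE2 CPUs set CPU_SSE2_SLOW and leave CPU_SSE2 clear. */
enum {
    CPU_MMX2      = 1 << 0,
    CPU_SSE2      = 1 << 1,
    CPU_SSE2_SLOW = 1 << 2
};

typedef void (*loop_filter_fn)(uint8_t *dst, ptrdiff_t stride);

typedef struct {
    loop_filter_fn v_mbedge16;
    loop_filter_fn h_mbedge16;
} LoopFilterTable;

/* Empty stand-ins for the real SIMD routines. */
static void v_mbedge16_mmx2(uint8_t *d, ptrdiff_t s) { (void)d; (void)s; }
static void h_mbedge16_mmx2(uint8_t *d, ptrdiff_t s) { (void)d; (void)s; }
static void v_mbedge16_sse2(uint8_t *d, ptrdiff_t s) { (void)d; (void)s; }
static void h_mbedge16_sse2(uint8_t *d, ptrdiff_t s) { (void)d; (void)s; }

static void init_loop_filters(LoopFilterTable *t, int cpu_flags)
{
    if (cpu_flags & CPU_MMX2) {
        t->v_mbedge16 = v_mbedge16_mmx2;
        t->h_mbedge16 = h_mbedge16_mmx2;
    }
    /* V: the SSE2 version is used on fast and slow SSE2 CPUs alike. */
    if (cpu_flags & (CPU_SSE2 | CPU_SSE2_SLOW))
        t->v_mbedge16 = v_mbedge16_sse2;
    /* H: the SSE2 version is used only where SSE2 is fast. */
    if (cpu_flags & CPU_SSE2)
        t->h_mbedge16 = h_mbedge16_sse2;
}
```

On a slow-SSE2 CPU this leaves the MMX2 H filter in place and only upgrades the V filter.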

Some notes:
- I can probably split the "write 2x4/8word" parts out into a macro,
if wanted. I might be able to reuse them for the simple loopfilter if
that turns out to be faster.
- SSE4 could use pextrw here, but I don't have an SSE4 CPU, so someone
else will have to do that and test for correctness. I am open to
donations of a new Macbook Pro that supports SSE4. ;-)
- I haven't tested pextrw->reg instead of the current movd->reg,
because the number of instructions would remain identical so I don't
think it makes a difference. Loren and Jason suggested it won't make
it faster on pre-SSE4 CPUs.
- If wanted, I can macro'ify the similarities between the setup code
and closedown code of the inner and mbedge loopfilters. This would
decrease source code size but not affect the binary. I haven't done it
yet since there are a few differences, so it wouldn't be 100%
straightforward.
- With this patch, we should be able to do some initial performance
comparisons between ffvp8 and libvpx to see how we compare, speedwise.
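
For the movd->reg versus pextrw->reg note above, here is a rough C sketch using SSE2 intrinsics. It is an illustration under assumptions, not the patch's asm: the function names are hypothetical, it only writes four 16-bit words (one per row), and the compiler is free to pick different instructions than the comments suggest. The register-destination form of pextrw is already available with SSE2; what SSE4 adds is a form that stores straight to memory.

```c
#include <emmintrin.h> /* SSE2 intrinsics; x86 only */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write the four low 16-bit words of v, one per row, using movd:
 * each movd yields two words, then the register is shifted down. */
static void write_words_movd(uint8_t *dst, ptrdiff_t stride, __m128i v)
{
    for (int i = 0; i < 2; i++) {
        uint32_t pair = (uint32_t)_mm_cvtsi128_si32(v); /* movd xmm -> reg */
        uint16_t lo = (uint16_t)pair;
        uint16_t hi = (uint16_t)(pair >> 16);
        memcpy(dst + (2 * i)     * stride, &lo, sizeof(lo));
        memcpy(dst + (2 * i + 1) * stride, &hi, sizeof(hi));
        v = _mm_srli_si128(v, 4); /* psrldq: next dword into the low lane */
    }
}

/* Same result using pextrw xmm -> reg, one word per extract. */
static void write_words_pextrw(uint8_t *dst, ptrdiff_t stride, __m128i v)
{
    uint16_t w[4];
    w[0] = (uint16_t)_mm_extract_epi16(v, 0);
    w[1] = (uint16_t)_mm_extract_epi16(v, 1);
    w[2] = (uint16_t)_mm_extract_epi16(v, 2);
    w[3] = (uint16_t)_mm_extract_epi16(v, 3);
    for (int i = 0; i < 4; i++)
        memcpy(dst + i * stride, &w[i], sizeof(w[i]));
}
```

Both variants produce the same bytes; which one is faster is a question for the timer, not something this sketch can answer.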

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vp8_simd_mbedge_loopfilter.patch
Type: application/octet-stream
Size: 28460 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100719/a55f3ee9/attachment.obj>


