[FFmpeg-devel] [PATCH] unroll loop in h264_idct_add8_sse2()
Ronald S. Bultje
rsbultje
Sat Sep 18 19:22:08 CEST 2010
Hi,
see attached. Unrolling the loop doesn't help much by itself, but it
lets us stop looking up scan8[] at runtime (the values are inlined
directly in the code instead), which has a positive effect on
performance. The same trick speeds up idct_add16() by approximately
25%, so applying it elsewhere appears to make sense.
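To illustrate the idea only (the real function is x86 assembly and the
attached patch is the reference), here is a minimal C sketch of the
transformation. Everything in it is a stand-in: demo_scan8[], its
values, and process_block() are hypothetical names invented for this
example, not the actual FFmpeg table or code.

```c
#include <stdint.h>

/* Hypothetical index table standing in for scan8[]; values are illustrative. */
static const uint8_t demo_scan8[4] = { 9, 10, 17, 18 };

/* Hypothetical per-block work: report whether the block has coefficients. */
static int process_block(const uint8_t *nnz, int idx)
{
    return nnz[idx] != 0;
}

/* Loop version: every iteration loads its index from the table. */
static int count_coded_loop(const uint8_t *nnz)
{
    int n = 0;
    for (int i = 0; i < 4; i++)
        n += process_block(nnz, demo_scan8[i]);
    return n;
}

/* Unrolled version: the table values are written out as constants,
 * so no table load is needed at runtime. */
static int count_coded_unrolled(const uint8_t *nnz)
{
    int n = 0;
    n += process_block(nnz, 9);
    n += process_block(nnz, 10);
    n += process_block(nnz, 17);
    n += process_block(nnz, 18);
    return n;
}
```

Both versions return the same result for any input; the unrolled one
simply trades the table lookup for immediate constants, which is the
part that pays off in the assembly.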
Tested on OS X 10.6.4 with the cathedral sample.
before
943 dezicycles in idct_add8, 131047 runs, 25 skips
911 dezicycles in idct_add8, 262110 runs, 34 skips
848 dezicycles in idct_add8, 524244 runs, 44 skips
767 dezicycles in idct_add8, 1048521 runs, 55 skips
time:
8.297
8.269
8.307
8.286
8.330
(avg 8.298)
after
691 dezicycles in idct_add8, 131066 runs, 6 skips
670 dezicycles in idct_add8, 262136 runs, 8 skips
646 dezicycles in idct_add8, 524275 runs, 13 skips
608 dezicycles in idct_add8, 1048552 runs, 24 skips
(i.e. ~21-27% fewer cycles in this function, depending on the run count)
time:
8.178
8.287
8.158
8.277
8.267
(avg 8.233 = ~0.8% faster overall)
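For anyone who wants to re-check the averages and percentages quoted
here, the arithmetic can be sketched as follows. The helper names
mean() and percent_faster() are made up for this example; "faster" is
taken to mean the relative reduction, (before - after) / before.

```c
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Relative reduction from before to after, in percent. */
static double percent_faster(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

Feeding it the five wall-clock times from each run gives the two
averages, and percent_faster() on those averages gives the overall
figure; the same helper applied to a before/after pair of cycle counts
gives the per-function figure.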
The same trick can likely be applied to add16intra as well. (It could
likely also be done for pre-SSE2, but I doubt that's used much in
reality...)
Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: h264-idct-inline.patch
Type: application/octet-stream
Size: 2384 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100918/f7017e07/attachment.obj>