[Ffmpeg-devel] [PATCH] h264 optimization: common case hl_decode_mb
Alexander Strange
astrange
Fri Feb 23 08:08:26 CET 2007
I noticed that hl_decode_mb is near the top of my profiles of the h264
decoder and is full of huge conditionals.
This patch duplicates the function and adds a new version that runs
only in the common case: no interlacing, grayscale decoding disabled,
not encoding, and not decoding SVQ3.
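A rough sketch of the dispatch idea, not the code from the attached patch: the four conditions are checked once per macroblock, and the stripped-down copy is called only when all of them hold. DecoderState, its fields, and the decode_mb_* names are hypothetical stand-ins, not the real H264Context members or decoder functions.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the decoder state. The field names are
 * illustrative only and do not match the real H264Context. */
typedef struct {
    bool interlaced;    /* interlaced decoding in use */
    bool gray;          /* grayscale-only decoding requested */
    bool encoding;      /* context belongs to an encoder */
    bool is_svq3;       /* stream is SVQ3 rather than H.264 */
    int  general_calls; /* counters so the dispatch can be observed */
    int  simple_calls;
} DecoderState;

/* General version: would keep every conditional. */
static void decode_mb_general(DecoderState *s) { s->general_calls++; }

/* Simple version: would drop the branches that cannot be taken
 * in the common case. */
static void decode_mb_simple(DecoderState *s) { s->simple_calls++; }

/* True when none of the special cases apply. */
static bool is_common_case(const DecoderState *s)
{
    return !s->interlaced && !s->gray && !s->encoding && !s->is_svq3;
}

/* Entry point: one cheap test, then the matching specialised copy. */
void decode_mb(DecoderState *s)
{
    if (is_common_case(s))
        decode_mb_simple(s);
    else
        decode_mb_general(s);
}
```

The cost is a second copy of the function to maintain; the gain is that the hot path no longer carries the branches for the rare cases.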
The patch gives a very small but significant speed gain on my test
video, which is 1080p and 1.2MBit with I/P frames:
BENCHMARKs: VC: 25.189s VO: 1.906s A: 0.000s Sys: 0.181s = 27.277s
BENCHMARKs: VC: 25.188s VO: 1.889s A: 0.000s Sys: 0.180s = 27.257s
BENCHMARKs: VC: 25.195s VO: 1.897s A: 0.000s Sys: 0.181s = 27.273s
BENCHMARKs: VC: 25.192s VO: 1.898s A: 0.000s Sys: 0.182s = 27.271s
avg 25.191 +/- .003162
BENCHMARKs: VC: 24.926s VO: 1.903s A: 0.000s Sys: 0.182s = 27.010s
BENCHMARKs: VC: 24.927s VO: 1.903s A: 0.000s Sys: 0.182s = 27.012s
BENCHMARKs: VC: 24.926s VO: 1.900s A: 0.000s Sys: 0.182s = 27.008s
BENCHMARKs: VC: 24.924s VO: 1.898s A: 0.000s Sys: 0.181s = 27.003s
avg 24.9258 +/- .001258
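For anyone who wants to check the arithmetic, here is a small sketch that computes the mean and the sample variance (n-1 denominator) of the VC times. The function names are my own, and I am assuming the "+/-" figures are sample standard deviations, i.e. the square root of this variance; that assumption reproduces both quoted values.

```c
#include <stddef.h>

/* Arithmetic mean of n samples (n must be >= 1). */
double mean(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Sample variance with the n-1 denominator (n must be >= 2).
 * The standard deviation is the square root of this value. */
double sample_variance(const double *x, size_t n)
{
    double m = mean(x, n);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - m;
        acc += d * d;
    }
    return acc / (double)(n - 1);
}
```

Fed the four VC times of each set, this gives a mean of 25.191 with variance 1.0e-5 (square root about .003162) for the first set and a mean of 24.92575 with variance about 1.583e-6 (square root about .001258) for the second.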
This is a 2.16GHz Intel Core Duo, so I expect most other people will
see a bigger change.
hl_decode_mb_simple is 880 instructions vs. 2018 for the general one.
_simple inlines backup_mb_border and xchg_mb_border, which still have
checks for grayscale. For some reason, when I removed those checks it
actually got slower. I guess this is because it gives gcc's register
allocator more live variables at once?
Any comments on this are appreciated.
BTW, other functions that rank high in my profiles are:
* backup_mb_border and xchg_mb_border again. I don't see any easy
wins here. All these giant arrays and pointer arithmetic can't be
good, though.
* decode_cabac_residual is already mostly in assembler and I don't
want to touch it; I'd like to know why the C and asm versions of
decode_significance use different offset arrays, though.
* fill_caches. This one is also huge and large parts are interlacing-
only. Maybe the same thing could be done as in this patch.
* filter_mb_edge*
* hl_motion has a lot of L2 cache misses even with the prefetching. I
wonder if it should be using non-temporal prefetch (prefetchnta,
don't keep data in the cache after it's used) instead of the default
one it does now?
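To illustrate the last point, here is a minimal sketch of a non-temporal prefetch hint using the GCC/Clang builtin __builtin_prefetch. This is not FFmpeg's prefetch code; sum_block_nt is a made-up function that only shows where such a hint would go. The third argument is the locality: 0 asks for data that need not stay in the cache after use (prefetchnta on x86 with SSE), while the default of 3 asks to keep it in all cache levels (prefetcht0).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums a width x height block of pixels and hints
 * the next row with a non-temporal prefetch before reading the current
 * one. A prefetch is only a hint and does not change the result. */
uint32_t sum_block_nt(const uint8_t *src, ptrdiff_t stride,
                      int width, int height)
{
    uint32_t sum = 0;
    for (int y = 0; y < height; y++) {
        const uint8_t *row = src + y * stride;
        if (y + 1 < height)
            __builtin_prefetch(row + stride, 0 /* read */, 0 /* non-temporal */);
        for (int x = 0; x < width; x++)
            sum += row[x];
    }
    return sum;
}
```

Whether this helps would have to be measured; reference data that is reused by neighbouring macroblocks may be better left in the cache.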
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ffmpeg-hldecodemb-simple.diff
Type: application/octet-stream
Size: 8412 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070223/5c04f1dd/attachment.obj>