[Ffmpeg-devel] [PATCH] h264 optimization: common case hl_decode_mb

Fri Feb 23 08:08:26 CET 2007

I noticed that hl_decode_mb shows up near the top of my profiles of the
h264 decoder and is full of huge conditionals.

This patch adds a copy of the function that runs only in the
common case: no interlacing, grayscale decoding disabled, not
encoding, and not decoding SVQ3.
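
A minimal sketch of the dispatch idea, not the code from the attached diff. The mb_ctx struct, its field names, is_common_case, decode_mb and hl_decode_mb_general are hypothetical stand-ins; the two decode stubs only report which path was taken.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the decoder context; field names are illustrative. */
typedef struct {
    bool interlaced;  /* interlaced decoding in use */
    bool gray_only;   /* grayscale decoding enabled */
    bool encoding;    /* running as part of an encoder */
    bool is_svq3;     /* decoding SVQ3 rather than H.264 */
} mb_ctx;

enum mb_path { PATH_GENERAL = 0, PATH_SIMPLE = 1 };

/* Stub for the general version: keeps every conditional. */
static enum mb_path hl_decode_mb_general(const mb_ctx *c)
{
    (void)c;
    return PATH_GENERAL;
}

/* Stub for the specialized copy: may assume all four flags are false. */
static enum mb_path hl_decode_mb_simple(const mb_ctx *c)
{
    (void)c;
    return PATH_SIMPLE;
}

/* True only when none of the rare features is active. */
static bool is_common_case(const mb_ctx *c)
{
    return !c->interlaced && !c->gray_only && !c->encoding && !c->is_svq3;
}

/* Test the flags once per macroblock, then run a body without those branches. */
static enum mb_path decode_mb(const mb_ctx *c)
{
    return is_common_case(c) ? hl_decode_mb_simple(c)
                             : hl_decode_mb_general(c);
}
```

The point of the split is that the flags are checked once up front, so the specialized body can drop the branches entirely instead of re-testing them throughout.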

It has a very small but significant speed gain on my test video,
which is 1080p and 1.2MBit with I/P frames (first group without the
patch, second group with it):
BENCHMARKs: VC:  25.189s VO:   1.906s A:   0.000s Sys:   0.181s =   27.277s
BENCHMARKs: VC:  25.188s VO:   1.889s A:   0.000s Sys:   0.180s =   27.257s
BENCHMARKs: VC:  25.195s VO:   1.897s A:   0.000s Sys:   0.181s =   27.273s
BENCHMARKs: VC:  25.192s VO:   1.898s A:   0.000s Sys:   0.182s =   27.271s
avg 25.191 +/- .003162

BENCHMARKs: VC:  24.926s VO:   1.903s A:   0.000s Sys:   0.182s =   27.010s
BENCHMARKs: VC:  24.927s VO:   1.903s A:   0.000s Sys:   0.182s =   27.012s
BENCHMARKs: VC:  24.926s VO:   1.900s A:   0.000s Sys:   0.182s =   27.008s
BENCHMARKs: VC:  24.924s VO:   1.898s A:   0.000s Sys:   0.181s =   27.003s
avg 24.9258 +/- .001258
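
The "avg ... +/- ..." lines read like the mean and the sample standard deviation (n - 1 divisor) of the VC column; that interpretation is an assumption. A minimal sketch of that arithmetic follows. The names run_stats, summarize_runs and sqrt_newton are hypothetical, and the square root is a small Newton iteration only so the sketch builds without linking libm.

```c
#include <stddef.h>

/* Mean and sample standard deviation of a set of timings. */
typedef struct {
    double mean;
    double stddev;
} run_stats;

/* Square root by Newton's iteration; adequate for the small, positive
 * values seen here. Returns 0 for non-positive input. */
static double sqrt_newton(double x)
{
    if (x <= 0.0)
        return 0.0;
    double r = x > 1.0 ? x : 1.0;  /* start at or above the true root */
    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Summarize n timings in seconds; stddev stays 0 when n < 2. */
static run_stats summarize_runs(const double *t, size_t n)
{
    run_stats s = { 0.0, 0.0 };
    if (n == 0)
        return s;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += t[i];
    s.mean = sum / (double)n;
    if (n < 2)
        return s;

    double ss = 0.0;  /* sum of squared deviations from the mean */
    for (size_t i = 0; i < n; i++) {
        double d = t[i] - s.mean;
        ss += d * d;
    }
    s.stddev = sqrt_newton(ss / (double)(n - 1));
    return s;
}
```

Feeding each group of four VC times through summarize_runs gives one mean and one spread per group; the difference between the two means can then be compared against the spreads.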

This is a 2.16GHz Intel Core Duo, so I expect most other people will  
see a bigger change.

hl_decode_mb_simple is 880 instructions vs. 2018 for the general one.

_simple inlines backup_mb_border and xchg_mb_border, which still contain
checks for grayscale. For some reason, when I removed those checks it
actually got slower. My guess is that the change leaves gcc's register
allocator with more live variables at once?

Any comments on this are appreciated.

BTW, other functions that rank high in my profiles are:
* backup_mb_border and xchg_mb_border again. I don't see any easy  
wins here. All these giant arrays and pointer arithmetic can't be  
good, though.
* decode_cabac_residual is already mostly in assembler and I don't  
want to touch it; I'd like to know why the C and asm versions of  
decode_significance use different offset arrays, though.
* fill_caches. This one is also huge, and large parts are
interlacing-only. Maybe the same thing could be done as in this patch.
* filter_mb_edge*
* hl_motion has a lot of L2 cache misses even with the prefetching. I
wonder if it should be using a non-temporal prefetch (prefetchnta,
which doesn't keep data in the cache after it's used) instead of the
default one it uses now?
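
To make the prefetch question concrete, here is a minimal sketch using the GCC/Clang builtin __builtin_prefetch rather than whatever hl_motion actually does. Its third argument is a temporal-locality hint from 0 to 3; on x86, GCC typically turns 0 into prefetchnta and 3 into prefetcht0. The helper sum_block_nt and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums h rows of w bytes each, stride bytes apart,
 * hinting the next row with a non-temporal prefetch before it is read. */
static unsigned sum_block_nt(const uint8_t *src, ptrdiff_t stride, int w, int h)
{
    unsigned sum = 0;
    for (int y = 0; y < h; y++) {
        const uint8_t *row = src + y * stride;

        /* rw = 0 (read), locality = 0 (no temporal locality).
         * Using 3 here instead would request the ordinary temporal hint. */
        if (y + 1 < h)
            __builtin_prefetch(row + stride, 0, 0);

        for (int x = 0; x < w; x++)
            sum += row[x];
    }
    return sum;
}
```

A prefetch is only a hint and never changes the result, so the two variants can only be told apart by profiling cache misses and run time on the target CPU.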

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ffmpeg-hldecodemb-simple.diff
Type: application/octet-stream
Size: 8412 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070223/5c04f1dd/attachment.obj>