[FFmpeg-devel] low and not so low hanging h264 fruits
Sun Feb 14 04:20:03 CET 2010
I'd like you, dear reader, to help in optimizing h264, so please pick an
idea from below and work on it.
1. The direct temporal MV generation code (stuff like "scale * my_col + 128")
works with 2 values at once and looks like a good candidate for MMX.
This one is easy.
2. Our direct temporal & spatial MV generation code works with either 1 16x16
or 4 8x8 blocks; try adding code for the 2 block case (16x8/8x16).
This is easy too, but there is no guarantee that it's a win speedwise; it could be
slower due to being more code.
3. mb_stride, b4_stride, b8_stride, whatever, our decoder is full of them.
Change them to a named macro and make it a constant. This would reduce
the amount of reading these from the context, lower register pressure, and
change the addressing of values above and below from [2+2*b4_stride] to [constant].
This may or may not be easy.
4. Interleave code from fill_decode_caches and the mb decode functions calling
it, so that branches are reduced and code is executed less
often. An example would be Dark Shikari's suggestion of not setting
non_zero_count_cache if cbp is 0.
This will likely be an "argh, why doesn't that work" task, requiring some analysis
of where things are set and used and what can and cannot be moved where.
Also no guarantee that this is faster at all: it changes register pressure, and
more complex code accessing more different things + gcc could kill the gains.
Something similar could also be tried with the fill_filter_caches and the
Also, I might be working on parts of this, more specifically the fill+cabac
related stuff; cabac is doing some seriously redundant looking things that
I plan to work on soon.
5. Simply going over all the if()s, finding ones that are poorly predictable,
and trying to replace them by branchless code where it is faster and
6. Our motion compensation code works directly from the pictures; it is
possible that in some cases it would be faster to use an intermediate
halfpel interpolated image. This should help especially for small blocks
and bidirectionally predicted blocks.
This likely is not easy, and ideally should be adaptively selected
depending on picture content (use the last picture's motion vectors to predict
which way is better for the current picture). Or maybe have some kind of
cache that calculates and reuses halfpel values between blocks but doesn't
cause any to be calculated if no block needs them ...
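
For idea 1, here is a rough scalar sketch of the computation being discussed, not the decoder's actual code. It assumes the scale factor is 8.8 fixed point (256 means 1.0) and that the list 1 vector is the list 0 vector minus the co-located one, as in H.264 temporal direct mode. The MV struct and the function name are made up for illustration. The x and y components go through the same multiply/add/shift, which is why a packed MMX multiply looks attractive.

```c
#include <stdint.h>

/* Hypothetical motion vector type, not the decoder's own layout. */
typedef struct MV {
    int16_t x, y;
} MV;

/*
 * Scale the co-located vector for temporal direct prediction.
 * scale is 8.8 fixed point (256 == 1.0); +128 rounds to nearest.
 * Assumes >> on a negative int is an arithmetic shift, which is
 * implementation-defined in C but holds on the usual compilers.
 */
static void temporal_direct_mv(int scale, MV col, MV *l0, MV *l1)
{
    int mx = (scale * col.x + 128) >> 8;
    int my = (scale * col.y + 128) >> 8;

    l0->x = (int16_t)mx;
    l0->y = (int16_t)my;
    /* The list 1 vector is the scaled vector minus the co-located one. */
    l1->x = (int16_t)(mx - col.x);
    l1->y = (int16_t)(my - col.y);
}
```

With scale = 128 (half the temporal distance) and a co-located vector of (8, 16), this gives (4, 8) for list 0 and (-4, -8) for list 1.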
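
For idea 3, a minimal sketch of the difference between a stride read from the context and a stride that is a named constant. Everything here is hypothetical: DecCtx, B4_STRIDE and its value of 8 are placeholders, and the sketch assumes a layout whose stride really is known at compile time. With the constant, the index 2+2*stride folds into a single immediate offset and the context pointer is not needed at all.

```c
#include <stdint.h>

/* Hypothetical context holding a runtime stride. */
typedef struct DecCtx {
    int b4_stride;
} DecCtx;

/* Runtime version: the stride is loaded from the context on each call. */
static int16_t below_runtime(const DecCtx *h, const int16_t *mv)
{
    return mv[2 + 2 * h->b4_stride];
}

/* Assumed fixed stride; the value 8 is only a placeholder. */
#define B4_STRIDE 8

/* Constant version: 2 + 2*B4_STRIDE is folded by the compiler to [18]. */
static int16_t below_const(const int16_t *mv)
{
    return mv[2 + 2 * B4_STRIDE];
}
```

Both functions read the same element when the context stride equals B4_STRIDE; only the second one frees a register and a memory load.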
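
For idea 5, two generic branchless patterns as a sketch of what "replace an if() by branchless code" can look like. These are standard bit tricks, not code taken from the decoder, and the function names are made up. Whether they beat a branch depends on how predictable the branch is and on what gcc generates, so each replacement would need benchmarking.

```c
#include <stdint.h>

/* Returns a if cond is nonzero, else b, without a conditional jump. */
static int32_t select_branchless(int cond, int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(cond != 0);   /* all ones or all zeros */
    return b ^ ((a ^ b) & mask);
}

/*
 * Conditionally negate x. sign_mask must be 0 (keep x) or -1 (negate x).
 * Replaces: if (sign) x = -x;
 */
static int32_t apply_sign(int32_t x, int32_t sign_mask)
{
    return (x ^ sign_mask) - sign_mask;
}
```

For example, apply_sign(5, -1) yields -5 and apply_sign(5, 0) yields 5.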
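
For the cache variant of idea 6, a small sketch of computing halfpel values lazily so that nothing is calculated unless some block asks for it. The filter is the H.264 6-tap luma halfpel filter (1, -5, 20, 20, -5, 1) with rounding and a shift by 5. The cache itself (HalfpelRowCache, ROW_W, one row only, horizontal positions only) is a hypothetical simplification; a real one would need 2D coverage and invalidation per picture.

```c
#include <stdint.h>
#include <string.h>

#define ROW_W 16   /* hypothetical number of cached positions per row */

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Horizontal halfpel sample between src[0] and src[1].
 * Reads src[-2] .. src[3], so the caller must provide that padding. */
static uint8_t halfpel_h(const uint8_t *src)
{
    int sum = src[-2] - 5 * src[-1] + 20 * src[0]
            + 20 * src[1] - 5 * src[2] + src[3];
    return clip_u8((sum + 16) >> 5);
}

/* Hypothetical one-row cache: values are filled in on first use only. */
typedef struct HalfpelRowCache {
    const uint8_t *row;      /* pixel 0 of the row, with padding around it */
    uint8_t value[ROW_W];
    uint8_t valid[ROW_W];
    int computed;            /* how many filter evaluations were done */
} HalfpelRowCache;

static void cache_init(HalfpelRowCache *c, const uint8_t *row)
{
    memset(c, 0, sizeof(*c));
    c->row = row;
}

static uint8_t cache_get(HalfpelRowCache *c, int x)
{
    if (!c->valid[x]) {
        c->value[x] = halfpel_h(c->row + x);
        c->valid[x] = 1;
        c->computed++;
    }
    return c->value[x];
}
```

Two blocks that need the same halfpel position pay for the filter once; positions nobody requests cost nothing beyond the cleared valid flags.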
I have many more ideas; I'll post them once some of these are done.
PS: a SoC h264 optimizing project would be a good idea too, with a qualification
task of making our decoder at least 1% faster.
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Democracy is the form of government in which you can choose your dictator