[FFmpeg-devel] [PATCH] avoid mb_xy recalculation in h264

Sun May 11 01:40:44 CEST 2008

On May 10, 2008, at 7:56 AM, Michael Niedermayer wrote:

> On Wed, May 07, 2008 at 10:42:04PM -0400, Alexander Strange wrote:
>>
>> On May 7, 2008, at 3:15 PM, Michael Niedermayer wrote:
>>
>>> On Wed, May 07, 2008 at 02:49:20PM -0400, Alexander Strange wrote:
>>>>
>>>> On May 7, 2008, at 8:53 AM, Michael Niedermayer wrote:
>>>>
>>>>> On Wed, May 07, 2008 at 03:05:47AM -0400, Alexander Strange wrote:
>>>>>> This line appears in a lot of h264.c functions:
>>>>>> const int mb_xy= s->mb_x + s->mb_y*s->mb_stride;
>>>>>>
>>>>>> It only changes once per MB, so we can add it to the context and
>>>>>> only
>>>>>> recalculate it in decode_slice().
>>>>>> I put it at the end of H264Context, but it could be moved up
>>>>>> earlier if
>>>>>> someone likes it there. The position doesn't seem to affect
>>>>>> performance
>>>>>> much; it can't be moved next to mb_x/mb_y since that would put it
>>>>>> in
>>>>>> MpegEncContext, and none of the other codecs use it.
>>>>>>
>>>>>> Patches:
>>>>>> 1- does that, updating it for MBAFF when needed
>>>>>> 2- removes some newly unused variables
>>>>>> 3- does the same thing to svq3.c
>>>>>>
>>>>>> Before: avg 8.799s max 8.8s min 8.794s
>>>>>> After: avg 8.645s max 8.651s min 8.641s
>>>>>
>>>>> Knowing the cpu and compiler would also be usefull.
>>>>
>>>> gcc 4.0.1, core 2 x86-32 Darwin
>>>>
>>>> I guess it's about the same as the usual distro compiler, but this
>>>> isn't
>>>> really a sensitive messing-with-the-register-allocator patch.
>>>
>>> How does the change compare to a splited (litterally duplicated)
>>> decode_cabac_residual()
>>> with the mb_xy only calculated for the "cat"s which actually need  
>>> it?
>>> That is one for cat=0/3 and one for the others
>>> or
>>> with the mb_xy calc moved in the if(cat=0/3) which use it?
>>
>> Actually, the compiled decode_cabac_residual is barely different at
>> all. gcc loads h->mb_xy, immediately spills it to the same place in
>> the stack, and the rest of the function is the same. I suppose
>> changing mb_xy to h->mb_xy for the DC cats might be an improvement.
>>
>> Diffing the asm shows:
>> * less imull
>> * hl_motion is different (mb_x load moved into an if())
>> * write_back_intra_pred_mode() is inlined and removed
>> * slightly different stack sizes (hard to know if that means  
>> anything)
>>
>> Making write_back_intra_pred_mode() always_inline doesn't change  
>> cavlc
>> time, but explains about half the speed change for cabac.
>
> My question is still the same, how does adding mb_xy into the context
> compare (speedwise/benchmark) to changing the decode_cabac_residual()
> so it does not use mb_xy 90% of the time?
>
> I think the main speed difference (besides inline changes) is in
> decode_cabac_residual()
> its kinda logic if you consider that mb_xy is calculated 16 times  
> alone
> for the luma blocks but not used at all. Changing that to loading it  
> from the
> context is not correct. decode_cabac_residual() should not touch mb_xy
> when it does not need it.
>
> That is what id like to see is
> How much speed difference there is between svn and the calculation of
> mb_xy moved into the cats which need it?
> How much speed gain we can achive by making  
> write_back_intra_pred_mode()
> inline if any.
> What speed differenc remains between the best combination of above  
> and in
> addition putting mb_xy in the context?
>
> What iam trying to do is chop the changes up into small parts and  
> see their
> individual contriution, so we do not commit something which might  
> have a
> negative speed impact by mistake.
> Also if we know if X is better inlined or not inlined we can force gcc
> to always do what is better.

I tried it with diffferent benchmarking methods and compilers (gcc 4.2).

As is -
avg 10.933s min 10.932s max 10.934s

Moving mb_xy calculation into if (cat == 0/3) -
avg 10.89s min 10.888s max 10.892s

Using h->mb_xy just in if (cat == 0/3) (attached patch) -
avg 10.888s min 10.887s max 10.891s

Original patch -
avg 10.856s min 10.855s max 10.859s

Patch + attached patch -
avg 10.826s min 10.823s max 10.828s

Doing it in pieces (only using it in hl_motion(), and then adding it  
to decode_cabac_residual()) had the expected smaller improvements.  
Plus it already did all the right inlining, so these results are  
independent of that. Not sure why the last one improved more with the  
other stuff already applied, but it's the same across another compiler  
and more videos.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 4-catmbxy.diff
Type: text/x-diff
Size: 963 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080510/7d01bc4f/attachment.diff>
-------------- next part --------------