[FFmpeg-devel] [PATCH]Fix alpha for iff deep

Sun Mar 11 16:25:08 CET 2012

On 11 Mar 2012, at 15:01, "Don Moir" <donmoir at comcast.net> wrote:
>>> 
>>> Seems to be a strange lack of any optimization.
>>> 
>>> if you look at this code in the patch:
>>> 
>>> +        for(x = 0; x < avctx->width; x++)
>>> +            row[4 * x + 3] = row[4 * x + 3] & 0xF0 | (row[4 * x + 3] >> 4);
>>> 
>>> would be better if:
>>> 
>>> +      row += 3;
>>> +      for(x = 0; x < avctx->width; x++, row+=4)
>>> +          *row = *row & 0xF0 | (*row >> 4);
>>> 
>>> But this is more of a minor point then the rest of the code being
>>> less than optimal in iff.c. It should be obvious but I could
>>> elaborate. I bring this up because sometimes optimization issues are
>>> sometime nic-picked to death and sometimes not.
>> 
>> There are several reasons for that.
>> In the case that you mention, it is because the code you propose
>> can easily be slower than the current code, particularly
>> on register-starved CISC configurations like 32 bit x86 with PIC.
>> The compiler might be able to encode the 4*x+3 into a single
>> instruction in the first case (thus it is "free"), whereas your
>> proposal might consume an extra register.
> 
> The MS compiler does a pretty good job on optimizing the above and the difference was pretty minor.
> 
> Assuming row is in eax:
> 
> In this case MS added 7 to the row offset first so reason you see BYTE PTR [eax - 4]
> 
> row[4 * x + 3] = row[4 * x + 3] & 0xF0 | (row[4 * x + 3] >> 4); compiled as:
> 
> mov  cl, BYTE PTR [eax-4]
> mov  dl, cl
> and  dl, 240   ; 000000f0H
> shr  cl, 4
> or  dl, cl
> mov  BYTE PTR [eax-4], dl
> 
> In this case the hints to compiler was enough to reduce [eax - 4] to [eax]

Which is completely pointless since both should have 100% the same performance.
Actually you didn't post the full loop, so it is unclear how the loop condition was encoded, which is the other point that compilers easily do badly when they don't understand the loop.

> I did say I thought this was a minor case because I figured the compiler would take care of the above case mostly.

I wanted to make clear that your suggested code is further away from the standard loop constructs and I've seen compiler create significantly worse code after that kind of "optimization".
Which is why I generally don't suggest that kind of thing any more, at least not without verifying a significant improvement with some compiler.