[FFmpeg-devel] [PATCH] Add x86-optimized versions of exponent_min().

Fri Feb 4 02:59:48 CET 2011

On 02/03/2011 07:13 PM, Justin Ruggles wrote:

> On 02/03/2011 06:47 PM, Loren Merritt wrote:
> 
>> On Thu, 3 Feb 2011, Justin Ruggles wrote:
>>> So should we just accept what is an obvious bad case on one 
>>> configuration because there is a chance that fixing it is worse 
>>> in another?
>>
>> My expectation of the effect of this fix on the performance of the 
>> configurations you haven't benchmarked, is positive. If you don't want to 
>> benchmark them, I won't reject this patch on those grounds.
>>
>> I am merely saying that as long as you haven't identified the actual 
>> cause of the slowdowns, as long as performance is still random unto you, 
>> making decisions based on a thorough benchmark of only one compiler 
>> configuration is generalizing from one data point.
>>
>>> Even the worst case versions are 80-90% faster than the C version in the 
>>> tested configuration (x86_64 unix). Is it likely that the worst case 
>>> will be much slower in another?
>>
>> Not more than 40% slower. (Some confidence since on this question your 
>> benchmark counts as 24 data points, not 1.)
> 
> 
> I can recompile with "--extra-cflags=-m32 --extra-ldflags=-m32" and add
> 24 more data points if you think this would be useful.

Results for x86_32 (* marks the lowest value in each column):

LOOP1/LOOP2   MMX   MMX2   SSE2
-------------------------------
NONE/NONE :  5150   4640   2735
   NONE/8 :  5240   3716   2343
  NONE/16 :  5270   3713*  2360
   8/NONE :  5123   3765   2899
      8/8 :  4970   5295   2793
     8/16 :  5911   4361   2469
  16/NONE :  4902*  4860   2696
     16/8 :  5381   3922   2228
    16/16 :  5382   3954   2226*

And again, the results for x86_64:

LOOP1/LOOP2   MMX   MMX2   SSE2
-------------------------------
NONE/NONE :  5270   5283   2757
   NONE/8 :  5200   5077   2644
  NONE/16 :  5723   3961   2161
   8/NONE :  5214   5339   2787
      8/8 :  5198*  5083   2722
     8/16 :  5936   3902   2128
  16/NONE :  6613   4788   2580
     16/8 :  5490   3702   2020
    16/16 :  5474   3680*  2000*

So this is definitely not conclusive. :(

One thing that is mostly consistent is that, regardless of the alignment
of the first loop, aligning the 2nd loop gives better results for mmx2
and sse2. The exception is mmx2 on x86_32 with the first loop aligned to
8, where 8/NONE is the fastest of the three.
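
For anyone who wants to check that observation against the two tables,
here is a rough Python sketch. The numbers are copied from the tables
above; the dictionary layout and the helper name loop2_regressions are
only my own illustration and are not part of the patch or of FFmpeg. It
assumes lower numbers are better.

```python
# Benchmark numbers copied from the two tables above (assumed: lower is better).
# Keys are (LOOP1 alignment, LOOP2 alignment); values are (MMX, MMX2, SSE2).
COLUMNS = ("MMX", "MMX2", "SSE2")

X86_32 = {
    ("NONE", "NONE"): (5150, 4640, 2735),
    ("NONE", "8"):    (5240, 3716, 2343),
    ("NONE", "16"):   (5270, 3713, 2360),
    ("8", "NONE"):    (5123, 3765, 2899),
    ("8", "8"):       (4970, 5295, 2793),
    ("8", "16"):      (5911, 4361, 2469),
    ("16", "NONE"):   (4902, 4860, 2696),
    ("16", "8"):      (5381, 3922, 2228),
    ("16", "16"):     (5382, 3954, 2226),
}

X86_64 = {
    ("NONE", "NONE"): (5270, 5283, 2757),
    ("NONE", "8"):    (5200, 5077, 2644),
    ("NONE", "16"):   (5723, 3961, 2161),
    ("8", "NONE"):    (5214, 5339, 2787),
    ("8", "8"):       (5198, 5083, 2722),
    ("8", "16"):      (5936, 3902, 2128),
    ("16", "NONE"):   (6613, 4788, 2580),
    ("16", "8"):      (5490, 3702, 2020),
    ("16", "16"):     (5474, 3680, 2000),
}


def loop2_regressions(table, column):
    """Hypothetical helper: list the (LOOP1, LOOP2) settings where aligning
    LOOP2 was not faster than leaving LOOP2 unaligned with the same LOOP1."""
    idx = COLUMNS.index(column)
    regressions = []
    for loop1 in ("NONE", "8", "16"):
        baseline = table[(loop1, "NONE")][idx]
        for loop2 in ("8", "16"):
            if table[(loop1, loop2)][idx] >= baseline:
                regressions.append((loop1, loop2))
    return regressions


if __name__ == "__main__":
    for name, table in (("x86_32", X86_32), ("x86_64", X86_64)):
        for column in ("MMX2", "SSE2"):
            print(name, column, loop2_regressions(table, column))
```

An empty list means that aligning the 2nd loop helped for every
first-loop setting in that column. This only compares single
measurements, so it says nothing about run-to-run noise.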

I would be ok with doing nothing for mmx, since its results are wildly
inconsistent, and for mmx2 and sse2 either aligning only the 2nd loop or
aligning both loops.

-Justin