[FFmpeg-devel] [PATCH] SIMD-optimized exponent_min() for ac3enc
Justin Ruggles
Mon Jan 17 01:50:38 CET 2011
On 01/16/2011 07:14 PM, Justin Ruggles wrote:
> On 01/16/2011 02:52 PM, Loren Merritt wrote:
>
>> On Sun, 16 Jan 2011, Justin Ruggles wrote:
>>
>>>>>> + sub offset1q, 256
>>>>>> + cmp offset1q, offsetq
>>>>>
>>>>> It is usually possible to arrange your pointers such that a loop ends with
>>>>> an offset of 0, and then you can take the flags from the add/sub instead of
>>>>> a separate cmp.
>>>>
>>>> Or check for underflow, i.e. jns:
>>>>
>>>>     sub offset1q, 256
>>>>     js  next
>>>> top:
>>>>     ...
>>>>     sub offset1q, 256
>>>>     jns top
>>>> next:
>>>
>>> I don't think it's as simple as that for the inner loop in this case.
>>> It doesn't decrement to 0; it decrements to the first block. If I make
>>> offset1 lower by 256 and decrement to 0, it works, but then I have to add
>>> 256 when loading from memory, and it ends up being slower than the way I
>>> have it currently.
>>
>> The first iteration that doesn't run is when offset1q goes negative.
>> That's good enough. Just remove the cmp and change jne to jae.
>
> The first iteration that doesn't run is when offset1q == offsetq, and
> offsetq is always 0 to [80..256]-mm_size.
>
>>
>> Or for the general case, don't undo the munging in the inner loop; munge
>> the base pointer instead. Applying that to this function produces:
>>
>> %macro AC3_EXPONENT_MIN 1
>> cglobal ac3_exponent_min_%1, 3,4,1, exp, reuse_blks, offset, expn
>>     cmp   reuse_blksd, 0
>>     je    .end
>>     sal   reuse_blksd, 8
>>     mov   expnd, reuse_blksd
>> .nextexp:
>>     mov   offsetd, reuse_blksd
>>     mova  m0, [expq]
>> .nextblk:
>> %ifidn %1, mmx
>>     PMINUB_MMX m0, [expq+offsetq], m1
>> %else ; mmxext/sse2
>>     pminub m0, [expq+offsetq]
>> %endif
>>     sub   offsetd, 256
>>     jae   .nextblk
>>     mova  [expq], m0
>>     add   expq, mmsize
>>     sub   expnd, mmsize
>>     jae   .nextexp
>> .end:
>>     REP_RET
>> %endmacro
>>
>> ... which is 6x slower on Conroe x86_64, so I must have done something wrong.
>
>
> Yeah, it's wrong in several ways. The outer loop is supposed to run
> offset/mmsize times (offset is 80 to 256), step mmsize. The inner loop
> is supposed to run reuse_blks times, step 256, for each outer loop
> iteration.
>
> Reversing the outer loop seems unrelated to what you've mentioned. I
> don't see how it helps. Is it actually faster to have an extra add
> instead of an offset in the load and store?
I tried this, and while the code certainly looks cleaner, the speed is
the same.
%macro AC3_EXPONENT_MIN 1
cglobal ac3_exponent_min_%1, 3,4,1, exp, reuse_blks, offset, exp1
    cmp   reuse_blksq, 0
    je    .end
    sal   reuse_blksq, 8            ; reuse_blks *= 256 = byte offset of the last reused block
    sub   offsetq, mmsize           ; pre-decrement so the jae below runs offset/mmsize times
.nextexp:
    lea   exp1q, [expq+reuse_blksq] ; same coefficients in the last reused block
    mova  m0, [expq]                ; exponents of the first block
.nextblk:
%ifidn %1, mmx
    PMINUB_MMX m0, [exp1q], m1
%else ; mmxext/sse2
    pminub m0, [exp1q]
%endif
    sub   exp1q, 256                ; step back one block
    cmp   exp1q, expq
    jne   .nextblk                  ; stop when we are back at the first block
    mova  [expq], m0                ; store the minimum into the first block
    add   expq, mmsize
    sub   offsetq, mmsize
    jae   .nextexp
.end:
    REP_RET
%endmacro
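
For anyone following the loop structure described above, the scalar code this
asm is meant to match looks roughly like the sketch below. This is only my
paraphrase of the two loops (the function name is made up, 256 is the
inter-block stride in bytes, and the asm's "offset" argument plays the role of
nb_coefs here), not the exact C from the patch; the SIMD versions simply do
the inner minimum with pminub on mmsize coefficients at a time.

#include <stdint.h>

/* For each coefficient, take the minimum exponent over the first block
 * and the num_reuse_blocks blocks that follow it, 256 bytes apart,
 * and store it back into the first block. */
static void exponent_min_sketch(uint8_t *exp, int num_reuse_blocks, int nb_coefs)
{
    int i, blk;
    if (!num_reuse_blocks)
        return;
    for (i = 0; i < nb_coefs; i++) {
        uint8_t min_exp = exp[i];
        for (blk = 1; blk <= num_reuse_blocks; blk++) {
            uint8_t e = exp[256 * blk + i];
            if (e < min_exp)
                min_exp = e;
        }
        exp[i] = min_exp;
    }
}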
> I think I get what you mean about adjusting the base pointer, though. I'll
> try it.
That actually made it slower by about 100 dezicycles on Athlon64. Not as
bad as adjusting the offset in the inner loop, but still slower than the
extra cmp.
Also, changing those q registers to their 32-bit d forms makes it slower for
me on Athlon64.
-Justin