[FFmpeg-devel] [PATCH] SIMD-optimized exponent_min() for ac3enc

Sun Jan 16 17:56:00 CET 2011

Hi,

On 01/15/2011 01:50 AM, Frank Barchard wrote:
> On Fri, Jan 14, 2011 at 10:32 PM, Loren Merritt <lorenm at u.washington.edu> wrote:

> 
>> On Fri, 14 Jan 2011, Justin Ruggles wrote:
>>
>>  +    /* round up to even multiple of 16 */
>>> +    if (nb_coefs & 15)
>>> +        nb_coefs = (nb_coefs & ~15) + 16;
>>>
>>
>> unconditional
>> nb_coefs = FFALIGN(nb_coefs, 16);
>>
> 
> Loren is right.  But FYI if you do it yourself, it's
> nb_coefs = (nb_coefs + 15)  & ~15;

changed.
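
For anyone following along, here is a small C sketch of the rounding idiom discussed above. The helper names align_up() and align_up_conditional() are made up for illustration and are not part of the patch; the sketch only assumes that the alignment is a power of two. It compares the add-and-mask form with the conditional form from the first version of the patch.

```c
#include <assert.h>

/* Hypothetical helper for illustration: round x up to a multiple of a.
 * Only valid when a is a power of two. */
static int align_up(int x, int a)
{
    return (x + a - 1) & ~(a - 1);
}

/* The conditional form from the first version of the patch. */
static int align_up_conditional(int x)
{
    if (x & 15)
        x = (x & ~15) + 16;
    return x;
}

/* Check that both forms agree over a range of coefficient counts. */
static void check_align(void)
{
    for (int n = 0; n <= 256; n++)
        assert(align_up(n, 16) == align_up_conditional(n));
}
```

Calling check_align() should pass silently; for example, 253 rounds up to 256 and 16 stays at 16 in both forms.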

>>
>>  +%macro AC3_EXPONENT_MIN 1
>>> +cglobal ac3_exponent_min_%1, 3,4,3, exp, reuse_blks, offset, offset1
>>> +    cmp  reuse_blksq, 0
>>> +    je .end
>>> +    sal  reuse_blksq, 8
>>> +    sub      offsetq, mmsize
>>> +.nextexp:
>>> +    mov     offset1q, offsetq
>>> +    add     offset1q, reuse_blksq
>>>
>>
>> lea

changed.

>>  +    mova          m0, [expq+offsetq]
>>> +.nextblk:
>>> +    mova          m1, [expq+offset1q]
>>> +%ifidn %1, mmx
>>> +    PMINUB_MMX    m0, m1, m2
>>> +%else ; mmxext/sse/sse2
>>> +    pminub        m0, m1
>>>
>>
>> memory arg

changed.

>>  +%endif
>>> +    sub     offset1q, 256
>>> +    cmp     offset1q, offsetq
>>>
>>
>> It is usually possible to arrange your pointers such that a loop ends with
>> an offset of 0, and then you can take the flags from the add/sub instead of
>> a separate cmp.
>>
> 
> Or check for underflow.  ie jns
> 
>  sub     offset1q, 256
>  js       next
> top:
>  ...
>  sub     offset1q, 256
>  jns      top
> next:

I don't think it's as simple as that for the inner loop in this case.
It doesn't decrement to 0, it decrements to the first block.  If I make
offset1 lower by 256 and decrement to 0 it works, but then I have to add
256 when loading from memory, and it ends up being slower than the way I
have it currently.
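
To make the loop shape concrete, here is a scalar C model of the two loops as I read them from the quoted assembly. It is a sketch, not the code in the patch: the function name exponent_min_model() and the constants BLOCK_STRIDE and VEC are assumptions for illustration (VEC stands in for mmsize), and it assumes reuse_blks is non-negative and nb_coefs is already a multiple of VEC. The point is that the inner loop counts down to offset (the first block), not to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values: 256 bytes between blocks (from "sal reuse_blksq, 8"),
 * and VEC standing in for mmsize. */
enum { BLOCK_STRIDE = 256, VEC = 8 };

/* Hypothetical scalar model of the quoted loop structure. */
static void exponent_min_model(uint8_t *exp, int reuse_blks, int nb_coefs)
{
    if (reuse_blks == 0)
        return;

    ptrdiff_t span = (ptrdiff_t)reuse_blks * BLOCK_STRIDE;

    /* Outer loop: walk the first block from its last vector down to offset 0. */
    for (ptrdiff_t offset = nb_coefs - VEC; offset >= 0; offset -= VEC) {
        uint8_t m[VEC];
        for (int i = 0; i < VEC; i++)
            m[i] = exp[offset + i];

        /* Inner loop: start at the last reuse block and step back one block
         * at a time; it ends when offset1 reaches offset, not zero. */
        for (ptrdiff_t offset1 = offset + span; offset1 != offset;
             offset1 -= BLOCK_STRIDE) {
            for (int i = 0; i < VEC; i++)
                if (exp[offset1 + i] < m[i])
                    m[i] = exp[offset1 + i];
        }

        /* Store the per-coefficient minimum back into the first block. */
        for (int i = 0; i < VEC; i++)
            exp[offset + i] = m[i];
    }
}
```

Because the inner loop's end value changes with every outer iteration, ending it on a zero offset would mean biasing offset1 by 256 and adding that bias back in the load address, which is the extra cost mentioned above.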

>>  +    jne .nextblk
>>> +    mova [expq+offsetq], m0
>>> +    sub      offsetq, mmsize
>>> +    jge .nextexp
>>>
>>
> use unsigned cc if you can.  It fuses on more CPUs and does not use the
> overflow condition.
> jae nextexp

changed.

>> +.end:
>>> +    REP_RET
>>> +%endmacro
>>> +
>>> +INIT_MMX
>>> +AC3_EXPONENT_MIN mmx
>>> +AC3_EXPONENT_MIN sse_mmxext
>>>
>>
>> mmx2 is a subset of sse; nothing should ever be tagged with both. In this
>> case, you're not using sse.

ah, ok. I knew there was some overlap, but I didn't know it was a strict
subset.

>>  +%macro PMINUB_MMX 3 ; dst, src, tmp
>>> +    mova     %3, %1
>>> +    pcmpgtb  %1, %2
>>> +    pand     %2, %1
>>> +    pandn    %1, %3
>>> +    por      %1, %2
>>> +%endmacro
>>>
>>
>> I think you can simplify that using psubusb.

wow, thanks for the hint.
this works:
mova     %3, %1
psubusb  %3, %2
psubb    %1, %3

and since %2 is not written to, it can use a memory arg
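
Why this works, as a scalar C sketch of one byte lane: a saturating subtract gives max(a - b, 0), and subtracting that from a leaves b when a > b and a otherwise, which is min(a, b). The function names below are made up for illustration. Note also that pcmpgtb is a signed compare, so the earlier compare-and-mask macro is only an unsigned min for bytes below 128, while this version holds for the full 0..255 range.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of psubusb on one byte lane:
 * unsigned subtract that clamps at 0 instead of wrapping. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Scalar model of the three-instruction sequence:
 *   mova    tmp, dst
 *   psubusb tmp, src
 *   psubb   dst, tmp
 */
static uint8_t min_u8_via_sat_sub(uint8_t dst, uint8_t src)
{
    uint8_t tmp = dst;
    tmp = sat_sub_u8(tmp, src);   /* max(dst - src, 0) */
    return (uint8_t)(dst - tmp);  /* dst - max(dst - src, 0) == min(dst, src) */
}

/* Exhaustive check against a plain comparison for all byte pairs. */
static void check_min_u8(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(min_u8_via_sat_sub((uint8_t)a, (uint8_t)b) == (a < b ? a : b));
}
```

check_min_u8() covers all 65536 input pairs, so it is a complete test of the identity for a single lane.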

New patch attached.

Athlon64:
   C: 38513
 MMX:  5175
MMX2:  5430
SSE2:  2634

Atom:
   C: 98582
 MMX:  9957
MMX2:  9626
SSE2:  5623

-Justin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ac3_exponent_min.patch
Type: text/x-patch
Size: 12772 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110116/56d8f2a9/attachment.bin>