[FFmpeg-devel] [PATCH] SIMD-optimized exponent_min() for ac3enc
Justin Ruggles
justin.ruggles
Sun Jan 16 17:56:00 CET 2011
Hi,
On 01/15/2011 01:50 AM, Frank Barchard wrote:
> On Fri, Jan 14, 2011 at 10:32 PM, Loren Merritt <lorenm at u.washington.edu> wrote:
>
>> On Fri, 14 Jan 2011, Justin Ruggles wrote:
>>
>> + /* round up to even multiple of 16 */
>>> + if (nb_coefs & 15)
>>> + nb_coefs = (nb_coefs & ~15) + 16;
>>>
>>
>> unconditional
>> nb_coefs = FFALIGN(nb_coefs, 16);
>>
>
> Loren is right. But FYI if you do it yourself, it's
> nb_coefs = (nb_coefs + 15) & ~15;
changed.
>>
>> +%macro AC3_EXPONENT_MIN 1
>>> +cglobal ac3_exponent_min_%1, 3,4,3, exp, reuse_blks, offset, offset1
>>> + cmp reuse_blksq, 0
>>> + je .end
>>> + sal reuse_blksq, 8
>>> + sub offsetq, mmsize
>>> +.nextexp:
>>> + mov offset1q, offsetq
>>> + add offset1q, reuse_blksq
>>>
>>
>> lea
changed.
>> + mova m0, [expq+offsetq]
>>> +.nextblk:
>>> + mova m1, [expq+offset1q]
>>> +%ifidn %1, mmx
>>> + PMINUB_MMX m0, m1, m2
>>> +%else ; mmxext/sse/sse2
>>> + pminub m0, m1
>>>
>>
>> memory arg
changed.
>> +%endif
>>> + sub offset1q, 256
>>> + cmp offset1q, offsetq
>>>
>>
>> It is usually possible to arrange your pointers such that a loop ends with
>> an offset of 0, and then you can take the flags from the add/sub instead of
>> a separate cmp.
>>
>
> Or check for underflow. ie jns
>
> sub offset1q, 256
> js next
> top:
> ...
> sub offset1q, 256
> jns top
> next:
I don't think it's as simple as that for the inner loop in this case.
It doesn't decrement to 0, it decrements to the first block. If I make
offset1 lower by 256 and decrement to 0 it works, but then I have to add
256 when loading from memory, and it ends up being slower than the way I
have it currently.
>> + jne .nextblk
>>> + mova [expq+offsetq], m0
>>> + sub offsetq, mmsize
>>> + jge .nextexp
>>>
>>
> use unsigned cc if you can. It fuses on more CPUs and does not use the
> overflow condition.
> jae nextexp
changed.
>> +.end:
>>> + REP_RET
>>> +%endmacro
>>> +
>>> +INIT_MMX
>>> +AC3_EXPONENT_MIN mmx
>>> +AC3_EXPONENT_MIN sse_mmxext
>>>
>>
>> mmx2 is a subset of sse; nothing should ever be tagged with both. In this
>> case, you're not using sse.
ah, ok. I knew there was some overlap, but I didn't know it was a strict
subset.
>> +%macro PMINUB_MMX 3 ; dst, src, tmp
>>> + mova %3, %1
>>> + pcmpgtb %1, %2
>>> + pand %2, %1
>>> + pandn %1, %3
>>> + por %1, %2
>>> +%endmacro
>>>
>>
>> I think you can simplify that using psubusb.
wow, thanks for the hint.
this works:
mova %3, %1
psubusb %3, %2
psubb %1, %3
and since %2 is not written to, it can use a memory arg
New patch attached.
Athlon64:
C: 38513
MMX: 5175
MMX2: 5430
SSE2: 2634
Atom:
C: 98582
MMX: 9957
MMX2: 9626
SSE2: 5623
-Justin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ac3_exponent_min.patch
Type: text/x-patch
Size: 12772 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110116/56d8f2a9/attachment.bin>