[FFmpeg-devel] [PATCH] SIMD-optimized exponent_min() for ac3enc
Justin Ruggles
Mon Jan 17 01:50:38 CET 2011
On 01/16/2011 07:14 PM, Justin Ruggles wrote:
> On 01/16/2011 02:52 PM, Loren Merritt wrote:
>
>> On Sun, 16 Jan 2011, Justin Ruggles wrote:
>>
>>>>>> + sub offset1q, 256
>>>>>> + cmp offset1q, offsetq
>>>>>
>>>>> It is usually possible to arrange your pointers such that a loop ends with
>>>>> an offset of 0, and then you can take the flags from the add/sub instead of
>>>>> a separate cmp.
>>>>
>>>> Or check for underflow, i.e. jns:
>>>>
>>>>     sub offset1q, 256
>>>>     js  next
>>>> top:
>>>>     ...
>>>>     sub offset1q, 256
>>>>     jns top
>>>> next:
>>>
>>> I don't think it's as simple as that for the inner loop in this case.
>>> It doesn't decrement to 0; it decrements to the first block. If I make
>>> offset1 lower by 256 and decrement to 0, it works, but then I have to add
>>> 256 when loading from memory, and it ends up being slower than the way I
>>> have it currently.
>>
>> The first iteration that doesn't run is when offset1q goes negative.
>> That's good enough. Just remove the cmp and change jne to jae.
>
> The first iteration that doesn't run is when offset1q == offsetq, and
> offsetq is always 0 to [80..256]-mm_size.
>
>>
>> Or for the general case, don't undo the munging in the inner loop; munge
>> the base pointer instead. Applying that to this function produces:
>>
>> %macro AC3_EXPONENT_MIN 1
>> cglobal ac3_exponent_min_%1, 3,4,1, exp, reuse_blks, offset, expn
>>     cmp   reuse_blksd, 0
>>     je    .end
>>     sal   reuse_blksd, 8
>>     mov   expnd, reuse_blksd
>> .nextexp:
>>     mov   offsetd, reuse_blksd
>>     mova  m0, [expq]
>> .nextblk:
>> %ifidn %1, mmx
>>     PMINUB_MMX m0, [expq+offsetq], m1
>> %else ; mmxext/sse2
>>     pminub m0, [expq+offsetq]
>> %endif
>>     sub   offsetd, 256
>>     jae   .nextblk
>>     mova  [expq], m0
>>     add   expq, mmsize
>>     sub   expnd, mmsize
>>     jae   .nextexp
>> .end:
>>     REP_RET
>> %endmacro
>>
>> ... which is 6x slower on Conroe x86_64, so I must have done something wrong.
>
>
> Yeah, it's wrong in several ways. The outer loop is supposed to run
> offset/mmsize times (offset is 80 to 256), step mmsize. The inner loop
> is supposed to run reuse_blks times, step 256, for each outer loop
> iteration.
>
> Reversing the outer loop seems unrelated to what you've mentioned. I
> don't see how it helps. Is it actually faster to have an extra add
> instead of an offset in the load and store?
I tried this, and while the code certainly looks cleaner, the speed is
the same.
%macro AC3_EXPONENT_MIN 1
cglobal ac3_exponent_min_%1, 3,4,1, exp, reuse_blks, offset, exp1
    cmp   reuse_blksq, 0
    je    .end
    sal   reuse_blksq, 8            ; reuse_blks *= 256 = byte offset of the last reused block
    sub   offsetq, mmsize           ; pre-decrement so the jae below runs offset/mmsize times
.nextexp:
    lea   exp1q, [expq+reuse_blksq] ; same coefficients in the last reused block
    mova  m0, [expq]                ; exponents of the first block
.nextblk:
%ifidn %1, mmx
    PMINUB_MMX m0, [exp1q], m1
%else ; mmxext/sse2
    pminub m0, [exp1q]
%endif
    sub   exp1q, 256                ; step back one block
    cmp   exp1q, expq
    jne   .nextblk                  ; stop when we are back at the first block
    mova  [expq], m0                ; store the minimum into the first block
    add   expq, mmsize
    sub   offsetq, mmsize
    jae   .nextexp
.end:
    REP_RET
%endmacro
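
For anyone following the loop structure described above, the scalar code this
asm is meant to match looks roughly like the sketch below. This is only my
paraphrase of the two loops (the function name is made up, 256 is the
inter-block stride in bytes, and the asm's "offset" argument plays the role of
nb_coefs here), not the exact C from the patch; the SIMD versions simply do
the inner minimum with pminub on mmsize coefficients at a time.

#include <stdint.h>

/* For each coefficient, take the minimum exponent over the first block
 * and the num_reuse_blocks blocks that follow it, 256 bytes apart,
 * and store it back into the first block. */
static void exponent_min_sketch(uint8_t *exp, int num_reuse_blocks, int nb_coefs)
{
    int i, blk;
    if (!num_reuse_blocks)
        return;
    for (i = 0; i < nb_coefs; i++) {
        uint8_t min_exp = exp[i];
        for (blk = 1; blk <= num_reuse_blocks; blk++) {
            uint8_t e = exp[256 * blk + i];
            if (e < min_exp)
                min_exp = e;
        }
        exp[i] = min_exp;
    }
}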
> I think I get what you mean about adjusting the base pointer, though. I'll
> try it.
That actually made it slower by about 100 dezicycles on Athlon64. Not as
bad as adjusting the offset in the inner loop, but still slower than the
extra cmp.
Also, changing those q registers to their 32-bit d forms makes it slower for
me on Athlon64.
-Justin