[FFmpeg-devel] [PATCH] SIMD-optimized exponent_min() for ac3enc
Loren Merritt
lorenm
Sun Jan 16 20:52:28 CET 2011
On Sun, 16 Jan 2011, Justin Ruggles wrote:
>>>> + sub offset1q, 256
>>>> + cmp offset1q, offsetq
>>>
>>> It is usually possible to arrange your pointers such that a loop ends with
>>> an offset of 0, and then you can take the flags from the add/sub instead of
>>> a separate cmp.
>>
>> Or check for underflow. ie jns
>>
>> sub offset1q, 256
>> js next
>> top:
>> ...
>> sub offset1q, 256
>> jns top
>> next:
>
> I don't think it's as simple as that for the inner loop in this case.
> It doesn't decrement to 0, it decrements to the first block. If I make
> offset1 lower by 256 and decrement to 0 it works, but then I have to add
> 256 when loading from memory, and it ends up being slower than the way I
> have it currently.
The first iteration that doesn't run is when offset1q goes negative.
That's good enough. Just remove the cmp and change jne to jae.
Or, for the general case, don't undo the munging in the inner loop; munge
the base pointer instead. Applying that to this function produces:
%macro AC3_EXPONENT_MIN 1
cglobal ac3_exponent_min_%1, 3,4,1, exp, reuse_blks, offset, expn
    cmp  reuse_blksd, 0
    je .end
    sal  reuse_blksd, 8
    mov  expnd, reuse_blksd
.nextexp:
    mov  offsetd, reuse_blksd
    mova m0, [expq]
.nextblk:
%ifidn %1, mmx
    PMINUB_MMX m0, [expq+offsetq], m1
%else ; mmxext/sse2
    pminub m0, [expq+offsetq]
%endif
    sub  offsetd, 256
    jae .nextblk
    mova [expq], m0
    add  expq, mmsize
    sub  expnd, mmsize
    jae .nextexp
.end:
    REP_RET
%endmacro
... which is 6x slower on Conroe x86_64, so I must have done something wrong.
--Loren Merritt
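
Editorial sketch, not part of the original message: a minimal scalar C
model of the inner-loop shape discussed above, where the block offset
counts down in steps of 256 and the loop exits on the first subtraction
that goes negative, so no separate compare is needed. The function name
exponent_min_model and the ncoefs parameter are hypothetical and chosen
for illustration only; the 256-byte block stride is taken from the posted
code. The outer loop is written as a plain index loop and makes no attempt
to mirror the posted macro's outer counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model; not FFmpeg's actual API.
 * exp points at block 0, and block b starts at exp + 256 * b.
 * On return, exp[i] holds the minimum of exp[i] over blocks 0..reuse_blks
 * for every i in [0, ncoefs). */
static void exponent_min_model(uint8_t *exp, int reuse_blks, int ncoefs)
{
    assert(exp != NULL && reuse_blks >= 0);
    if (reuse_blks == 0)
        return;                      /* mirrors the early "je .end" */

    for (int i = 0; i < ncoefs; i++) {
        uint8_t min_exp = exp[i];
        ptrdiff_t offset = (ptrdiff_t)reuse_blks * 256;
        do {
            uint8_t e = exp[i + offset];
            if (e < min_exp)
                min_exp = e;
            offset -= 256;           /* the "sub" whose flags end the loop */
        } while (offset >= 0);       /* first pass that does not run: offset < 0 */
        exp[i] = min_exp;
    }
}
```

The final pass runs with offset equal to 0 and compares block 0 against
itself, which is redundant but harmless; that is the price of testing the
result of the subtraction directly instead of comparing against a bound.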