[FFmpeg-devel] [PATCH] SIMD-optimized exponent_min() for ac3enc
Justin Ruggles
justin.ruggles
Mon Jan 17 01:14:48 CET 2011
On 01/16/2011 02:52 PM, Loren Merritt wrote:
> On Sun, 16 Jan 2011, Justin Ruggles wrote:
>
>>>>> + sub offset1q, 256
>>>>> + cmp offset1q, offsetq
>>>>
>>>> It is usually possible to arrange your pointers such that a loop ends with
>>>> an offset of 0, and then you can take the flags from the add/sub instead of
>>>> a separate cmp.
>>>
>>> Or check for underflow. ie jns
>>>
>>>     sub offset1q, 256
>>>     js next
>>> top:
>>>     ...
>>>     sub offset1q, 256
>>>     jns top
>>> next:
>>
>> I don't think it's as simple as that for the inner loop in this case.
>> It doesn't decrement to 0, it decrements to the first block. If I make
>> offset1 lower by 256 and decrement to 0 it works, but then I have to add
>> 256 when loading from memory, and it ends up being slower than the way I
>> have it currently.
>
> The first iteration that doesn't run is when offset1q goes negative.
> That's good enough. Just remove the cmp and change jne to jae.
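The first iteration that doesn't run is when offset1q == offsetq, and
offsetq is always in the range 0 to N-mmsize, where N is between 80 and
256, so the sub alone does not leave offset1q negative at that point.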
>
> Or for the general case, don't undo the munging in the inner loop, munge
> the base pointer. Applying that to this function produces
>
> %macro AC3_EXPONENT_MIN 1
> cglobal ac3_exponent_min_%1, 3,4,1, exp, reuse_blks, offset, expn
>     cmp reuse_blksd, 0
>     je .end
>     sal reuse_blksd, 8
>     mov expnd, reuse_blksd
> .nextexp:
>     mov offsetd, reuse_blksd
>     mova m0, [expq]
> .nextblk:
> %ifidn %1, mmx
>     PMINUB_MMX m0, [expq+offsetq], m1
> %else ; mmxext/sse2
>     pminub m0, [expq+offsetq]
> %endif
>     sub offsetd, 256
>     jae .nextblk
>     mova [expq], m0
>     add expq, mmsize
>     sub expnd, mmsize
>     jae .nextexp
> .end:
>     REP_RET
> %endmacro
>
> ... which is 6x slower on Conroe x86_64, so I must have done something wrong.
Yeah, it's wrong in several ways. The outer loop is supposed to run
offset/mmsize times (offset is 80 to 256), step mmsize. The inner loop
is supposed to run reuse_blks times, step 256, for each outer loop
iteration.
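
For reference, the intended loop structure can be sketched in scalar C.
This is only an illustration under assumptions, not the code from the
patch: the name exponent_min_ref and the BLOCK_STRIDE constant are made
up here, and it processes one byte per outer step where the SIMD version
would process mmsize bytes.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_STRIDE 256  /* assumed distance in bytes between consecutive blocks */

/* Hypothetical scalar sketch: for each of the first nb_coefs exponents in
 * block 0, take the minimum over block 0 and the following
 * num_reuse_blocks blocks, and store the result back into block 0. */
static void exponent_min_ref(uint8_t *exp, int num_reuse_blocks, int nb_coefs)
{
    assert(nb_coefs >= 0 && nb_coefs <= BLOCK_STRIDE);
    if (num_reuse_blocks <= 0)
        return;
    /* outer loop: one coefficient per step (mmsize per step in SIMD) */
    for (int i = 0; i < nb_coefs; i++) {
        uint8_t min_exp = exp[i];
        /* inner loop: runs num_reuse_blocks times, step 256 */
        for (int blk = 1; blk <= num_reuse_blocks; blk++) {
            uint8_t next = exp[i + blk * BLOCK_STRIDE];
            if (next < min_exp)
                min_exp = next;
        }
        exp[i] = min_exp;
    }
}
```

The buffer passed in must hold at least (num_reuse_blocks + 1) * 256
bytes, since the inner loop reads one row per reused block.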
Reversing the outer loop seems unrelated to what you've mentioned. I
don't see how it helps. Is it actually faster to have an extra add
instead of an offset in the load and store?
I think I get what you mean about adjusting base pointer though. I'll
try it.
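
As I understand the suggestion, it would look roughly like the following
in scalar C. This is a sketch under assumptions rather than tested asm:
the name exponent_min_countdown is hypothetical, and the per-coefficient
base pointer stands in for advancing expq by mmsize. The point is that
the inner offset then counts down to exactly 0, so in asm the flags from
the sub could end the loop without a separate cmp.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_STRIDE 256  /* assumed distance in bytes between consecutive blocks */

/* Hypothetical sketch: same result as a plain nested loop, but the outer
 * position is folded into the base pointer so the inner offset runs from
 * num_reuse_blocks*256 down to 256 and the loop exits when it hits 0. */
static void exponent_min_countdown(uint8_t *exp, int num_reuse_blocks, int nb_coefs)
{
    assert(nb_coefs >= 0 && nb_coefs <= BLOCK_STRIDE);
    if (num_reuse_blocks <= 0)
        return;
    for (int i = 0; i < nb_coefs; i++) {
        uint8_t *base = exp + i;  /* base pointer adjusted once per outer step */
        uint8_t min_exp = base[0];
        int offset = num_reuse_blocks * BLOCK_STRIDE;  /* start at the last block */
        do {
            if (base[offset] < min_exp)
                min_exp = base[offset];
            offset -= BLOCK_STRIDE;
        } while (offset != 0);    /* in asm: sub then jnz, no cmp needed */
        base[0] = min_exp;
    }
}
```

Whether the extra pointer add per outer iteration costs more than it
saves in the inner loop is something only a benchmark can answer.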
Thanks,
Justin