[FFmpeg-devel] [PATCH] h264pred16x16 plane sse2/ssse3 optimizations

Ronald S. Bultje rsbultje
Wed Oct 6 00:07:01 CEST 2010


Hi,

On Sat, Oct 2, 2010 at 11:37 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> On Thu, Sep 30, 2010 at 10:08 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>> On Wed, Sep 29, 2010 at 9:17 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>> On Wed, Sep 29, 2010 at 08:56:13PM -0400, Ronald S. Bultje wrote:
>>>> On Wed, Sep 29, 2010 at 8:51 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>>> > On Tue, Sep 28, 2010 at 10:31:51PM -0400, Ronald S. Bultje wrote:
>>>> >> +    lea          r4, [r0+r2*8-1]
>>>> >> +    lea          r3, [r0+r2*4-1]
>>>> >> +    add          r4, r2
>>>> >> +
>>>> >> +%ifdef ARCH_X86_64
>>>> >> +%define e_reg r11
>>>> >> +%else
>>>> >> +%define e_reg r0
>>>> >> +%endif
>>>> >> +
>>>> >
>>>> > I see a lot of r0-1; maybe r0 could be decreased by 1 somewhere?
>>>>
>>>> Yes, this is actually both smaller/simpler and also faster. Changed.
>>>>
>>>> >> +    movzx     e_reg, byte [r3+r1    ]
>>>> >> +    movzx        r5, byte [r4+r2*2  ]
>>>> >> +    sub          r5, e_reg
>>>> >> +    shl          r5, 2
>>>> >> +
>>>> >> +    movzx     e_reg, byte [r3       ]
>>>> >> +    movzx        r6, byte [r4+r2    ]
>>>> >> +    sub          r6, e_reg
>>>> >> +    lea          r5, [r5+r6*4]
>>>> >> +    sub          r5, r6
>>>> >> +
>>>> >> +    movzx     e_reg, byte [r3+r2    ]
>>>> >> +    movzx        r6, byte [r4       ]
>>>> >> +    sub          r6, e_reg
>>>> >> +    lea          r5, [r5+r6*2]
>>>> >> +
>>>> >> +    movzx     e_reg, byte [r3+r2*2  ]
>>>> >> +    movzx        r6, byte [r4+r1    ]
>>>> >> +    sub          r6, e_reg
>>>> >> +    add          r5, r6
>>>> >
>>>> > this and the shl 2 case look like they could be merged like
>>>> > add+shl->lea
>>>>
>>>> Also changed.
>>>>
>>>> >> +    lea          r3, [r4+r2*4  ]
>>>> >> +
>>>> >> +    movzx     e_reg, byte [r0+r1  -1]
>>>> >> +    movzx        r6, byte [r3+r2*2  ]
>>>> >> +    sub          r6, e_reg
>>>> >> +    lea          r5, [r5+r6*8]
>>>> >> +
>>>> >> +    movzx     e_reg, byte [r0     -1]
>>>> >> +    movzx        r6, byte [r3+r2    ]
>>>> >> +    sub          r6, e_reg
>>>> >> +    lea          r5, [r5+r6*8]
>>>> >> +    sub          r5, r6
>>>> >
>>>> > the *7 with lea + sub can maybe be changed to an add into the *8 case and a
>>>> > subtract (replacing lea by add)
>>>> >
>>>> >> +    movzx     e_reg, byte [r0+r2  -1]
>>>> >> +    movzx        r6, byte [r3       ]
>>>> >> +    sub          r6, e_reg
>>>> >> +    lea          r5, [r5+r6*4]
>>>> >> +    lea          r5, [r5+r6*2]
>>>> >
>>>> > this could add into the *4 and *2 cases to replace the 2 leas by 2 adds,
>>>> > or to lea *2 into the *3 case, reducing the 2 leas to 1.
>>>> > Similar tricks may be possible elsewhere.
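
For orientation: assuming the quoted assembly implements the usual H.264 16x16 plane predictor, the multipliers 1 through 8 seen above are the row weights of the vertical gradient taken over the column to the left of the block. The Python sketch below is an illustration only and not part of the patch; the function name and the 17-entry indexing convention are made up for this example.

```python
def plane_vertical_gradient(left):
    """Weighted vertical gradient of the left neighbour column.

    Hypothetical convention: left[i] is the pixel in column -1 of row i-1,
    so left[0] is the top-left corner and left[16] is row 15.
    """
    if len(left) != 17:
        raise ValueError("expected 17 samples (rows -1 through 15)")
    # Weight k pairs row 7+k with row 7-k, for k = 1..8.
    return sum(k * (left[8 + k] - left[8 - k]) for k in range(1, 9))
```

A flat column gives a gradient of 0, and every pair (row 7+k, row 7-k) contributes k times its difference, which is why the assembly needs the multipliers *1 up to *8.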
>>>>
>>>> I didn't quite get these two, what exactly would you like me to try?
>>>
>>> a+=8*c
>>> a+=8*b
>>> a-=b
>>>
>>> to
>>>
>>> c+=b
>>> a+=8*c
>>> a-=b
>>>
>>> ----
>>> a+=2*b
>>> a+=b
>>> a+=2*c
>>> a+=4*c
>>>
>>> to
>>>
>>> b+=2*c
>>> a+=2*b
>>> a+=b
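
The two rewrites above can be checked mechanically. The following Python sketch is an illustration only (it is not part of the patch, and the function names are made up); a is the accumulator and b, c are the two pixel differences, as in the pseudocode.

```python
def sum_7b_8c_original(a, b, c):
    a += 8 * c
    a += 8 * b
    a -= b
    return a

def sum_7b_8c_merged(a, b, c):
    c += b        # fold b into c first
    a += 8 * c    # a single scaled add now covers both terms
    a -= b
    return a

def sum_3b_6c_original(a, b, c):
    a += 2 * b
    a += b
    a += 2 * c
    a += 4 * c
    return a

def sum_3b_6c_merged(a, b, c):
    b += 2 * c    # fold 2*c into b first
    a += 2 * b
    a += b
    return a

def rewrites_agree(a, b, c):
    """True if both rewrites give the same result as the originals."""
    first = sum_7b_8c_original(a, b, c) == sum_7b_8c_merged(a, b, c) == a + 7 * b + 8 * c
    second = sum_3b_6c_original(a, b, c) == sum_3b_6c_merged(a, b, c) == a + 3 * b + 6 * c
    return first and second
```

Both pairs reduce to the same closed forms, a + 7*b + 8*c and a + 3*b + 6*c, so each rewrite trades one scaled lea for a plain add without changing the result.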
>>
>> OK, new patch attached. The caveat here is that on x86-32 I don't
>> think I have enough registers (I could do it in a strictly linear
>> way, but then I'm afraid that'd make it slower on Atom or so), so
>> I only did this on x86-64. A little spaghetti code, maybe... Let me
>> know if that's OK or if you prefer the linear way (that'd be saving
>> the result of the first, then using the same register for the two
>> movzx's and directly adding/subbing them from the stored register of
>> the previous two values), i.e.:
>>
>> movzx a, [val1a]
>> movzx b, [val1b]
>> sub a, b
>> sub res, a
>>
>> movzx b, [val2a]
>> add a, b
>> movzx b, [val2b]
>> sub a, b
>> lea res, [res+a*4/8]
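
A line-by-line Python model of the linear sequence above, as a sketch only: it reads "a*4/8" as "scale by 4 or by 8, depending on the pair", which is an interpretation rather than something the mail states, and the function name is made up.

```python
def linear_pair(res, val1a, val1b, val2a, val2b, scale):
    """Model of the two-register sequence; scale is assumed to be 4 or 8."""
    a = val1a           # movzx a, [val1a]
    b = val1b           # movzx b, [val1b]
    a -= b              # sub   a, b
    res -= a            # sub   res, a
    b = val2a           # movzx b, [val2a]
    a += b              # add   a, b
    b = val2b           # movzx b, [val2b]
    a -= b              # sub   a, b
    res += a * scale    # lea   res, [res+a*scale]
    return res
```

Under that reading the net effect is res + (scale-1)*(val1a-val1b) + scale*(val2a-val2b), i.e. weights 7 and 8 for scale 8, or 3 and 4 for scale 4, using only two temporaries besides res.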
>>
>> As for performance, the second suggestion saved several cycles; the
>> first didn't really have an effect (0.2 cycles faster, i.e. probably
>> noise). I also added an ALIGN 16 to the 8x8. Otherwise unchanged;
>> make fate-h264 still passes on both x86-64 and x86-32 (which is
>> basically unchanged).
>
> $(ping + 3-day apply threat).

Applied.

Ronald


