[FFmpeg-devel] [PATCH] h264pred16x16 plane sse2/ssse3 optimizations

Sun Oct 3 05:37:14 CEST 2010

Hi,

On Thu, Sep 30, 2010 at 10:08 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> On Wed, Sep 29, 2010 at 9:17 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> On Wed, Sep 29, 2010 at 08:56:13PM -0400, Ronald S. Bultje wrote:
>>> On Wed, Sep 29, 2010 at 8:51 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>>> > On Tue, Sep 28, 2010 at 10:31:51PM -0400, Ronald S. Bultje wrote:
>>> >> +    lea          r4, [r0+r2*8-1]
>>> >> +    lea          r3, [r0+r2*4-1]
>>> >> +    add          r4, r2
>>> >> +
>>> >> +%ifdef ARCH_X86_64
>>> >> +%define e_reg r11
>>> >> +%else
>>> >> +%define e_reg r0
>>> >> +%endif
>>> >> +
>>> >
>>> > I see a lot of r0-1; maybe r0 could be decreased by 1 somewhere?
>>>
>>> Yes, this is actually both smaller/simpler and also faster. Changed.
>>>
>>> >> +    movzx     e_reg, byte [r3+r1    ]
>>> >> +    movzx        r5, byte [r4+r2*2  ]
>>> >> +    sub          r5, e_reg
>>> >> +    shl          r5, 2
>>> >> +
>>> >> +    movzx     e_reg, byte [r3       ]
>>> >> +    movzx        r6, byte [r4+r2    ]
>>> >> +    sub          r6, e_reg
>>> >> +    lea          r5, [r5+r6*4]
>>> >> +    sub          r5, r6
>>> >> +
>>> >> +    movzx     e_reg, byte [r3+r2    ]
>>> >> +    movzx        r6, byte [r4       ]
>>> >> +    sub          r6, e_reg
>>> >> +    lea          r5, [r5+r6*2]
>>> >> +
>>> >> +    movzx     e_reg, byte [r3+r2*2  ]
>>> >> +    movzx        r6, byte [r4+r1    ]
>>> >> +    sub          r6, e_reg
>>> >> +    add          r5, r6
>>> >
>>> > this and the shl 2 case look like they could be merged like
>>> > add+shl->lea
>>>
>>> Also changed.
>>>
>>> >> +    lea          r3, [r4+r2*4  ]
>>> >> +
>>> >> +    movzx     e_reg, byte [r0+r1  -1]
>>> >> +    movzx        r6, byte [r3+r2*2  ]
>>> >> +    sub          r6, e_reg
>>> >> +    lea          r5, [r5+r6*8]
>>> >> +
>>> >> +    movzx     e_reg, byte [r0     -1]
>>> >> +    movzx        r6, byte [r3+r2    ]
>>> >> +    sub          r6, e_reg
>>> >> +    lea          r5, [r5+r6*8]
>>> >> +    sub          r5, r6
>>> >
>>> > the *7 with lea + sub can maybe be changed to an add into the *8 case and a
>>> > subtract (replacing lea by add)
>>> >
>>> >> +    movzx     e_reg, byte [r0+r2  -1]
>>> >> +    movzx        r6, byte [r3       ]
>>> >> +    sub          r6, e_reg
>>> >> +    lea          r5, [r5+r6*4]
>>> >> +    lea          r5, [r5+r6*2]
>>> >
>>> > this could add into *4 and *2 cases to replace the 2 leas by 2 adds
>>> > or to leas *2 into the *3 case reducing the 2 leas to 1
>>> > similar tricks may be possible elsewhere
>>>
>>> I didn't quite get these two, what exactly would you like me to try?
>>
>> a+=8*c
>> a+=8*b
>> a-=b
>>
>> to
>>
>> c+=b
>> a+=8*c
>> a-=b
>>
>> ----
>> a+=2*b
>> a+=b
>> a+=2*c
>> a+=4*c
>>
>> to
>>
>> b+=2*c
>> a+=2*b
>> a+=b
>
> OK, new patch attached. The caveat here is that on x86-32 I don't
> think I have enough registers (I could do it in a really linear
> path-way but then I'm afraid that'd make it slower on Atom or so), so
> I only did this on x86-64. A little spaghetti-code maybe... Let me
> know if that's OK or if you prefer the linear-way (that'd be saving
> the result of the first, then use the same register for the two
> movzx's and directly adding/subbing them from the stored register of
> the previous two values), i.e.:
>
> movzx a, [val1a]
> movzx b, [val1b]
> sub a, b
> sub res, a
>
> movzx b, [val2a]
> add a, b
> movzx b, [val2b]
> sub a, b
> lea res, [res+a*4/8]
>
> As for performance, the second suggestion saved several cycles, the
> first didn't really have an effect (0.2 cycle faster, i.e. probably
> noise). I also added an ALIGN 16 to the 8x8. Otherwise unchanged. Make
> fate-h264 still passes on both x86-64 and x86-32 (which is basically
> unchanged).

$(ping + 3-day apply threat).

Ronald
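
For readers following the quoted exchange: the two rewrites Michael
suggested, and the "linear" two-register sequence sketched above, are
plain algebraic regroupings of the same weighted sums (8*c + 7*b and
3*b + 6*c). The C fragment below is only an illustrative sketch, not
code from the patch. All function names (sum87_orig, sum87_merged,
sum36_orig, sum36_merged, sum87_linear, check_rewrites) are
hypothetical, and the assumption is that b and c stand for differences
of two pixel bytes, i.e. values in -255..255.

```c
#include <assert.h>

/* First rewrite, as originally written: a + 8*c + 7*b. */
static int sum87_orig(int a, int b, int c)
{
    a += 8 * c;
    a += 8 * b;
    a -= b;
    return a;
}

/* First rewrite, merged form: fold b into c before the single *8 step. */
static int sum87_merged(int a, int b, int c)
{
    c += b;
    a += 8 * c;
    a -= b;
    return a;
}

/* Second rewrite, as originally written: a + 3*b + 6*c. */
static int sum36_orig(int a, int b, int c)
{
    a += 2 * b;
    a += b;
    a += 2 * c;
    a += 4 * c;
    return a;
}

/* Second rewrite, merged form: fold 2*c into b, then one *3 step. */
static int sum36_merged(int a, int b, int c)
{
    b += 2 * c;
    a += 2 * b;
    a += b;
    return a;
}

/* The "linear" variant: only two scratch values (a, b) plus the
 * accumulator res, mirroring the movzx/sub/add/lea sequence with *8. */
static int sum87_linear(int res, int v1a, int v1b, int v2a, int v2b)
{
    int a, b;
    a = v1a;            /* movzx a, [val1a] */
    b = v1b;            /* movzx b, [val1b] */
    a -= b;             /* sub   a, b       */
    res -= a;           /* sub   res, a     */
    b = v2a;            /* movzx b, [val2a] */
    a += b;             /* add   a, b       */
    b = v2b;            /* movzx b, [val2b] */
    a -= b;             /* sub   a, b       */
    res += a * 8;       /* lea   res, [res+a*8] */
    return res;
}

/* Exhaustively compare the forms over every byte-difference pair.
 * Returns the number of (b, c) pairs checked. */
static int check_rewrites(void)
{
    int checked = 0;
    for (int b = -255; b <= 255; b++) {
        for (int c = -255; c <= 255; c++) {
            assert(sum87_orig(100, b, c) == sum87_merged(100, b, c));
            assert(sum36_orig(100, b, c) == sum36_merged(100, b, c));
            checked++;
        }
    }
    return checked;
}
```

Calling check_rewrites() from a test program exercises every pair of
byte differences. sum87_linear(res, v1a, v1b, v2a, v2b) yields
res + 7*(v1a-v1b) + 8*(v2a-v2b), the same sum as the first rewrite
with b = v1a-v1b and c = v2a-v2b.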