[FFmpeg-devel] [PATCH] VP8 luma(16) inner-MB H/V loopfilter MMX/SSE2

Mon Jul 12 01:14:57 CEST 2010

On Sun, Jul 11, 2010 at 4:02 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> Hi Eli,
>
> On Sun, Jul 11, 2010 at 2:20 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
>> On Sun, Jul 11, 2010 at 8:53 AM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>>> as per $subj. All tested to be identical to C reference. If wanted, I
>>> can try to share parts of the filter code with the simple loopfilter,
>>> but I'm a little scared that it'll turn into massive spaghetti so I
>>> didn't do it yet.
>>
>> +    mova             m4, m1
>> +    SWAP              4, 1
>>
>> This pattern seems to be repeated a lot... I fail to see the point.
>> Swapping two registers with the same contents doesn't do anything
>> significant.
>
> What I've been told is that for a mova x, y, the x (dest) is available
> one cycle after the source, so using y (src) directly after is
> preferred over using x directly after. In this function, I'm trying to
> organize it such that m2-m5 are (at least in the source) consistently
> referring to p1/p0/q0/q1, hence the SWAP.
>
>> For the following:
>> +    mova             m4, [rsp+mmsize]
>> +    pxor             m3, m3
>> +    psubusb          m0, m4
>> +    psubusb          m1, m4
>> +    psubusb          m7, m4
>> +    psubusb          m6, m4
>> +    pcmpeqb          m0, m3        ; abs(p3-p2) <= I
>> +    pcmpeqb          m1, m3        ; abs(p2-p1) <= I
>> +    pcmpeqb          m7, m3        ; abs(q3-q2) <= I
>> +    pcmpeqb          m6, m3        ; abs(q2-q1) <= I
>> +    pand             m0, m1
>> +    pand             m7, m6
>> +    pand             m0, m7
>>
>> The following should be faster with mmxext/sse2:
>>
>>    mova             m4, [rsp+mmsize]
>>    pxor             m3, m3
>>    pmaxub           m0, m1
>>    pmaxub           m6, m7
>>    pmaxub           m0, m6
>>    psubusb          m0, m4
>>    pcmpeqb          m0, m3
>
> Indeed, and I've extended that a bit also, that's quite a big win, >10 cycles.
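
As a rough scalar illustration of why the pmaxub form gives the same mask: all four per-byte tests abs(x) <= I hold exactly when the largest of the four differences is <= I, so one psubusb/pcmpeqb on the maximum can stand in for four tests ANDed together. The C sketch below models a single byte lane. The helper names (subus8, eqz8, max8, mask_separate, mask_max, verify_mask_equivalence) are made up for this example and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of psubusb: unsigned subtract, saturating at 0. */
static uint8_t subus8(uint8_t a, uint8_t b) { return a > b ? (uint8_t)(a - b) : 0; }

/* One lane of pcmpeqb against a zero register: 0xFF if x == 0, else 0x00. */
static uint8_t eqz8(uint8_t x) { return x == 0 ? 0xFF : 0x00; }

/* One lane of pmaxub: unsigned maximum. */
static uint8_t max8(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* Original form: four saturating subtracts, four compares, three ANDs. */
uint8_t mask_separate(uint8_t d0, uint8_t d1, uint8_t d2, uint8_t d3, uint8_t lim)
{
    return eqz8(subus8(d0, lim)) & eqz8(subus8(d1, lim)) &
           eqz8(subus8(d2, lim)) & eqz8(subus8(d3, lim));
}

/* Suggested form: three maxes, one saturating subtract, one compare. */
uint8_t mask_max(uint8_t d0, uint8_t d1, uint8_t d2, uint8_t d3, uint8_t lim)
{
    uint8_t m = max8(max8(d0, d1), max8(d2, d3));
    return eqz8(subus8(m, lim));
}

/* Compare both forms over every limit and a coarse grid of differences. */
void verify_mask_equivalence(void)
{
    for (int lim = 0; lim < 256; lim++)
        for (int d0 = 0; d0 < 256; d0 += 17)
            for (int d1 = 0; d1 < 256; d1 += 17)
                for (int d2 = 0; d2 < 256; d2 += 51)
                    for (int d3 = 0; d3 < 256; d3 += 51)
                        assert(mask_separate((uint8_t)d0, (uint8_t)d1, (uint8_t)d2,
                                             (uint8_t)d3, (uint8_t)lim) ==
                               mask_max((uint8_t)d0, (uint8_t)d1, (uint8_t)d2,
                                        (uint8_t)d3, (uint8_t)lim));
}
```

Calling verify_mask_equivalence() from a small test program checks the two forms against each other on a sampled grid. It is a sanity check, not a proof.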
>
>> +    mova             m6, [rsp+mmsize*3]
>> +    pxor             m7, m7
>> +    pand             m0, m6
>> +    pand             m1, m6
>> +    pavgb            m0, m7        ; a
>> +    psubusb          m1, [pb_1]
>> +    pavgb            m1, m7        ; -a
>> +    psubusb          m5, m0
>> +    paddusb          m5, m1        ; q1-a
>> +    psubusb          m2, m1
>> +    paddusb          m2, m0        ; p1+a
>>
>> pavgb is mmxext/sse2 only.
>
> Indeed again, I've replaced the MMX version with slightly slower code.
>
> New version attached, still bitexact for everything, thanks for the
> comments so far.
>
> Ronald.

FYI, pavgusb is 3dnow and does the exact same thing as pavgb, so you
can template a 3dnow version by %defining pavgb accordingly.
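
For reference, both instructions compute a rounded-up average, (a + b + 1) >> 1, on each unsigned byte, so averaging against a zero register as in the quoted snippet yields (x + 1) >> 1. Below is a one-lane C model, plus a well-known bitwise identity that gives the same result without an averaging instruction. This is only an illustration and not necessarily what the revised MMX path does; the function names (avg8, avg8_bitwise, verify_avg_identity) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of pavgb / pavgusb: average of two unsigned bytes, rounded up. */
uint8_t avg8(uint8_t a, uint8_t b)
{
    /* Operands are promoted to int, so the sum cannot overflow. */
    return (uint8_t)((a + b + 1) >> 1);
}

/* Same result from bitwise ops only: (a | b) - ((a ^ b) >> 1). */
uint8_t avg8_bitwise(uint8_t a, uint8_t b)
{
    return (uint8_t)((a | b) - ((a ^ b) >> 1));
}

/* Exhaustively compare the two forms over all 65536 byte pairs. */
void verify_avg_identity(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(avg8((uint8_t)a, (uint8_t)b) ==
                   avg8_bitwise((uint8_t)a, (uint8_t)b));
}
```

With b fixed at 0, avg8(x, 0) is x halved with rounding up, which is the value the quoted code labels "a".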

Dark Shikari