[FFmpeg-devel] [PATCH] MMX VP3 Loop Filter

Michael Niedermayer michaelni
Wed Oct 15 12:02:08 CEST 2008


On Sun, Oct 12, 2008 at 02:24:48PM -0400, David Conrad wrote:
> On Oct 12, 2008, at 5:51 AM, Michael Niedermayer wrote:
>
>> On Sat, Oct 11, 2008 at 08:40:23PM -0400, David Conrad wrote:
>>> On Oct 11, 2008, at 6:03 AM, Michael Niedermayer wrote:
>>>
>>>> On Sat, Oct 11, 2008 at 04:53:24AM -0400, David Conrad wrote:
>>>>> On Oct 8, 2008, at 1:59 AM, David Conrad wrote:
>>>>>
>>>>>> On Oct 7, 2008, at 5:43 AM, Jason Garrett-Glaser wrote:
>>>>>>
>>>>>>>> Here's an 8-bit version. However, checking for the C fallback negates
>>>>>>>> the small speedup on my Penryn compared to the 16-bit version.
>>>>>>>
>>>>>>> Most of the code is still 16-bit.  Are you sure this can't be done
>>>>>>> x264-style with emulation of extra bits and 8-bit math (reference for
>>>>>>> an example of how to do this: common/x86/deblock-a.asm in x264 tree)?
>>>>>>> This would eliminate the need for all unpacks, all packs, and all
>>>>>>> multiplication, and probably increase speed dramatically.  I strongly
>>>>>>> suspect that it can be done, as the deblocking formulas seem very
>>>>>>> similar to those used in H.264.
>>>>>>
>>>>>> It seems like you're right; the only difference between DEBLOCK_P0_Q0
>>>>>> and VP3 is a *3 vs. a *4 in H.264.
>>>>>> I don't quite fully understand x264's implementation, so it'll take
>>>>>> another bit to adapt it.
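For readers following the "*3 vs. *4" remark, here is a minimal scalar sketch of the two raw edge deltas, before any clipping or bounding_values lookup. The function names and the p1 p0 | q0 q1 pixel labels are illustrative and are not taken from the patch or from x264; the formulas follow the usual C reference forms of the VP3 loop filter and the H.264 normal-strength filter.

```c
/* Pixels across the edge are labelled p1 p0 | q0 q1 (illustrative names). */

/* VP3-style raw delta: the centre difference (q0 - p0) is weighted by 3.
 * In the C reference the result then indexes the bounding_values table.
 * Note: >> on a negative int is implementation-defined in C; an arithmetic
 * shift, as on common compilers, is assumed here. */
static int vp3_raw_delta(int p1, int p0, int q0, int q1)
{
    return ((p1 - q1) + 3 * (q0 - p0) + 4) >> 3;
}

/* H.264-style raw delta: the same shape, but the centre difference is
 * weighted by 4. The standard then clips the result to [-tc, tc]. */
static int h264_raw_delta(int p1, int p0, int q0, int q1)
{
    return ((p1 - q1) + 4 * (q0 - p0) + 4) >> 3;
}
```

In both cases the delta is added to p0 and subtracted from q0, so the same 8-bit averaging tricks plausibly apply once the weighting is adjusted.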
>>>>>
>>>>> And here's an entirely 8-bit implementation. ~3 cycles faster than the
>>>>> last patch I posted.
>>>>
>>>>> I'm not sure of the best way to avoid the duplication of the ff_pb_1/3/7
>>>>> constants; there aren't enough registers to pass the address of all of
>>>>> the constants I need.
>>>>
>>>> try MANGLE()
>>>
>>> Done.
>>>
>>>> [...]
>>>>> +\
>>>>> +    "movd     "#flim", %%mm5 \n\t" \
>>>>> +    "punpcklbw  %%mm5, %%mm5 \n\t" \
>>>>
>>>> you could pass the thing from mm5 at the end of the bounding_values
>>>> array, this also would make filter_limit unneeded, avoid the
>>>> *0x02020202 and the punpcklbw
>>>
>>> Done.
>>>
>>
>> [...]
>>> @@ -86,6 +88,20 @@ extern const double ff_pd_2[2];
>>>     SBUTTERFLY(a,c,d,dq,q) /* a=aeim d=bfjn */\
>>>     SBUTTERFLY(t,b,c,dq,q) /* t=cgko c=dhlp */
>>>
>>> +#define TRANSPOSE8x4(a,b,c,d,e,f,g,h)\
>>> +    "punpcklbw " #e ", " #a " \n\t" /* a0 e0 a1 e1 a2 e2 a3 e3 */\
>>> +    "punpcklbw " #f ", " #b " \n\t" /* b0 f0 b1 f1 b2 f2 b3 f3 */\
>>> +    "punpcklbw " #g ", " #c " \n\t" /* c0 g0 c1 g1 c2 g2 c3 g3 */\
>>> +    "punpcklbw " #h ", " #d " \n\t" /* d0 h0 d1 h1 d2 h2 d3 h3 */\
>>> +    SBUTTERFLY(a, b, e, bw, q)   /* a= a0 b0 e0 f0 a1 b1 e1 f1 */\
>>> +                                 /* e= a2 b2 e2 f2 a3 b3 e3 f3 */\
>>> +    SBUTTERFLY(c, d, b, bw, q)   /* c= c0 d0 g0 h0 c1 d1 g1 h1 */\
>>> +                                 /* b= c2 d2 g2 h2 c3 d3 g3 h3 */\
>>> +    SBUTTERFLY(a, c, d, wd, q)   /* a= a0 b0 c0 d0 e0 f0 g0 h0 */\
>>> +                                 /* d= a1 b1 c1 d1 e1 f1 g1 h1 */\
>>> +    SBUTTERFLY(e, b, c, wd, q)   /* e= a2 b2 c2 d2 e2 f2 g2 h2 */\
>>> +                                 /* c= a3 b3 c3 d3 e3 f3 g3 h3 */
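The byte shuffling in TRANSPOSE8x4 can be checked with a scalar model. The sketch below is an assumption-laden illustration, not code from the patch: `mm_reg`, `unpack` and `transpose8x4` are hypothetical names, and SBUTTERFLY(a,b,t,n,m) is assumed to leave the low interleave of a and b in a and the high interleave in t, as its comments in the quoted header suggest.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[8]; } mm_reg;   /* models one 64-bit MMX register */

/* Models "punpck{l,h}{bw,wd} src, dst" (AT&T operand order): interleaves
 * width-byte elements from the low (high=0) or high (high=1) half of dst
 * and src, with the dst element first. */
static mm_reg unpack(mm_reg dst, mm_reg src, int width, int high)
{
    mm_reg r;
    int half = high ? 4 : 0;
    for (int i = 0; i < 4 / width; i++) {
        memcpy(&r.b[2 * i * width],         &dst.b[half + i * width], width);
        memcpy(&r.b[2 * i * width + width], &src.b[half + i * width], width);
    }
    return r;
}

/* in[0..7] are rows a..h (only bytes 0..3 of each are used);
 * out[0..3] receive the four transposed rows of eight bytes. */
static void transpose8x4(const mm_reg in[8], mm_reg out[4])
{
    mm_reg a = unpack(in[0], in[4], 1, 0);  /* a0 e0 a1 e1 a2 e2 a3 e3 */
    mm_reg b = unpack(in[1], in[5], 1, 0);  /* b0 f0 b1 f1 b2 f2 b3 f3 */
    mm_reg c = unpack(in[2], in[6], 1, 0);  /* c0 g0 c1 g1 c2 g2 c3 g3 */
    mm_reg d = unpack(in[3], in[7], 1, 0);  /* d0 h0 d1 h1 d2 h2 d3 h3 */
    mm_reg e;

    e = unpack(a, b, 1, 1);                 /* a2 b2 e2 f2 a3 b3 e3 f3 */
    a = unpack(a, b, 1, 0);                 /* a0 b0 e0 f0 a1 b1 e1 f1 */
    b = unpack(c, d, 1, 1);                 /* c2 d2 g2 h2 c3 d3 g3 h3 */
    c = unpack(c, d, 1, 0);                 /* c0 d0 g0 h0 c1 d1 g1 h1 */
    d = unpack(a, c, 2, 1);                 /* a1 b1 c1 d1 e1 f1 g1 h1 */
    a = unpack(a, c, 2, 0);                 /* a0 b0 c0 d0 e0 f0 g0 h0 */
    c = unpack(e, b, 2, 1);                 /* a3 b3 c3 d3 e3 f3 g3 h3 */
    e = unpack(e, b, 2, 0);                 /* a2 b2 c2 d2 e2 f2 g2 h2 */

    out[0] = a; out[1] = d; out[2] = e; out[3] = c;
}
```

Note that the results end up in a, d, e, c rather than a, b, c, d, which matches the register order the macro comments describe.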
>>
>> i don't know if it would be faster but punpcklbw could read from memory
>> making separate movq unneeded
>
> It was a little faster, so I changed the macro to allow for memory inputs.
>
>> [...]
>>> +void ff_vp3_v_loop_filter_mmx(uint8_t *src, int stride, int *bounding_values)
>>> +{
>>
>>> +    if (bounding_values[129] > 63*0x02020202) {
>>> +        ff_vp3_v_loop_filter_c(src, stride, bounding_values);
>>> +        return;
>>> +    }
>>
>> it would be faster to not do this in the inner loop, though it would be
>> less clean ...
>
> How about putting the mmx under CODEC_FLAG_BITEXACT? Even with
> filter_limit > 63, the calculations are only off by one or two in extreme
> cases (filter values of 127,128 are not differentiated from 126.)

ok
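
One way to read the agreed approach is that the choice between the C and MMX filters is made once at init time instead of inside the per-edge call. The sketch below only illustrates that idea; `loop_filter_fn`, `select_loop_filter` and both stub functions are hypothetical names, and the actual patch may wire this up differently.

```c
#include <stdint.h>

typedef void (*loop_filter_fn)(uint8_t *src, int stride, int *bounding_values);

/* Hypothetical stand-ins for the C and MMX loop filters. */
static void loop_filter_c_stub(uint8_t *src, int stride, int *bounding_values)
{
    (void)src; (void)stride; (void)bounding_values;
}

static void loop_filter_mmx_stub(uint8_t *src, int stride, int *bounding_values)
{
    (void)src; (void)stride; (void)bounding_values;
}

/* Pick the filter once: the MMX version is used only when the CPU supports
 * it and bit-exact output was not requested, so no fallback check is needed
 * in the inner loop. */
static loop_filter_fn select_loop_filter(int have_mmx, int bitexact)
{
    if (have_mmx && !bitexact)
        return loop_filter_mmx_stub;
    return loop_filter_c_stub;
}
```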


>
>> except these i am fine with the patch
>

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

> ... defining _GNU_SOURCE...
For the love of all that is holy, and some that is not, don't do that.
-- Luca & Mans