[FFmpeg-devel] [PATCH] MMX VP3 Loop Filter

Fri Oct 17 05:14:08 CEST 2008

On Oct 15, 2008, at 6:02 AM, Michael Niedermayer wrote:

> On Sun, Oct 12, 2008 at 02:24:48PM -0400, David Conrad wrote:
>> On Oct 12, 2008, at 5:51 AM, Michael Niedermayer wrote:
>>
>>> On Sat, Oct 11, 2008 at 08:40:23PM -0400, David Conrad wrote:
>>>> On Oct 11, 2008, at 6:03 AM, Michael Niedermayer wrote:
>>>>
>>>>> On Sat, Oct 11, 2008 at 04:53:24AM -0400, David Conrad wrote:
>>>>>> On Oct 8, 2008, at 1:59 AM, David Conrad wrote:
>>>>>>
>>>>>>> On Oct 7, 2008, at 5:43 AM, Jason Garrett-Glaser wrote:
>>>>>>>
>>>>>>>>> Here's an 8-bit version. However, checking for the C fallback
>>>>>>>>> negates the small speedup on my Penryn compared to the 16-bit
>>>>>>>>> version.
>>>>>>>>
>>>>>>>> Most of the code is still 16-bit.  Are you sure this can't be done
>>>>>>>> x264-style with emulation of extra bits and 8-bit math (reference
>>>>>>>> for an example of how to do this: common/x86/deblock-a.asm in the
>>>>>>>> x264 tree)?  This would eliminate the need for all unpacks, all
>>>>>>>> packs, and all multiplication, and probably increase speed
>>>>>>>> dramatically.  I strongly suspect that it can be done, as the
>>>>>>>> deblocking formulas seem very similar to those used in H.264.
>>>>>>>
>>>>>>> It seems like you're right; the only difference between
>>>>>>> DEBLOCK_P0_Q0 and VP3 is a *3 vs. a *4 in H.264.
>>>>>>> I don't quite fully understand x264's implementation, so it'll take
>>>>>>> another bit to adapt it.
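
For readers following the "*3 vs. *4" remark, here is a minimal scalar sketch of the two delta computations. It assumes the usual VP3-style bounding function (identity below the limit, a linear ramp back to zero between the limit and twice the limit) and omits H.264's own clipping of the delta. The names bound, vp3_delta, h264_delta_unclipped and vp3_filter_edge are hypothetical and do not come from the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: VP3-style bounding function for limit flim.
 * |v| < flim        -> v
 * flim <= |v| < 2f  -> ramps linearly back towards 0
 * otherwise         -> 0 */
static int bound(int v, int flim)
{
    int a = abs(v);
    if (a < flim)
        return v;
    if (a < 2 * flim)
        return v < 0 ? a - 2 * flim : 2 * flim - a;
    return 0;
}

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* VP3-style delta: the inner difference (q0 - p0) is weighted by 3.
 * Note: >> on a negative int is assumed to be an arithmetic shift. */
static int vp3_delta(int p1, int p0, int q0, int q1, int flim)
{
    int v = (p1 - q1) + 3 * (q0 - p0);
    return bound((v + 4) >> 3, flim);
}

/* H.264-style delta before its own clipping: weight 4 instead of 3. */
static int h264_delta_unclipped(int p1, int p0, int q0, int q1)
{
    return (4 * (q0 - p0) + (p1 - q1) + 4) >> 3;
}

/* Filter one edge position: px = { p1, p0, q0, q1 } across the edge. */
static void vp3_filter_edge(uint8_t px[4], int flim)
{
    int d = vp3_delta(px[0], px[1], px[2], px[3], flim);
    px[1] = clip_u8(px[1] + d);
    px[2] = clip_u8(px[2] - d);
}
```

With a limit of 10 and pixels 100, 100, 108, 108 across the edge, the VP3-style delta is 2 and the unclipped H.264-style delta is 3.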
>>>>>>
>>>>>> And here's an entirely 8-bit implementation. ~3 cycles faster than
>>>>>> the last patch I posted.
>>>>>
>>>>>> I'm not sure of the best way to avoid the duplication of the
>>>>>> ff_pb_1/3/7 constants; there aren't enough registers to pass the
>>>>>> address of all of the constants I need.
>>>>>
>>>>> try MANGLE()
>>>>
>>>> Done.
>>>>
>>>>> [...]
>>>>>> +\
>>>>>> +    "movd     "#flim", %%mm5 \n\t" \
>>>>>> +    "punpcklbw  %%mm5, %%mm5 \n\t" \
>>>>>
>>>>> you could pass the thing from mm5 at the end of the bounding_values
>>>>> array, this also would make filter_limit unneeded, avoid the
>>>>> *0x02020202 and the punpcklbw
>>>>
>>>> Done.
>>>>
>>>
>>> [...]
>>>> @@ -86,6 +88,20 @@ extern const double ff_pd_2[2];
>>>>    SBUTTERFLY(a,c,d,dq,q) /* a=aeim d=bfjn */\
>>>>    SBUTTERFLY(t,b,c,dq,q) /* t=cgko c=dhlp */
>>>>
>>>> +#define TRANSPOSE8x4(a,b,c,d,e,f,g,h)\
>>>> +    "punpcklbw " #e ", " #a " \n\t" /* a0 e0 a1 e1 a2 e2 a3 e3 */\
>>>> +    "punpcklbw " #f ", " #b " \n\t" /* b0 f0 b1 f1 b2 f2 b3 f3 */\
>>>> +    "punpcklbw " #g ", " #c " \n\t" /* c0 g0 c1 g1 c2 g2 c3 g3 */\
>>>> +    "punpcklbw " #h ", " #d " \n\t" /* d0 h0 d1 h1 d2 h2 d3 h3 */\
>>>> +    SBUTTERFLY(a, b, e, bw, q)   /* a= a0 b0 e0 f0 a1 b1 e1 f1 */\
>>>> +                                 /* e= a2 b2 e2 f2 a3 b3 e3 f3 */\
>>>> +    SBUTTERFLY(c, d, b, bw, q)   /* c= c0 d0 g0 h0 c1 d1 g1 h1 */\
>>>> +                                 /* b= c2 d2 g2 h2 c3 d3 g3 h3 */\
>>>> +    SBUTTERFLY(a, c, d, wd, q)   /* a= a0 b0 c0 d0 e0 f0 g0 h0 */\
>>>> +                                 /* d= a1 b1 c1 d1 e1 f1 g1 h1 */\
>>>> +    SBUTTERFLY(e, b, c, wd, q)   /* e= a2 b2 c2 d2 e2 f2 g2 h2 */\
>>>> +                                 /* c= a3 b3 c3 d3 e3 f3 g3 h3 */
>>>
>>> i don't know if it would be faster but punpcklbw could read from memory
>>> making separate movq unneeded
>>
>> It was a little faster, so I changed the macro to allow for memory
>> inputs.
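
As a cross-check for the comments in TRANSPOSE8x4, here is a scalar reference of the data movement the macro is annotated as performing: eight 4-byte input rows (a..h) become four 8-byte output rows. This is only a sketch for testing against; transpose8x4_ref is a hypothetical name, not part of the patch.

```c
#include <stdint.h>

/* Scalar reference for an 8x4 -> 4x8 byte transpose.
 * in[r][c]  is byte c of input row r (rows a..h map to r = 0..7).
 * out[c][r] gathers byte c of every input row, so out[0] holds
 * a0 b0 c0 d0 e0 f0 g0 h0, matching the macro's final comments. */
static void transpose8x4_ref(const uint8_t in[8][4], uint8_t out[4][8])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 4; c++)
            out[c][r] = in[r][c];
}
```

Feeding the same 8x4 block to this function and to the MMX path, then comparing the four output rows, is one way to catch a wrong register in the SBUTTERFLY sequence.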
>>
>>> [...]
>>>> +void ff_vp3_v_loop_filter_mmx(uint8_t *src, int stride, int *bounding_values)
>>>> +{
>>>
>>>> +    if (bounding_values[129] > 63*0x02020202) {
>>>> +        ff_vp3_v_loop_filter_c(src, stride, bounding_values);
>>>> +        return;
>>>> +    }
>>>
>>> it would be faster to not do this in the inner loop, though it would be
>>> less clean ...
>>
>> How about putting the mmx under CODEC_FLAG_BITEXACT? Even with
>> filter_limit > 63, the calculations are only off by one or two in
>> extreme cases (filter values of 127,128 are not differentiated from
>> 126.)
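
To make the 63*0x02020202 threshold concrete: the quoted check suggests that the entry at the end of bounding_values holds twice the filter limit replicated into four bytes. Below is a minimal sketch under that assumption; pack_filter_limit and needs_c_fallback are hypothetical names. It uses unsigned arithmetic because 64 * 0x02020202 no longer fits in a signed 32-bit int.

```c
#include <assert.h>
#include <stdint.h>

/* Replicate 2 * filter_limit into each of the four bytes of a word.
 * Assumes 0 <= filter_limit <= 127 so that every byte stays <= 254. */
static uint32_t pack_filter_limit(int filter_limit)
{
    assert(filter_limit >= 0 && filter_limit <= 127);
    return (uint32_t)filter_limit * 0x02020202u;
}

/* A limit above 63 puts 128 or more in each byte, which no longer
 * fits in a signed 8-bit lane; this mirrors the quoted comparison. */
static int needs_c_fallback(uint32_t packed)
{
    return packed > 63u * 0x02020202u;
}
```

For a limit of 63 each byte is 0x7E (126), the largest even value that still fits in a signed byte; a limit of 64 gives 0x80 in every byte and trips the check.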
>
> ok

Applied.

>
>>
>>> except these i am fine with the patch
>>
>
> [...]
> -- 
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
>> ... defining _GNU_SOURCE...
> For the love of all that is holy, and some that is not, don't do that.
> -- Luca & Mans