[FFmpeg-devel] [PATCH] MMX VP3 Loop Filter
David Conrad
lessen42
Sun Oct 12 20:24:48 CEST 2008
On Oct 12, 2008, at 5:51 AM, Michael Niedermayer wrote:
> On Sat, Oct 11, 2008 at 08:40:23PM -0400, David Conrad wrote:
>> On Oct 11, 2008, at 6:03 AM, Michael Niedermayer wrote:
>>
>>> On Sat, Oct 11, 2008 at 04:53:24AM -0400, David Conrad wrote:
>>>> On Oct 8, 2008, at 1:59 AM, David Conrad wrote:
>>>>
>>>>> On Oct 7, 2008, at 5:43 AM, Jason Garrett-Glaser wrote:
>>>>>
>>>>>>> Here's an 8-bit version. However, checking for the C fallback negates
>>>>>>> the small speedup on my Penryn compared to the 16-bit version.
>>>>>>
>>>>>> Most of the code is still 16-bit. Are you sure this can't be done
>>>>>> x264-style with emulation of extra bits and 8-bit math (reference for
>>>>>> an example of how to do this: common/x86/deblock-a.asm in x264 tree)?
>>>>>> This would eliminate the need for all unpacks, all packs, and all
>>>>>> multiplication, and probably increase speed dramatically. I strongly
>>>>>> suspect that it can be done, as the deblocking formulas seem very
>>>>>> similar to those used in H.264.
>>>>>
>>>>> It seems like you're right; the only difference between DEBLOCK_P0_Q0
>>>>> and VP3 is a *3 vs. a *4 in H.264.
>>>>> I don't quite fully understand x264's implementation, so it'll take
>>>>> another bit to adapt it.
>>>>
>>>> And here's an entirely 8-bit implementation. ~3 cycles faster than the
>>>> last patch I posted.
>>>
>>>> I'm not sure the best way to avoid the duplication of ff_pb_1/3/7
>>>> constants; there aren't enough registers to pass the address of all of
>>>> the constants I need.
>>>
>>> try MANGLE()
>>
>> Done.
>>
>>> [...]
>>>> +\
>>>> + "movd "#flim", %%mm5 \n\t" \
>>>> + "punpcklbw %%mm5, %%mm5 \n\t" \
>>>
>>> you could pass the thing from mm5 at the end of the bounding_values array,
>>> this also would make filter_limit unneeded, avoid the *0x02020202 and the
>>> punpcklbw
>>
>> Done.
>>
>
> [...]
>> @@ -86,6 +88,20 @@ extern const double ff_pd_2[2];
>> SBUTTERFLY(a,c,d,dq,q) /* a=aeim d=bfjn */\
>> SBUTTERFLY(t,b,c,dq,q) /* t=cgko c=dhlp */
>>
>> +#define TRANSPOSE8x4(a,b,c,d,e,f,g,h)\
>> + "punpcklbw " #e ", " #a " \n\t" /* a0 e0 a1 e1 a2 e2 a3 e3 */\
>> + "punpcklbw " #f ", " #b " \n\t" /* b0 f0 b1 f1 b2 f2 b3 f3 */\
>> + "punpcklbw " #g ", " #c " \n\t" /* c0 g0 c1 g1 c2 g2 c3 g3 */\
>> + "punpcklbw " #h ", " #d " \n\t" /* d0 h0 d1 h1 d2 h2 d3 h3 */\
>> + SBUTTERFLY(a, b, e, bw, q) /* a= a0 b0 e0 f0 a1 b1 e1 f1 */\
>> + /* e= a2 b2 e2 f2 a3 b3 e3 f3 */\
>> + SBUTTERFLY(c, d, b, bw, q) /* c= c0 d0 g0 h0 c1 d1 g1 h1 */\
>> + /* b= c2 d2 g2 h2 c3 d3 g3 h3 */\
>> + SBUTTERFLY(a, c, d, wd, q) /* a= a0 b0 c0 d0 e0 f0 g0 h0 */\
>> + /* d= a1 b1 c1 d1 e1 f1 g1 h1 */\
>> + SBUTTERFLY(e, b, c, wd, q) /* e= a2 b2 c2 d2 e2 f2 g2 h2 */\
>> + /* c= a3 b3 c3 d3 e3 f3 g3 h3 */
>
> I don't know if it would be faster, but punpcklbw could read from memory,
> making a separate movq unneeded
It was a little faster, so I changed the macro to allow for memory
inputs.
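
For anyone following the macro, here is a scalar C sketch of the result
TRANSPOSE8x4 is meant to produce, going by the comments in the patch: eight
4-byte rows (a..h) in, four 8-byte rows out. The name transpose8x4_ref and
the array layout are made up for illustration; this is not code from the
patch.

```c
#include <stdint.h>

/* Hypothetical scalar reference, not part of the patch.
 * in[r][c]  is byte c of input row r (rows a..h, 4 bytes each).
 * out[c][r] is byte r of output row c (4 rows, 8 bytes each),
 * e.g. out[0] = a0 b0 c0 d0 e0 f0 g0 h0. */
static void transpose8x4_ref(uint8_t in[8][4], uint8_t out[4][8])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 4; c++)
            out[c][r] = in[r][c];
}
```

Comparing the MMX output against something like this on random input is a
cheap way to check the register comments.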
> [...]
>> +void ff_vp3_v_loop_filter_mmx(uint8_t *src, int stride, int *bounding_values)
>> +{
>
>> + if (bounding_values[129] > 63*0x02020202) {
>> + ff_vp3_v_loop_filter_c(src, stride, bounding_values);
>> + return;
>> + }
>
> it would be faster to not do this in the inner loop, though it would be
> less clean ...
How about only using the MMX version when CODEC_FLAG_BITEXACT is not set?
Even with filter_limit > 63, the calculations are only off by one or two in
extreme cases (filter values of 127 and 128 are not differentiated from
126).
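
Roughly what I have in mind, as a sketch only: pick the function pointer once
at init time instead of testing the limit on every call. FLAG_BITEXACT,
select_v_loop_filter and the two stubs below are placeholders for
illustration, not the real dsputil names.

```c
#include <stdint.h>

#define FLAG_BITEXACT 0x1 /* hypothetical stand-in for CODEC_FLAG_BITEXACT */

typedef void (*loop_filter_fn)(uint8_t *src, int stride, int *bounding_values);

/* Empty stubs standing in for the real C and MMX loop filters. */
static void loop_filter_c_stub(uint8_t *src, int stride, int *bounding_values)
{
    (void)src; (void)stride; (void)bounding_values;
}

static void loop_filter_mmx_stub(uint8_t *src, int stride, int *bounding_values)
{
    (void)src; (void)stride; (void)bounding_values;
}

/* Choose the filter once: the MMX version is used only when the CPU has MMX
 * and the caller did not ask for bit-exact output. */
static loop_filter_fn select_v_loop_filter(int flags, int have_mmx)
{
    if (have_mmx && !(flags & FLAG_BITEXACT))
        return loop_filter_mmx_stub;
    return loop_filter_c_stub;
}
```

That keeps the fallback check out of the per-block path entirely.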
> Except for these, I am fine with the patch.
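
For context on the *3 vs. *4 remark quoted near the top of this message, the
two raw filter deltas can be sketched in scalar C as below. This is written
from my understanding of the two filters, not copied from either codebase;
the function names are made up, and p1 p0 | q0 q1 are the four pixels across
the edge. VP3 then maps its value through the bounding_values table, while
H.264 clips to +/-tc; both steps are left out here.

```c
/* Hypothetical helpers for illustration only.
 * Both assume >> on a negative int is an arithmetic shift, which holds on
 * the usual compilers but is implementation-defined in C. */

/* VP3: centre difference weighted by 3. */
static int vp3_delta(int p1, int p0, int q0, int q1)
{
    return ((p1 - q1) + 3 * (q0 - p0) + 4) >> 3;
}

/* H.264 normal-strength filter: centre difference weighted by 4. */
static int h264_delta(int p1, int p0, int q0, int q1)
{
    return ((p1 - q1) + 4 * (q0 - p0) + 4) >> 3;
}
```

In both cases the (mapped or clipped) value is added to p0 and subtracted
from q0, with the results clamped to 0..255.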
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: vp3-mmx-loop-filter-6.txt
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20081012/1b788891/attachment.txt>