[FFmpeg-devel] [PATCH] MMX VP3 Loop Filter
Michael Niedermayer
michaelni
Sun Oct 12 11:51:18 CEST 2008
On Sat, Oct 11, 2008 at 08:40:23PM -0400, David Conrad wrote:
> On Oct 11, 2008, at 6:03 AM, Michael Niedermayer wrote:
>
>> On Sat, Oct 11, 2008 at 04:53:24AM -0400, David Conrad wrote:
>>> On Oct 8, 2008, at 1:59 AM, David Conrad wrote:
>>>
>>>> On Oct 7, 2008, at 5:43 AM, Jason Garrett-Glaser wrote:
>>>>
>>>>>> Here's an 8-bit version. However, checking for the C fallback negates the
>>>>>> small speedup on my Penryn compared to the 16-bit version.
>>>>>
>>>>> Most of the code is still 16-bit. Are you sure this can't be done
>>>>> x264-style with emulation of extra bits and 8-bit math (reference for
>>>>> an example of how to do this: common/x86/deblock-a.asm in x264 tree)?
>>>>> This would eliminate the need for all unpacks, all packs, and all
>>>>> multiplication, and probably increase speed dramatically. I strongly
>>>>> suspect that it can be done, as the deblocking formulas seem very
>>>>> similar to those used in H.264.
>>>>
>>>> It seems like you're right; the only difference between DEBLOCK_P0_Q0 and
>>>> VP3 is a *3 vs. a *4 in H.264.
>>>> I don't quite fully understand x264's implementation, so it'll take
>>>> another bit to adapt it.
>>>
>>> And here's an entirely 8-bit implementation. ~3 cycles faster than the last
>>> patch I posted.
>>
>>> I'm not sure the best way to avoid the duplication of ff_pb_1/3/7
>>> constants; there aren't enough registers to pass the address of all of the
>>> constants I need.
>>
>> try MANGLE()
>
> Done.
>
>> [...]
>>> +\
>>> + "movd "#flim", %%mm5 \n\t" \
>>> + "punpcklbw %%mm5, %%mm5 \n\t" \
>>
>> you could pass the thing from mm5 at the end of the bounding_values array,
>> this also would make filter_limit unneeded, avoid the *0x02020202 and the
>> punpcklbw
>
> Done.
>
[...]
> @@ -86,6 +88,20 @@ extern const double ff_pd_2[2];
> SBUTTERFLY(a,c,d,dq,q) /* a=aeim d=bfjn */\
> SBUTTERFLY(t,b,c,dq,q) /* t=cgko c=dhlp */
>
> +#define TRANSPOSE8x4(a,b,c,d,e,f,g,h)\
> + "punpcklbw " #e ", " #a " \n\t" /* a0 e0 a1 e1 a2 e2 a3 e3 */\
> + "punpcklbw " #f ", " #b " \n\t" /* b0 f0 b1 f1 b2 f2 b3 f3 */\
> + "punpcklbw " #g ", " #c " \n\t" /* c0 g0 c1 g1 c2 g2 c3 g3 */\
> + "punpcklbw " #h ", " #d " \n\t" /* d0 h0 d1 h1 d2 h2 d3 h3 */\
> + SBUTTERFLY(a, b, e, bw, q) /* a= a0 b0 e0 f0 a1 b1 e1 f1 */\
> + /* e= a2 b2 e2 f2 a3 b3 e3 f3 */\
> + SBUTTERFLY(c, d, b, bw, q) /* c= c0 d0 g0 h0 c1 d1 g1 h1 */\
> + /* b= c2 d2 g2 h2 c3 d3 g3 h3 */\
> + SBUTTERFLY(a, c, d, wd, q) /* a= a0 b0 c0 d0 e0 f0 g0 h0 */\
> + /* d= a1 b1 c1 d1 e1 f1 g1 h1 */\
> + SBUTTERFLY(e, b, c, wd, q) /* e= a2 b2 c2 d2 e2 f2 g2 h2 */\
> + /* c= a3 b3 c3 d3 e3 f3 g3 h3 */
I don't know if it would be faster, but punpcklbw could read from memory,
making the separate movq unneeded.
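
For readers following the register comments in TRANSPOSE8x4, here is a
minimal scalar C model of the same unpack sequence. It is a sketch, not part
of the patch: mm_t, unpack(), sbutterfly() and transpose8x4_model() are
hypothetical names, and sbutterfly() assumes that SBUTTERFLY(a, b, t, ...)
copies a into t, unpacks the low halves into a and the high halves into t.

```c
#include <stdint.h>
#include <string.h>

/* One 64-bit MMX register; b[0] is the lowest byte. */
typedef struct { uint8_t b[8]; } mm_t;

/* Model of punpckl{bw,wd} (high = 0) and punpckh{bw,wd} (high = 1):
 * interleave n-byte elements taken from one half of dst and src. */
static mm_t unpack(mm_t dst, mm_t src, int n, int high)
{
    mm_t r;
    int base = high ? 4 : 0;
    for (int i = 0; i < 4 / n; i++) {
        memcpy(&r.b[(2 * i)     * n], &dst.b[base + i * n], n);
        memcpy(&r.b[(2 * i + 1) * n], &src.b[base + i * n], n);
    }
    return r;
}

/* Assumed SBUTTERFLY behaviour: t = a; a = low unpack; t = high unpack. */
static void sbutterfly(mm_t *a, const mm_t *b, mm_t *t, int n)
{
    *t = *a;
    *a = unpack(*a, *b, n, 0);
    *t = unpack(*t, *b, n, 1);
}

enum { A, B, C, D, E, F, G, H };

/* In: rows a..h, 4 valid bytes each, in the low half of r[A]..r[H].
 * Out: r[A] = column 0, r[D] = column 1, r[E] = column 2, r[C] = column 3. */
static void transpose8x4_model(mm_t r[8])
{
    r[A] = unpack(r[A], r[E], 1, 0);      /* a0 e0 a1 e1 a2 e2 a3 e3 */
    r[B] = unpack(r[B], r[F], 1, 0);      /* b0 f0 b1 f1 b2 f2 b3 f3 */
    r[C] = unpack(r[C], r[G], 1, 0);      /* c0 g0 c1 g1 c2 g2 c3 g3 */
    r[D] = unpack(r[D], r[H], 1, 0);      /* d0 h0 d1 h1 d2 h2 d3 h3 */
    sbutterfly(&r[A], &r[B], &r[E], 1);   /* byte interleave */
    sbutterfly(&r[C], &r[D], &r[B], 1);
    sbutterfly(&r[A], &r[C], &r[D], 2);   /* word interleave */
    sbutterfly(&r[E], &r[B], &r[C], 2);
}
```

Only the first four unpacks take e..h as a source operand, which is where a
memory operand could stand in for a register that was loaded separately.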
[...]
> +void ff_vp3_v_loop_filter_mmx(uint8_t *src, int stride, int *bounding_values)
> +{
> + if (bounding_values[129] > 63*0x02020202) {
> + ff_vp3_v_loop_filter_c(src, stride, bounding_values);
> + return;
> + }
It would be faster not to do this in the inner loop, though it would be
less clean ...
Except for these, I am fine with the patch.
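
One way the check could be moved out of the inner loop is to test the limit
once and pick a function pointer before iterating over the edges. The sketch
below only illustrates that idea: vp3_needs_c_fallback(), filter_edges() and
the 8-pixel edge spacing are hypothetical and not taken from the patch; only
the comparison against 63*0x02020202 in bounding_values[129] comes from the
quoted code.

```c
#include <stdint.h>

typedef void (*vp3_loop_filter_fn)(uint8_t *src, int stride,
                                   int *bounding_values);

/* Same test as in the quoted patch, evaluated once per limit change. */
static int vp3_needs_c_fallback(const int *bounding_values)
{
    return bounding_values[129] > 63 * 0x02020202;
}

/* Hypothetical caller: choose the filter once, then run it on every edge. */
static void filter_edges(uint8_t *src, int stride, int *bounding_values,
                         int n_edges,
                         vp3_loop_filter_fn simd,
                         vp3_loop_filter_fn c_fallback)
{
    vp3_loop_filter_fn filter =
        vp3_needs_c_fallback(bounding_values) ? c_fallback : simd;

    for (int i = 0; i < n_edges; i++)
        filter(src + 8 * i, stride, bounding_values);
}
```

The cost is that the dispatch logic moves into the caller, which is the
"less clean" part of the trade-off.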
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
When you are offended at any man's fault, turn to yourself and study your
own failings. Then you will forget your anger. -- Epictetus