[FFmpeg-devel] [PATCH] VC-1 MMX DSP functions
Christophe GISQUET
christophe.gisquet
Sun Nov 18 17:20:35 CET 2007
Michael Niedermayer wrote:
> On Sat, Nov 17, 2007 at 12:33:31PM +0100, Christophe GISQUET wrote:
>> +#define SHIFT2_16B_END_LINE(R) \
>> + "psraw %5, %%mm"#R" \n\t" \
>> + "movq %%mm"#R", (%2) \n\t" \
>> + "add %3, %1 \n\t" \
>> + "add $24, %2 \n\t"
>
> the $24 add can be avoided by using an offset for the movq above
Applied. It also made me notice that I wasn't using the
SHIFT2_8B_END_LINE macro.
>> + "movq %%mm3, %%mm1 \n\t" /* 0,1,1,0*/
>> + "movq %%mm4, %%mm2 \n\t" /* 0,1,1,0*/
>> + "psubw %%mm5, %%mm3 \n\t" /*-1,1,1,0*/
>> + "psubw %%mm6, %%mm4 \n\t" /*-1,1,1,0*/
>> + "psllw $3, %%mm1 \n\t" /* 0,8,8,0*/
>> + "psllw $3, %%mm2 \n\t" /* 0,8,8,0*/
>> + "movd 0(%1,%3), %%mm5 \n\t"
>> + "movd 4(%1,%3), %%mm6 \n\t"
>> + "paddw %%mm1, %%mm3 \n\t" /*-1,9,9,0*/
>> + "paddw %%mm2, %%mm4 \n\t" /*-1,9,9,0*/
>> + "punpcklbw %%mm0, %%mm5 \n\t"
>> + "punpcklbw %%mm0, %%mm6 \n\t"
>> + "psubw %%mm5, %%mm3 \n\t" /*-1,9,9,-1*/
>> + "psubw %%mm6, %%mm4 \n\t" /*-1,9,9,-1*/
>
>
> "psubw %%mm3, %%mm5 \n\t" /* 1,-1,-1, 0*/
> "psubw %%mm4, %%mm6 \n\t" /* 1,-1,-1, 0*/
> "psllw $3, %%mm3 \n\t" /* 0,8,8,0*/
> "psllw $3, %%mm4 \n\t" /* 0,8,8,0*/
> "movd 0(%1,%3), %%mm1 \n\t"
> "movd 4(%1,%3), %%mm2 \n\t"
> "psubw %%mm5, %%mm3 \n\t" /*-1,9,9,0*/
> "psubw %%mm6, %%mm4 \n\t" /*-1,9,9,0*/
> "punpcklbw %%mm0, %%mm1 \n\t"
> "punpcklbw %%mm0, %%mm2 \n\t"
> "psubw %%mm1, %%mm3 \n\t" /*-1,9,9,-1*/
> "psubw %%mm2, %%mm4 \n\t" /*-1,9,9,-1*/
Yes, but I've decided to use pmullw here... (see below).
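For readers following the coefficient comments, here is a minimal scalar
sketch of one 16-bit lane of the sequences discussed above. The function
names are hypothetical, and the pmullw form is only an assumption about
how a multiply by 9 could replace the shift/add pair; the attached patch
may arrange it differently.

```c
#include <assert.h>
#include <stdint.h>

/* a, b, c, d: four neighbouring pixels (0..255) widened to 16 bits. */

/* Sequence from the patch: copy, subtract, shift, add, subtract. */
static int16_t taps_copy_shift(int16_t a, int16_t b, int16_t c, int16_t d)
{
    int16_t sum  = (int16_t)(b + c);     /*  0,1,1,0            */
    int16_t copy = sum;                  /* movq copy           */
    sum  = (int16_t)(sum - a);           /* psubw: -1,1,1,0     */
    copy = (int16_t)(copy << 3);         /* psllw $3: 0,8,8,0   */
    sum  = (int16_t)(sum + copy);        /* paddw: -1,9,9,0     */
    return (int16_t)(sum - d);           /* psubw: -1,9,9,-1    */
}

/* Suggested sequence: no register copy, the sign is folded in. */
static int16_t taps_no_copy(int16_t a, int16_t b, int16_t c, int16_t d)
{
    int16_t sum = (int16_t)(b + c);      /*  0,1,1,0            */
    int16_t neg = (int16_t)(a - sum);    /* psubw: 1,-1,-1,0    */
    sum = (int16_t)(sum << 3);           /* psllw $3: 0,8,8,0   */
    sum = (int16_t)(sum - neg);          /* psubw: -1,9,9,0     */
    return (int16_t)(sum - d);           /* psubw: -1,9,9,-1    */
}

/* Assumed multiply form: one multiply by 9 instead of shift + add. */
static int16_t taps_mul(int16_t a, int16_t b, int16_t c, int16_t d)
{
    int16_t sum = (int16_t)((b + c) * 9); /* pmullw: 0,9,9,0    */
    return (int16_t)(sum - a - d);        /* -1,9,9,-1          */
}

/* Returns 1 if all three forms agree on a coarse sweep of pixel values. */
static int taps_agree(void)
{
    for (int a = 0; a <= 255; a += 17)
        for (int b = 0; b <= 255; b += 17)
            for (int c = 0; c <= 255; c += 17)
                for (int d = 0; d <= 255; d += 17) {
                    int16_t ref = (int16_t)(-a + 9 * b + 9 * c - d);
                    if (taps_copy_shift(a, b, c, d) != ref ||
                        taps_no_copy(a, b, c, d) != ref ||
                        taps_mul(a, b, c, d) != ref)
                        return 0;
                }
    return 1;
}
```

All three produce -a + 9*b + 9*c - d; they differ only in how many
register copies and arithmetic steps are needed to get there.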
>> + "movq %%mm1, %%mm3 \n\t" \
>> + "movq %%mm2, %%mm4 \n\t" \
>> + "paddw %%mm1, %%mm1 \n\t" \
>> + "paddw %%mm2, %%mm2 \n\t" \
>> + "paddw %%mm3, %%mm1 \n\t" /* 3* */ \
>> + "paddw %%mm4, %%mm2 \n\t" /* 3* */ \
>
> have you checked that pmullw with 3 is not faster?
It only improves the horizontal pass noticeably (2550 vs 2700
decicycles, i.e. roughly 5%). The other functions seem improved too,
but by less than 1%.
There are 2 reasons why I tried to avoid pmullw as much as possible:
- here, I couldn't load the factor in a register (this seems less speed
critical than I remembered)
- I have a Core 2 and an Athlon machine; on both, pmullw has a latency
of 3 cycles; I think some P4s have a latency of 6.
As for the code change you proposed in the previous paragraph, I decided
to retest vc1_put_shift2_mmx with a pmullw. It did improve the speed of
this function on my core2 (around 5%).
Therefore, the attached patch uses pmullw in both places. It is faster
on my machine; I don't know how it behaves on a P4.
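As a quick illustration of why the two "3*" forms are interchangeable,
here is a scalar sketch of a single 16-bit lane (function names are
hypothetical). Both paddw and the low half kept by pmullw wrap modulo
2^16, so the results match for every input; only the timing differs.

```c
#include <assert.h>
#include <stdint.h>

/* Tripling as in the first patch: copy, double, add the copy back. */
static uint16_t triple_by_adds(uint16_t x)
{
    uint16_t copy = x;                   /* movq                      */
    x = (uint16_t)(x + x);               /* paddw: 2*x, wraps at 2^16 */
    return (uint16_t)(x + copy);         /* paddw: 3*x                */
}

/* Tripling with a multiply; pmullw keeps the low 16 bits of the product. */
static uint16_t triple_by_mul(uint16_t x)
{
    return (uint16_t)((uint32_t)x * 3u);
}

/* Returns 1 if both forms agree for every possible 16-bit lane value. */
static int triples_agree(void)
{
    for (uint32_t x = 0; x <= 0xFFFF; x++)
        if (triple_by_adds((uint16_t)x) != triple_by_mul((uint16_t)x))
            return 0;
    return 1;
}
```

The add form costs three instructions per register (movq plus two
paddw) against a single pmullw, which is the trade-off measured above.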
Best regards,
--
Christophe GISQUET
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vc1dsp.diff
Type: text/x-patch
Size: 26117 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20071118/73891a58/attachment.bin>