[FFmpeg-devel] [PATCH] faster vp6 decoding
Sebastien Lucas
sebastien.lucas
Wed Feb 11 18:03:37 CET 2009
On Wed, Feb 11, 2009 at 4:28 PM, Zuxy Meng <zuxy.meng at gmail.com> wrote:
> Hi,
>
> 2009/2/9 Jason Garrett-Glaser <darkshikari at gmail.com>:
>> + "punpcklbw %%mm7, %%mm0\n\t" \
>> + "punpcklbw %%mm7, %%mm1\n\t" \
>> + "punpckhbw %%mm7, %%mm3\n\t" \
>> + "punpckhbw %%mm7, %%mm4\n\t" \
>> + "pmullw 0(%2), %%mm0\n\t" /* src[x-8 ] * biweight [0] */ \
>> + "pmullw 8(%2), %%mm1\n\t" /* src[x ] * biweight [1] */ \
>> + "pmullw 0(%2), %%mm3\n\t" /* src[x-8 ] * biweight [0] */ \
>> + "pmullw 8(%2), %%mm4\n\t" /* src[x ] * biweight [1] */ \
>> + "paddw %%mm1, %%mm0\n\t" \
>> + "paddw %%mm4, %%mm3\n\t" \
>>
>> This can be done faster with pmaddubsw (SSSE3-only, but worth making
>> another version surely).
>
> Sure but that would require weights to be stored as arrays of int8_t
> instead of int16_t?
Yes, and that is almost possible, but not quite.
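To make the constraint concrete, here is a rough scalar C model of what one 16-bit lane of pmaddubsw computes: two unsigned source bytes multiplied by two signed 8-bit weights, with the pair summed using signed saturation. It is only a sketch. The names maddubs_lane, fits_int8 and narrow_weight are made up for illustration, and the idea that a tap can reach 128 is an assumption (it would follow if a pass-through filter is normalized to 128, which the add-64/shift-7 rounding suggests).

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of pmaddubsw:
 * unsigned byte * signed byte, adjacent pair summed with signed saturation. */
static int16_t maddubs_lane(uint8_t p0, uint8_t p1, int8_t w0, int8_t w1)
{
    int32_t sum = (int32_t)p0 * w0 + (int32_t)p1 * w1;
    if (sum >  32767) sum =  32767;
    if (sum < -32768) sum = -32768;
    return (int16_t)sum;
}

/* Hypothetical helper: does a 16-bit weight survive narrowing to int8_t? */
static int fits_int8(int16_t w)
{
    return w >= -128 && w <= 127;
}

/* Hypothetical helper: narrow a weight, refusing values that would wrap.
 * A tap of 128 (assumed pass-through case) trips the assert. */
static int8_t narrow_weight(int16_t w)
{
    assert(fits_int8(w));
    return (int8_t)w;
}
```

With weights interleaved as (w0, w1) pairs, maddubs_lane(src[x-1], src[x], w0, w1) stands for the two multiplies and one add that the MMX code does separately, which is where the instruction savings would come from. A single tap of 128 is enough to make the int8_t table unusable as-is.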
>> Worthwhile if you make an SSE version.
>
> SSE2?
>
>> Works by interleaving the weights, allowing you to avoid the unpacks,
>> use only two multiplies, and avoid the adds, too, I think. If I'm
>> right, that makes the entire thing quite a bit less than half the
>> instructions.
>
> I tried something like below and it's about 15% faster on my Pentium
> M. The speed up should be more prominent on modern CPUs with 128 bit
> FADD unit:
>
> #define DIAG4_SSE2(in1,in2,in3,in4) \
> "movq "#in1"(%0), %%xmm0\n\t" \
> "movq "#in2"(%0), %%xmm1\n\t" \
> "punpcklbw %%xmm7, %%xmm0\n\t" \
> "punpcklbw %%xmm7, %%xmm1\n\t" \
> "pmullw %%xmm4, %%xmm0\n\t" /* src[x-8 ] * biweight [0] */ \
> "pmullw %%xmm5, %%xmm1\n\t" /* src[x ] * biweight [1] */ \
> "paddw %%xmm1, %%xmm0\n\t" \
> "movq "#in3"(%0), %%xmm1\n\t" \
> "movq "#in4"(%0), %%xmm2\n\t" \
> "punpcklbw %%xmm7, %%xmm1\n\t" \
> "punpcklbw %%xmm7, %%xmm2\n\t" \
> "pmullw %%xmm6, %%xmm1\n\t" /* src[x+8 ] * biweight [2] */ \
> "pmullw %%xmm3, %%xmm2\n\t" /* src[x+16] * biweight [3] */ \
> "paddw %%xmm2, %%xmm1\n\t" \
> "paddw %%xmm1, %%xmm0\n\t" \
> "paddw _ff_diag4_round, %%xmm0\n\t" /* Add 64 */ \
> "psrlw $7, %%xmm0\n\t" \
> "packuswb %%xmm0, %%xmm0\n\t" \
> "movq %%xmm0, (%1)\n\t"
>
> static void ff_vp6_filter_diag4_sse2(uint8_t *dst, uint8_t *src, int stride,
> const int16_t *h_weights,const int16_t *v_weights)
> {
> uint8_t tmp[8*11];
> uint8_t *t = tmp;
> src -= stride;
>
> asm (
> "pxor %%xmm7, %%xmm7\n\t"
> "movq %4, %%xmm3\n\t"
> "pshuflw $0, %%xmm3, %%xmm4\n\t"
> "punpcklqdq %%xmm4, %%xmm4\n\t"
> "pshuflw $85, %%xmm3, %%xmm5\n\t"
> "punpcklqdq %%xmm5, %%xmm5\n\t"
> "pshuflw $170, %%xmm3, %%xmm6\n\t"
> "punpcklqdq %%xmm6, %%xmm6\n\t"
> "pshuflw $255, %%xmm3, %%xmm3\n\t"
> "punpcklqdq %%xmm3, %%xmm3\n\t"
> "1:\n\t"
> DIAG4_SSE2(-1,0,1,2)
> "addl $8, %1\n\t"
> "addl %2, %0\n\t"
> "decl %3\n\t"
> "jnz 1b\n\t"
> :
> : "r" (src), "r" (t), "g" (stride), "r" (11),
> "m"(*(int64_t*)h_weights)
> : "memory"
> );
>
> t = tmp + 8;
>
> asm (
> "movq %4, %%xmm3\n\t"
> "pshuflw $0, %%xmm3, %%xmm4\n\t"
> "punpcklqdq %%xmm4, %%xmm4\n\t"
> "pshuflw $85, %%xmm3, %%xmm5\n\t"
> "punpcklqdq %%xmm5, %%xmm5\n\t"
> "pshuflw $170, %%xmm3, %%xmm6\n\t"
> "punpcklqdq %%xmm6, %%xmm6\n\t"
> "pshuflw $255, %%xmm3, %%xmm3\n\t"
> "punpcklqdq %%xmm3, %%xmm3\n\t"
> "1:\n\t"
> DIAG4_SSE2(-8,0,8,16)
> "addl $8, %0\n\t"
> "addl %2, %1\n\t"
> "decl %3\n\t"
> "jnz 1b\n\t"
> :
> : "r" (t), "r" (dst), "g" (stride), "r" (8),
> "m"(*(int64_t*)v_weights)
> : "memory"
> );
> }
Thanks for your time. I guess that means the patch is actually working
(not crashing, and even bit-exact output?).
I also used pshufw in my (still unsent) MMXEXT version of the patch;
I'm still tweaking it.
I have a small fix to the MMX code to test, and I'll update your code accordingly.
That leaves me with the x86_64 problem, which I don't know how to fix.
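For checking bit-exactness, a plain C reference of one filter pass could look roughly like the sketch below. The names clip_uint8 and diag4_pass_ref and the parameter layout are invented for this example; only the 4 taps, the rounding by 64 and the shift by 7 come from the asm above. The arithmetic shift followed by a clamp to 0..255 is an assumption about the intended behaviour: the quoted SSE2 code uses psrlw, a logical shift, which would give a different result whenever the rounded sum is negative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the 0..255 pixel range, like packuswb does on signed words. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* One 4-tap pass over `rows` rows of 8 pixels.
 * delta is the distance between taps: 1 for the horizontal pass,
 * the row pitch of the source for the vertical pass. */
static void diag4_pass_ref(uint8_t *dst, ptrdiff_t dst_stride,
                           const uint8_t *src, ptrdiff_t src_stride,
                           ptrdiff_t delta, const int16_t *weights, int rows)
{
    assert(rows > 0);
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < 8; x++) {
            int sum = src[x - delta]     * weights[0]
                    + src[x]             * weights[1]
                    + src[x + delta]     * weights[2]
                    + src[x + 2 * delta] * weights[3];
            /* Assumes >> on a negative int is an arithmetic shift,
             * which holds on the usual compilers. */
            dst[x] = clip_uint8((sum + 64) >> 7);
        }
        src += src_stride;
        dst += dst_stride;
    }
}
```

Running the horizontal pass with delta = 1 over 11 rows into an 8-byte-pitch temporary, then the vertical pass with delta = 8 over 8 rows starting one row into that temporary, mirrors the two asm blocks, and the output can be compared byte for byte against the SIMD versions.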
Sébastien