[FFmpeg-devel] [PATCH] SSE2 Xvid idct

Sun Apr 6 21:39:57 CEST 2008

  Hi,

On Sun, Apr 6, 2008 at 6:14 PM, Michael Niedermayer <michaelni at gmx.at>
wrote:

>
> > skal agreed it could be under LGPL in the last thread.
>
 yep

>
> [...]
> > #define SKIP_ROW_CHECK(src)                 \
> >     "movq     "src", %%mm0            \n\t" \
> >     "por    8+"src", %%mm0            \n\t" \
> >     "packssdw %%mm0, %%mm0            \n\t" \
> >     "movd     %%mm0, %%eax            \n\t" \
> >     "testl    %%eax, %%eax            \n\t" \
> >     "jz 1f                            \n\t"
>
> You could try to check pairs of rows, this might be faster for some rows.
> Also the code should be interleaved, not form such nasty dependency chains;
> you do have enough mmx registers.

 just a quick note: you can try doing the same with
 some 'pmovmskb mmreg, eax' instructions.
 However, this is a complex instruction and the speed gain
 is not necessarily obvious.

>
> [...]
> >     "movdqa   %%xmm2, ("dct")         \n\t" \
> >     "movdqa   %%xmm3, %%xmm2          \n\t" \
> >     "psubsw   %%xmm6, %%xmm3          \n\t" \
> >     "paddsw   %%xmm2, %%xmm6          \n\t" \
> >     "movdqa   %%xmm6, %%xmm2          \n\t" \
> >     "psubsw   %%xmm7, %%xmm6          \n\t" \
> >     "paddsw   %%xmm2, %%xmm7          \n\t" \
> >     "movdqa   %%xmm3, %%xmm2          \n\t" \
> >     "psubsw   %%xmm5, %%xmm3          \n\t" \
> >     "paddsw   %%xmm2, %%xmm5          \n\t" \
> >     "movdqa   %%xmm5, %%xmm2          \n\t" \
> >     "psubsw   %%xmm0, %%xmm5          \n\t" \
> >     "paddsw   %%xmm2, %%xmm0          \n\t" \
> >     "movdqa   %%xmm3, %%xmm2          \n\t" \
> >     "psubsw   %%xmm4, %%xmm3          \n\t" \
> >     "paddsw   %%xmm2, %%xmm4          \n\t" \
> >     "movdqa  ("dct"), %%xmm2          \n\t" \
>
> i suspect this can be written without the load/store by using
> add,add,sub butterflies (of course only if it is faster)

  iirc, i tried that and the add,add,sub butterfly comes out at the same
 tick count. Plus, i may be wrong, but i recall that the saturation used
 with the 'regular' mov,add,sub butterfly helps with nasty corner cases of
 overflow.

 I'll try and save some cycles to review the rest asap

skal