[FFmpeg-devel] [PATCH] SSE2 Xvid idct

Sun Apr 6 23:10:50 CEST 2008

On Sun, Apr 06, 2008 at 09:39:57PM +0200, Pascal Massimino wrote:
>   Hi,
> 
> On Sun, Apr 6, 2008 at 6:14 PM, Michael Niedermayer <michaelni at gmx.at>
> wrote:
> 
> >
> > > skal agreed it could be under LGPL in the last thread.
> >
>  yep
> 
> 
> >
> > [...]
> > > #define SKIP_ROW_CHECK(src)                 \
> > >     "movq     "src", %%mm0            \n\t" \
> > >     "por    8+"src", %%mm0            \n\t" \
> > >     "packssdw %%mm0, %%mm0            \n\t" \
> > >     "movd     %%mm0, %%eax            \n\t" \
> > >     "testl    %%eax, %%eax            \n\t" \
> > >     "jz 1f                            \n\t"
> >
> > You could try to check pairs of rows; this might be faster for some rows.
> > Also, the code should be interleaved rather than forming such nasty
> > dependency chains; you do have enough MMX registers.
> 
> 
>  just a quick note: you can try doing the same with
>  some 'pmovmskb mmreg, eax' instructions.
>  However, this is a complex instruction and the speed gain
>  is not necessarily obvious.

Great idea, I think it could be faster (with SSE registers) because
packssdw is slow.
psadbw could also be tried as an alternative.
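
To make the suggestion concrete, here is a rough sketch of an all-zero row
test in C with SSE2 intrinsics. It assumes a byte compare against zero
(pcmpeqb) in front of pmovmskb, since pmovmskb alone only collects the sign
bits. The function name row_is_zero is hypothetical and not part of the
patch; a scalar fallback is included for builds without SSE2. Whether this
beats the packssdw version would have to be measured.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: returns 1 if all eight 16-bit coefficients
 * of one row are zero, 0 otherwise. */
int row_is_zero(const int16_t row[8])
{
#if defined(__SSE2__)
    /* Load the whole 16-byte row into one SSE register. */
    __m128i v  = _mm_loadu_si128((const __m128i *)row);
    /* pcmpeqb: every byte equal to zero becomes 0xFF. */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    /* pmovmskb: one bit per byte; all 16 bits set means the row is zero. */
    return _mm_movemask_epi8(eq) == 0xFFFF;
#else
    /* Scalar fallback: OR all coefficients together. */
    uint16_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc |= (uint16_t)row[i];
    return acc == 0;
#endif
}
```

Checking a pair of rows would amount to ORing two such loads together before
the compare.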

> 
> 
> >
> > [...]
> > >     "movdqa   %%xmm2, ("dct")         \n\t" \
> > >     "movdqa   %%xmm3, %%xmm2          \n\t" \
> > >     "psubsw   %%xmm6, %%xmm3          \n\t" \
> > >     "paddsw   %%xmm2, %%xmm6          \n\t" \
> > >     "movdqa   %%xmm6, %%xmm2          \n\t" \
> > >     "psubsw   %%xmm7, %%xmm6          \n\t" \
> > >     "paddsw   %%xmm2, %%xmm7          \n\t" \
> > >     "movdqa   %%xmm3, %%xmm2          \n\t" \
> > >     "psubsw   %%xmm5, %%xmm3          \n\t" \
> > >     "paddsw   %%xmm2, %%xmm5          \n\t" \
> > >     "movdqa   %%xmm5, %%xmm2          \n\t" \
> > >     "psubsw   %%xmm0, %%xmm5          \n\t" \
> > >     "paddsw   %%xmm2, %%xmm0          \n\t" \
> > >     "movdqa   %%xmm3, %%xmm2          \n\t" \
> > >     "psubsw   %%xmm4, %%xmm3          \n\t" \
> > >     "paddsw   %%xmm2, %%xmm4          \n\t" \
> > >     "movdqa  ("dct"), %%xmm2          \n\t" \
> >
> > I suspect this can be written without the load/store by using
> > add,add,sub butterflies (of course only if it is faster)
> 
> 
>   IIRC, I tried that and it's the same tick count using the add,add,sub
>  butterfly. Plus, I may be wrong, but I recall that the saturation used
>  with the 'regular' mov,add,sub butterfly helps with nasty corner cases of
>  overflow.

Hmm, I don't see how.
The output of the IDCT is approximately within +-255; due to rounding and
quantization it can be more. IIRC some standard specified +-384.

Now if we assume -384 .. +384 output, then tracing backward
we would have -24576 ... +24576 (384 * 64) before the >>6, and similarly
before any butterflies.

1. So none of the saturation cases in the current butterflies should ever
   trigger.
2. Due to 1. they are equivalent to paddw/psubw
3. As two's complement numbers form an abelian group with respect to
   paddw/psubw, we can apply the associative, commutative, inverse, and
   identity laws without concern.
4. B= a + (-b)
   B= (a + 0) + (-b) (identity)
   B= (a + (a + (-a))) + (-b) (inverse)
   B= ((a + a) + (-a)) + (-b) (associative)
   B= (a + a) + ((-a) + (-b)) (associative)
   B= (a + a) + (-(a + b)) ("product" of inverse)
   at that point we just have
   A= a + b
   t= a + a
   B= t - A = a - b
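
The identity above can be checked with a small scalar model in C. This is
only a sketch: add16/sub16 stand in for one 16-bit lane of paddw/psubw
(wrapping, no saturation), and all function names are made up for
illustration. It assumes the usual two's complement wrap when converting
back to int16_t.

```c
#include <stdint.h>

/* One lane of paddw/psubw: 16-bit add/sub that wraps modulo 2^16. */
int16_t add16(int16_t x, int16_t y)
{
    return (int16_t)(uint16_t)((uint16_t)x + (uint16_t)y);
}

int16_t sub16(int16_t x, int16_t y)
{
    return (int16_t)(uint16_t)((uint16_t)x - (uint16_t)y);
}

/* mov,add,sub form: needs a saved copy of a. */
void butterfly_mov_add_sub(int16_t a, int16_t b, int16_t *sum, int16_t *diff)
{
    int16_t t = a;          /* mov: keep the old a       */
    *sum  = add16(a, b);    /* add: a + b                */
    *diff = sub16(t, b);    /* sub: a - b from the copy  */
}

/* add,add,sub form: A = a + b, t = a + a, B = t - A. */
void butterfly_add_add_sub(int16_t a, int16_t b, int16_t *sum, int16_t *diff)
{
    int16_t A = add16(a, b);    /* A = a + b         */
    int16_t t = add16(a, a);    /* t = a + a         */
    *sum  = A;
    *diff = sub16(t, A);        /* B = t - A = a - b */
}

/* Compare both forms over every a and a strided subset of b.
 * Returns the number of inputs where they disagree. */
long count_butterfly_mismatches(void)
{
    long mismatches = 0;
    for (int a = -32768; a <= 32767; a++) {
        for (int b = -32768; b <= 32767; b += 257) {
            int16_t s1, d1, s2, d2;
            butterfly_mov_add_sub((int16_t)a, (int16_t)b, &s1, &d1);
            butterfly_add_add_sub((int16_t)a, (int16_t)b, &s2, &d2);
            if (s1 != s2 || d1 != d2)
                mismatches++;
        }
    }
    return mismatches;
}
```

Because the arithmetic wraps, the two forms agree even where a + a
overflows; only the saturating paddsw/psubsw variants could differ, and by
point 1 those cases should not occur.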

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I have never wished to cater to the crowd; for what I know they do not
approve, and what they approve I do not know. -- Epicurus