[FFmpeg-devel] [PATCH] SSE2 Xvid idct

Mon Apr 7 00:26:49 CEST 2008

  Michael,

On Sun, Apr 6, 2008 at 11:10 PM, Michael Niedermayer <michaelni at gmx.at>
wrote:

> On Sun, Apr 06, 2008 at 09:39:57PM +0200, Pascal Massimino wrote:
> >   Hi,
> >
> > On Sun, Apr 6, 2008 at 6:14 PM, Michael Niedermayer <michaelni at gmx.at>
> > wrote:
> >
> > >
> > > > skal agreed it could be under LGPL in the last thread.
> > >
> >  yep
> >
> >
> > >
> > > [...]
> > > > #define SKIP_ROW_CHECK(src)                 \
> > > >     "movq     "src", %%mm0            \n\t" \
> > > >     "por    8+"src", %%mm0            \n\t" \
> > > >     "packssdw %%mm0, %%mm0            \n\t" \
> > > >     "movd     %%mm0, %%eax            \n\t" \
> > > >     "testl    %%eax, %%eax            \n\t" \
> > > >     "jz 1f                            \n\t"
> > >
> > > > You could try to check pairs of rows, this might be faster for some rows.
> > > > Also the code should be interleaved, not form such nasty dependency chains;
> > > > you do have enough mmx registers.
> >
> >
> >  just a quick note: you can try doing the same with
> >  some 'pmovmskb mmreg, eax' instructions.
> >  However, this is a complex instruction and the speed gain
> >  is not necessarily obvious.
>
> Great idea, i think it could be faster (with SSE registers) due to the
> slowness of packssdw.
> psadbw could be tried as well, as an alternative.
>
>
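
For illustration, the pmovmskb idea discussed above might look roughly like this with SSE2 intrinsics. It is only a sketch: the helper names row_is_zero and row_pair_is_zero are hypothetical and not part of the patch, and a scalar fallback is included for builds without SSE2. Note that pmovmskb only collects the top bit of each byte, so a compare against zero (pcmpeqb) is needed first. Whether this actually beats the packssdw version would have to be measured.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: returns 1 if all eight 16-bit coefficients
 * of one row are zero, 0 otherwise. */
static int row_is_zero(const int16_t row[8])
{
#if defined(__SSE2__)
    __m128i v  = _mm_loadu_si128((const __m128i *)row);
    /* pcmpeqb: each byte becomes 0xFF where the input byte is zero */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    /* pmovmskb: one mask bit per byte; all 16 set means the row is zero */
    return _mm_movemask_epi8(eq) == 0xFFFF;
#else
    int acc = 0;
    for (int i = 0; i < 8; i++)
        acc |= row[i];
    return acc == 0;
#endif
}

/* Hypothetical helper: checks a pair of rows with a single test,
 * by OR-ing the two rows together before the compare. */
static int row_pair_is_zero(const int16_t a[8], const int16_t b[8])
{
#if defined(__SSE2__)
    __m128i v = _mm_or_si128(_mm_loadu_si128((const __m128i *)a),
                             _mm_loadu_si128((const __m128i *)b));
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
#else
    return row_is_zero(a) && row_is_zero(b);
#endif
}
```

The unaligned load is used here only so the sketch does not depend on how the caller aligns its block; an aligned load would be the natural choice for a 16-byte aligned DCT block.
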
> >
> >
> > >
> > > [...]
> > > >     "movdqa   %%xmm2, ("dct")         \n\t" \
> > > >     "movdqa   %%xmm3, %%xmm2          \n\t" \
> > > >     "psubsw   %%xmm6, %%xmm3          \n\t" \
> > > >     "paddsw   %%xmm2, %%xmm6          \n\t" \
> > > >     "movdqa   %%xmm6, %%xmm2          \n\t" \
> > > >     "psubsw   %%xmm7, %%xmm6          \n\t" \
> > > >     "paddsw   %%xmm2, %%xmm7          \n\t" \
> > > >     "movdqa   %%xmm3, %%xmm2          \n\t" \
> > > >     "psubsw   %%xmm5, %%xmm3          \n\t" \
> > > >     "paddsw   %%xmm2, %%xmm5          \n\t" \
> > > >     "movdqa   %%xmm5, %%xmm2          \n\t" \
> > > >     "psubsw   %%xmm0, %%xmm5          \n\t" \
> > > >     "paddsw   %%xmm2, %%xmm0          \n\t" \
> > > >     "movdqa   %%xmm3, %%xmm2          \n\t" \
> > > >     "psubsw   %%xmm4, %%xmm3          \n\t" \
> > > >     "paddsw   %%xmm2, %%xmm4          \n\t" \
> > > >     "movdqa  ("dct"), %%xmm2          \n\t" \
> > >
> > > i suspect this can be written without the load/store by using
> > > add,add,sub butterflies (of course only if it is faster)
> >
> >
> >   iirc, i tried that and it's the same tick count using the add,add,sub
> >  butterfly. Plus, i may be wrong, but i recall that the saturations used
> >  with the 'regular' mov,add,sub butterfly help for nasty corner cases of
> >  overflow.
>
> hmm, i dont see how
> The output of the IDCT is approximately within +-255, and due to rounding and
> quantization it can be more, IIRC some standard specified +-384

  hmm... i think it's [-300,300] actually, with two "sign" mode.

>
>
> Now if we assume -384 .. +384 output then tracing backward
> we would have -24576 ... +24576 before the >>6 and similarly
> before any butterflies.
>

  well, yes, i was rather speaking theoretically:
  Starting around 16384, you have a problem with the sign bit:

  Example: a=-16384, b=-16385. Then the exact result is:
    {a+b,b-a} = {-32769 (non representable),  -1}
  a) With movq,paddsw,psubsw you get: {-32768 (saturated),-1}
  b) With psubw,paddw,paddw you get: {32767 (that's -32769 non saturated),-1}

  Both results are incorrect because the input is out of range, but
  i somehow consider result a)  "less wrong", especially if you are going
  to clip to [0,255] afterward  (since the sign is correct at least, so
  clipping occurs in the right direction).

  But yes, in our case of +-255 input, it's not necessarily a big deal.
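
The corner case above can be reproduced in plain C on a single 16-bit lane. This is a scalar sketch, not the SIMD code from the patch: the helper names add_sat16, add_wrap16, sub_sat16 and sub_wrap16 are hypothetical and stand in for one lane of paddsw, paddw, psubsw and psubw respectively.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate to the int16_t range (saturation). */
static int16_t clamp16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One lane of a saturating add/sub, as paddsw/psubsw would compute it. */
static int16_t add_sat16(int16_t a, int16_t b) { return clamp16((int32_t)a + b); }
static int16_t sub_sat16(int16_t a, int16_t b) { return clamp16((int32_t)a - b); }

/* One lane of a wrapping add/sub, as paddw/psubw would compute it.
 * The arithmetic is done modulo 2^16 on unsigned values; the final
 * conversion back to int16_t is implementation-defined in C, but it
 * wraps on the usual two's complement compilers. */
static int16_t add_wrap16(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}
static int16_t sub_wrap16(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a - (uint16_t)b);
}
```

With a=-16384 and b=-16385, add_sat16(a, b) gives -32768 while add_wrap16(a, b) gives 32767, and both versions of b-a give -1, matching cases a) and b) above.
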

skal

> 1. So none of the saturation cases in the current butterflies should ever
>   trigger.
> 2. Due to 1. they are equivalent to paddw/psubw
> 3. As twos complement numbers form an abelian group with respect to paddw/psubw
>   we can apply the associative, commutative, inverse, identity laws without
>   concern.
> 4. B= a + (-b)
>   B= (a + 0) + (-b) (identity)
>   B= (a + (a + (-a))) + (-b) (inverse)
>   B= ((a + a) + (-a)) + (-b) (associative)
>   B= (a + a) + ((-a) + (-b)) (associative)
>   B= (a + a) + (-(a + b)) ("product" of inverse)
>   at that point we just have
>   A= a + b
>   t= a + a
>   B= t - A = a - b
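
The identity derived above can be spot-checked in scalar C using arithmetic modulo 2^16, which is what paddw/psubw compute per lane. This is a sketch under that assumption: the function names butterfly_mov, butterfly_aas and count_mismatches are hypothetical. In two-operand form the add,add,sub variant presumably maps to paddw b,a+b; paddw a,a+a; psubw a,b, which is why it needs no register copy.

```c
#include <stdint.h>

/* Reference butterfly: sum = a + b, diff = a - b (mod 2^16).
 * In two-operand SIMD code this needs a copy of one input first. */
static void butterfly_mov(uint16_t a, uint16_t b,
                          uint16_t *sum, uint16_t *diff)
{
    *sum  = (uint16_t)(a + b);
    *diff = (uint16_t)(a - b);
}

/* add,add,sub butterfly: A = a + b, t = a + a, B = t - A. */
static void butterfly_aas(uint16_t a, uint16_t b,
                          uint16_t *sum, uint16_t *diff)
{
    uint16_t A = (uint16_t)(a + b);   /* would overwrite b */
    uint16_t t = (uint16_t)(a + a);   /* would overwrite a */
    *sum  = A;
    *diff = (uint16_t)(t - A);        /* equals a - b mod 2^16 */
}

/* Compare both butterflies over a grid of inputs, sampling every
 * `step`-th value in each dimension (step must be nonzero).
 * Returns the number of input pairs where the results differ. */
static long count_mismatches(unsigned step)
{
    long bad = 0;
    for (unsigned a = 0; a < 65536; a += step) {
        for (unsigned b = 0; b < 65536; b += step) {
            uint16_t s1, d1, s2, d2;
            butterfly_mov((uint16_t)a, (uint16_t)b, &s1, &d1);
            butterfly_aas((uint16_t)a, (uint16_t)b, &s2, &d2);
            if (s1 != s2 || d1 != d2)
                bad++;
        }
    }
    return bad;
}
```

The equivalence holds for every input in wrapping arithmetic; it is only once saturating paddsw/psubsw are used that the two forms can differ, and only for inputs outside the range discussed above.
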
>
> [...]