[FFmpeg-devel] [PATCH] SSE2 Xvid idct
Pascal Massimino
pascal.massimino
Mon Apr 7 00:26:49 CEST 2008
Michael,
On Sun, Apr 6, 2008 at 11:10 PM, Michael Niedermayer <michaelni at gmx.at>
wrote:
> On Sun, Apr 06, 2008 at 09:39:57PM +0200, Pascal Massimino wrote:
> > Hi,
> >
> > On Sun, Apr 6, 2008 at 6:14 PM, Michael Niedermayer <michaelni at gmx.at>
> > wrote:
> >
> > >
> > > > skal agreed it could be under LGPL in the last thread.
> > >
> > yep
> >
> >
> > >
> > > [...]
> > > > #define SKIP_ROW_CHECK(src) \
> > > > "movq "src", %%mm0 \n\t" \
> > > > "por 8+"src", %%mm0 \n\t" \
> > > > "packssdw %%mm0, %%mm0 \n\t" \
> > > > "movd %%mm0, %%eax \n\t" \
> > > > "testl %%eax, %%eax \n\t" \
> > > > "jz 1f \n\t"
> > >
> > > You could try to check pairs of rows, this might be faster for some
> > > rows.
> > > Also the code should be interleaved, not form such nasty dependency
> > > chains;
> > > you do have enough mmx registers.
> >
> >
> > just a quick note: you can try doing the same with
> > some 'pmovmskb mmreg, eax' instructions.
> > However, this is a complex instruction and the speed gain
> > is not necessarily obvious.
>
> Great idea, i think it could be faster (with SSE registers) due to the
> slowness of packssdw.
> psadbw could be tried as well, as an alternative.
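The two alternatives mentioned above can be sketched with SSE2 intrinsics rather than inline asm. This is only an illustration, not code from the patch: the function names are hypothetical, and the row is assumed to be eight 16-bit coefficients. Since pmovmskb only collects the sign bit of each byte, the sketch compares against zero first; psadbw against zero sums the bytes of each 64-bit half, which is zero only if every byte is zero.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: 1 if all eight coefficients are zero, else 0. */
static int row_is_zero_scalar(const int16_t row[8])
{
    int acc = 0;
    for (int i = 0; i < 8; i++)
        acc |= row[i];
    return acc == 0;
}

/* pcmpeqb + pmovmskb variant (hypothetical name). */
static int row_is_zero_movmsk(const int16_t row[8])
{
#if defined(__SSE2__)
    __m128i v  = _mm_loadu_si128((const __m128i *)row);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128()); /* 0xFF where byte == 0 */
    return _mm_movemask_epi8(eq) == 0xFFFF;              /* all 16 bytes were zero */
#else
    return row_is_zero_scalar(row);                      /* fallback without SSE2 */
#endif
}

/* psadbw variant (hypothetical name). */
static int row_is_zero_sad(const int16_t row[8])
{
#if defined(__SSE2__)
    __m128i v   = _mm_loadu_si128((const __m128i *)row);
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());  /* byte sums of each 64-bit half */
    return (_mm_cvtsi128_si32(sad) | _mm_extract_epi16(sad, 4)) == 0;
#else
    return row_is_zero_scalar(row);
#endif
}
```

Whether either variant actually beats the packssdw sequence would have to be measured; the sketch only shows that both give the same answer as the scalar check.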
>
>
> >
> >
> > >
> > > [...]
> > > > "movdqa %%xmm2, ("dct") \n\t" \
> > > > "movdqa %%xmm3, %%xmm2 \n\t" \
> > > > "psubsw %%xmm6, %%xmm3 \n\t" \
> > > > "paddsw %%xmm2, %%xmm6 \n\t" \
> > > > "movdqa %%xmm6, %%xmm2 \n\t" \
> > > > "psubsw %%xmm7, %%xmm6 \n\t" \
> > > > "paddsw %%xmm2, %%xmm7 \n\t" \
> > > > "movdqa %%xmm3, %%xmm2 \n\t" \
> > > > "psubsw %%xmm5, %%xmm3 \n\t" \
> > > > "paddsw %%xmm2, %%xmm5 \n\t" \
> > > > "movdqa %%xmm5, %%xmm2 \n\t" \
> > > > "psubsw %%xmm0, %%xmm5 \n\t" \
> > > > "paddsw %%xmm2, %%xmm0 \n\t" \
> > > > "movdqa %%xmm3, %%xmm2 \n\t" \
> > > > "psubsw %%xmm4, %%xmm3 \n\t" \
> > > > "paddsw %%xmm2, %%xmm4 \n\t" \
> > > > "movdqa ("dct"), %%xmm2 \n\t" \
> > >
> > > i suspect this can be written without the load/store by using
> > > add,add,sub butterflies (of course only if it is faster)
> >
> >
> > iirc, i tried that and it's the same tick count using the add,add,sub
> > butterfly. Plus, i may be wrong, but i recall that the saturation used
> > with the 'regular' mov,add,sub butterfly helps for nasty corner cases
> > of overflow.
>
> hmm, i dont see how
> The output of the IDCT is approximately within +-255, and due to rounding
> and quantization it can be more, IIRC some standard specified +-384
hmm... i think it's [-300,300] actually, with two "sign" mode.
>
>
> Now if we assume -384 .. +384 output then tracing backward
> we would have -24576 ... +24576 before the >>6 and similarly
> before any butterflies.
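The bound quoted above is just the output range scaled back through the final shift. A tiny sketch of that arithmetic, with a hypothetical helper name and assuming the only scaling step is the >>6 mentioned in the mail:

```c
#include <stdint.h>

/* Hypothetical helper: largest magnitude before a final right shift,
 * given the largest magnitude allowed after it. */
static int32_t pre_shift_bound(int32_t out_bound, int shift)
{
    return out_bound * ((int32_t)1 << shift);
}
```

For an output bound of 384 and a shift of 6 this gives 24576, which is below INT16_MAX (32767), so values in that range fit a signed 16-bit lane without saturating.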
>
well, yes, i was rather speaking theoretically:
Starting around 16384, you have a problem with the sign bit:
Example: a=-16384, b=-16385. Then the exact result is:
{a+b,b-a} = {-32769 (not representable), -1}
a) With movq,paddsw,psubsw you get: {-32768 (saturated),-1}
b) With paddw,paddw,psubw you get: {32767 (that's -32769 wrapped, not saturated),-1}
Both results are incorrect because the input is out of range, but
i somehow consider result a) "less wrong", especially if you are going
to clip to [0,255] afterward (since the sign is correct at least, so
clipping occurs in the right direction).
But yes, in our case of +-255 input, it's not necessarily a big deal.
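The corner case above can be reproduced with a scalar model of one 16-bit lane. This is a sketch only: the helper names are made up for illustration, and the wrapping helper relies on the usual two's complement conversion from uint16_t to int16_t (implementation-defined in C, but what mainstream compilers do).

```c
#include <stdint.h>

/* Models one lane of paddsw: add with signed saturation. */
static int16_t add_sat16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Models one lane of paddw: add modulo 2^16. */
static int16_t add_wrap16(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((unsigned)a + (unsigned)b);
}

/* Final clip to the pixel range [0,255]. */
static uint8_t clip_uint8(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}
```

With a=-16384 and b=-16385, add_sat16 returns -32768 and add_wrap16 returns 32767. After clip_uint8 the saturated sum becomes 0 (the correct direction for a large negative value) while the wrapped sum becomes 255.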
skal
> 1. So none of the saturation cases in the current butterflies should ever
>    trigger.
> 2. Due to 1. they are equivalent to paddw/psubw
> 3. As two's complement numbers form an abelian group with respect to
>    paddw/psubw we can apply the associative, commutative, inverse, identity
>    laws without concern.
> 4. B= a + (-b)
>    B= (a + 0) + (-b)                 (identity)
>    B= (a + (a + (-a))) + (-b)        (inverse)
>    B= ((a + a) + (-a)) + (-b)        (associative)
>    B= (a + a) + ((-a) + (-b))        (associative)
>    B= (a + a) + (-(a + b))           ("product" of inverses)
> at that point we just have
> A= a + b
> t= a + a
> B= t - A = a - b
>
> [...]