[Ffmpeg-devel] [PATCH] SSE counterpart of ff_imdct_calc_3dn2
Michael Niedermayer
michaelni
Sun Aug 20 16:04:27 CEST 2006
Hi
On Sun, Aug 20, 2006 at 06:15:06PM +0800, Zuxy Meng wrote:
> Hi,
>
> The patch is simply a re-write of Loren's recent work. fft-test shows
> a speed-up around 18%~20% in my Pentium M 2G, not very exciting but
> faster indeed. Please kindly take a review.
no objections to the patch but see comments below
[...]
> +void ff_imdct_calc_sse(MDCTContext *s, FFTSample *output,
> + const FFTSample *input, FFTSample *tmp)
> +{
> + long k, n8, n4, n2, n;
> + const uint16_t *revtab = s->fft.revtab;
> + const FFTSample *tcos = s->tcos;
> + const FFTSample *tsin = s->tsin;
> + const FFTSample *in1, *in2;
> + FFTComplex *z = (FFTComplex *)tmp;
> +
> + n = 1 << s->nbits;
> + n2 = n >> 1;
> + n4 = n >> 2;
> + n8 = n >> 3;
> +
> + asm volatile ("movaps %0, %%xmm7\n\t"::"m"(*p1m1p1m1));
> +
> + /* pre rotation */
> + in1 = input;
> + in2 = input + n2 - 4;
> +
> + /* Complex multiplication
> + Two complex products per iteration, we could have 4 with 8 xmm
> + registers, 8 with 16 xmm registers.
> + Maybe we should unroll more.
> + */
> + for (k = 0; k < n4; k += 2) {
> + asm volatile (
> + "movaps %0, %%xmm0 \n\t" // xmm0 = r0 X r1 X : in2
> + "movaps %1, %%xmm3 \n\t" // xmm3 = X i1 X i0: in1
> + "movlps %2, %%xmm1 \n\t" // xmm1 = X X R1 R0: tcos
> + "movlps %3, %%xmm2 \n\t" // xmm2 = X X I1 I0: tsin
> + "shufps $95, %%xmm0, %%xmm0 \n\t" // xmm0 = r1 r1 r0 r0
> + "shufps $160,%%xmm3, %%xmm3 \n\t" // xmm3 = i1 i1 i0 i0
> + "unpcklps %%xmm2, %%xmm1 \n\t" // xmm1 = I1 R1 I0 R0
the unpcklps above and one memory read can be avoided by changing the layout
of the tsin/tcos tables (e.g. storing them interleaved); that would also
reduce the number of pointers and maybe avoid the register shortage gcc ends
up with below
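
A minimal sketch of one possible table change, assuming it means interleaving
the two tables. The helper name interleave_twiddles and the combined table tcs
are hypothetical and not part of the patch or of libavcodec; FFTSample is
redefined locally as float so the snippet stands alone.

```c
#include <stddef.h>

typedef float FFTSample; /* local stand-in for the libavcodec typedef */

/* Hypothetical helper: pack tcos/tsin as R0 I0 R1 I1 ... so that the
 * cos and sin for one twiddle factor sit next to each other in memory. */
static void interleave_twiddles(FFTSample *tcs,
                                const FFTSample *tcos,
                                const FFTSample *tsin,
                                size_t n4)
{
    for (size_t k = 0; k < n4; k++) {
        tcs[2 * k]     = tcos[k]; /* R_k */
        tcs[2 * k + 1] = tsin[k]; /* I_k */
    }
}
```

With such a layout, and tcs 16-byte aligned, a single movaps from &tcs[2*k]
(k even) would load xmm1 = I1 R1 I0 R0 directly, replacing the two movlps
loads and the unpcklps, and only one table pointer would be needed.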
> + "movaps %%xmm1, %%xmm2 \n\t" // xmm2 = I1 R1 I0 R0
> + "xorps %%xmm7, %%xmm2 \n\t" // xmm2 = -I1 R1 -I0 R0
> + "mulps %%xmm1, %%xmm0 \n\t" // xmm0 = rI rR rI rR
> + "shufps $177,%%xmm2, %%xmm2 \n\t" // xmm2 = R1 -I1 R0 -I0
> + "mulps %%xmm2, %%xmm3 \n\t" // xmm3 = Ri -Ii Ri -Ii
> + "addps %%xmm3, %%xmm0 \n\t" // xmm0 = result
> + ::"m"(in2[-2*k]), "m"(in1[2*k]),
> + "m"(tcos[k]), "m"(tsin[k])
> + );
> + /* Should be in the same block, hack for gcc2.95 & gcc3 */
> + asm (
> + "movlps %%xmm0, %0 \n\t"
> + "movhps %%xmm0, %1 \n\t"
> + :"=m"(z[revtab[k]]), "=m"(z[revtab[k + 1]])
> + );
> + }
what about writing the whole loop in asm? i bet you can do better than gcc :)
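
For anyone checking a hand-written version of this loop, a scalar reference
may help. The sketch below is derived only from the register comments in the
quoted asm (r from in2, i from in1, R from tcos, I from tsin); the function
name imdct_prerotate_ref is hypothetical and the types are redefined locally,
so treat it as an illustration rather than the actual libavcodec code.

```c
#include <stdint.h>

typedef float FFTSample;                         /* local stand-ins for the */
typedef struct { FFTSample re, im; } FFTComplex; /* libavcodec types        */

/* Hypothetical scalar reference for the pre-rotation: one complex
 * product (r + j*i) * (R + j*I) per k, stored in bit-reversed order. */
static void imdct_prerotate_ref(FFTComplex *z, const FFTSample *input,
                                const FFTSample *tcos, const FFTSample *tsin,
                                const uint16_t *revtab, int n)
{
    int n2 = n >> 1;
    int n4 = n >> 2;

    for (int k = 0; k < n4; k++) {
        FFTSample r = input[n2 - 1 - 2 * k]; /* read backwards, as in2 */
        FFTSample i = input[2 * k];          /* read forwards, as in1  */
        int j = revtab[k];

        z[j].re = r * tcos[k] - i * tsin[k];
        z[j].im = r * tsin[k] + i * tcos[k];
    }
}
```

The SSE loop in the patch produces two of these products per iteration
(k and k + 1), which is why it steps k by 2.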
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
In the past you could go to a library and read, borrow or copy any book
Today you'd get arrested for mere telling someone where the library is