[FFmpeg-devel] Fixpoint FFT optimization, with MDCT and IMDCT wrappers for audio optimization

Mon Jul 30 15:11:34 CEST 2007

On 7/30/07, Michael Niedermayer <michaelni at gmx.at> wrote:
>
> Hi
>
> On Sun, Jul 29, 2007 at 08:13:38PM -0400, Marc Hoffman wrote:
> > On 7/29/07, Diego Biurrun <diego at biurrun.de> wrote:
> > > On Sun, Jul 29, 2007 at 07:20:59PM -0400, Marc Hoffman wrote:
> > > >
> > > > sorry about the mime type gmail doesn't allow me to mark it as
> > > > text/x-patch.  This makes config changes.
> > >
> > > > --- configure (revision 9807)
> > > > +++ configure (working copy)
> > > > @@ -573,6 +574,7 @@
> > > >      bktr
> > > >      dc1394
> > > >      dv1394
> > > > +    fixedpoint
> > > >      ffmpeg
> > > >      ffplay
> > > >      ffserver
> > > > @@ -665,6 +667,7 @@
> > > >      fast_64bit
> > > >      fast_cmov
> > > >      fast_unaligned
> > > > +    fixedpoint
> > > >      fork
> > > >      freetype2
> > > >      GetProcessTimes
> > >
> > > Just CONFIG_LIST is enough.
> > >
> > > > --- libavcodec/Makefile       (revision 9807)
> > > > +++ libavcodec/Makefile       (working copy)
> > > > @@ -358,6 +358,10 @@
> > > >  OBJS-$(CONFIG_VP6F_DECODER)            += i386/vp3dsp_mmx.o
> i386/vp3dsp_sse2.o
> > > >  endif
> > > >
> > > > +ifeq ($(HAVE_FIXEDPOINT),yes)
> > > > +OBJS += fft_fixedpoint.o
> > > > +endif
> > >
> > > Do this in one line, like for all the other files.
> > >
> >
> > Ok guys, I removed myself from the have list...  And corrected the
> > makefile like you asked before.  Much simpler.  Again sorry about the
> > mime attachment....
> [...]
> > +/*
> > +  This is a fixpoint inplace 16bit FFT which accepts 3 arguments:
> > +
> > +  @param X   - input signal in format 1.15
> > +  @param W   - phase factors in 1.15 format
> > +  @param lgN - log_2(N) where N is the size of the input data set.
> > +
> > +  X is the output and its adjusted format is S(1+lgN.15-lgN) i.e.
> > +    if we are talking about a 256 point fft then the output format is 9.7.
> > +*/
>
> not doxygen compatible
>
>
> [...]
> > +                tr        = (X[k2].re*wwr + 0x4000)>>15;
> > +                tr       -= (X[k2].im*wwi + 0x4000)>>15;
> > +                ti        = (X[k2].re*wwi + 0x4000)>>15;
> > +                ti       += (X[k2].im*wwr + 0x4000)>>15;
> > +
> > +                X[k2].re  = (X[k].re - tr)>>1;
> > +                X[k2].im  = (X[k].im - ti)>>1;
> > +
> > +                X[k].re   = (X[k].re + tr)>>1;
> > +                X[k].im   = (X[k].im + ti)>>1;
>
> why not >>16 ? that way you would have 4 shifts less

We could do that, but at a greater loss of precision.  Those shifts are
guards against overflow, and they make the precision dynamic: it depends on
the number of stages.  Only the 16x16 variant behaves this way; the others
don't have this behavior, so no special cases need to be tracked outside of
the fft module.  However, on some low-power devices it might be acceptable
to use this 16-bit variant at the loss of signal performance.
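
To make the overflow guard concrete, here is a minimal sketch of one
radix-2 butterfly in 1.15 format.  It mirrors the quoted lines but is not
the patch itself: the names Cplx16, mul_q15, butterfly_q15 and
frac_bits_after are made up for this illustration, and it assumes >> on a
negative int is an arithmetic shift, as it is on the usual compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type for this sketch: one complex sample in 1.15 format. */
typedef struct { int16_t re, im; } Cplx16;

/* Rounded 1.15 x 1.15 -> 1.15 multiply (0x4000 is one half LSB). */
static int32_t mul_q15(int16_t a, int16_t b)
{
    return ((int32_t)a * b + 0x4000) >> 15;
}

/* One butterfly: a' = (a + w*b)/2, b' = (a - w*b)/2.
 * The >>1 keeps both results inside 16 bits, at the cost of one
 * fractional bit per stage. */
static void butterfly_q15(Cplx16 *a, Cplx16 *b, int16_t wr, int16_t wi)
{
    int32_t tr = mul_q15(b->re, wr) - mul_q15(b->im, wi);
    int32_t ti = mul_q15(b->re, wi) + mul_q15(b->im, wr);
    int32_t ar = a->re, ai = a->im;

    b->re = (int16_t)((ar - tr) >> 1);
    b->im = (int16_t)((ai - ti) >> 1);
    a->re = (int16_t)((ar + tr) >> 1);
    a->im = (int16_t)((ai + ti) >> 1);
}

/* Fractional bits left after lgN stages, following S(1+lgN.15-lgN). */
static int frac_bits_after(int lgN)
{
    assert(lgN >= 0 && lgN <= 15);
    return 15 - lgN;
}
```

With both inputs near full scale the sum would need 17 bits without the
>>1; with it, the result still fits.  Folding that shift into the multiply
(>>16) drops the bit from the product before the add/subtract instead of
after it, which is where the extra precision loss comes from.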

[...]
> > +        w=0;
> > +     hm = m>>1;
> > +        for (j=0; j<hm; j++) {
>
> tabs

(untabify)

[...]
> > +
> > +                X[k2].re  = (X[k].re - tr);
> > +                X[k2].im  = (X[k].im - ti);
> > +
> > +                X[k].re   = (X[k].re + tr);
> > +                X[k].im   = (X[k].im + ti);
>
> superfluous ()

removed.

[...]
> > +                tr        = (X[k2].re*wwr + 0x40000000)>>31;
> > +                tr       -= (X[k2].im*wwi + 0x40000000)>>31;
> > +                ti        = (X[k2].re*wwi + 0x40000000)>>31;
> > +                ti       += (X[k2].im*wwr + 0x40000000)>>31;
>
> >>32 !!
> with >>31 this code just is not useful, cpus tend to have 32bit registers
> not 31bit so this is just the same as the other code
> with >>32 AND without the + 0x... several operations could be avoided
> thus making this as fast as the 16bit code on a reasonable cpu

What is signed arithmetic?  Again, we can do that at the loss of precision,
but then we have to track the format dynamically.  It's a bit of a mess, and
it's generally not how you do these kinds of things.
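
A small sketch of the two options, to show what changes.  The helper names
mul_q31_round and mul_q31_high are hypothetical and not from the patch, and
the code assumes >> on a negative 64-bit value is an arithmetic shift.  A
signed 1.31 x 1.31 product is in 2.62 format, so >>31 returns it to 1.31,
while taking only the high word (>>32) leaves it in 2.30.

```c
#include <stdint.h>

/* Rounded 1.31 x 1.31 -> 1.31: the 64-bit product is 2.62, so dropping
 * 31 bits restores the input format.  0x40000000 is one half LSB.
 * (The single case a == b == INT32_MIN does not fit in the result.) */
static int32_t mul_q31_round(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b + 0x40000000) >> 31);
}

/* High word only, no rounding: 2.62 >> 32 gives 2.30, i.e. the same
 * value with one fractional bit fewer, truncated toward minus infinity.
 * The caller then has to remember that the format changed. */
static int32_t mul_q31_high(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}
```

For 0.5 * 0.5 the first returns 0x20000000 (0.25 in 1.31) and the second
returns 0x10000000 (0.25 in 2.30): the same value, but every later stage has
to know which format it is looking at.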

[...]
> > +    FFTComplex16 *v = av_malloc (sizeof (short)*n);
>
> types mismatch

done

[...]
> > +/* complex multiplication: p = a * b */
> > +#define CMUL(pre, pim, are, aim, bre, bim) \
>
> not doxygen compatible

done

[...]
> > Index: libavcodec/fft-test.c
> > ===================================================================
> > --- libavcodec/fft-test.c     (revision 9807)
> > +++ libavcodec/fft-test.c     (working copy)
>
> this is a mess, see dct-test.c for how to test several implementations of
> some code

I will change this infrastructure to work much the same way as we/I did in
dct-test.  No problem; it will be a separate patch.

Marc