[FFmpeg-devel] Fixpoint FFT optimization, with MDCT and IMDCT wrappers for audio optimization

Wed Aug 1 22:58:36 CEST 2007

On 8/1/07, Michael Niedermayer <michaelni at gmx.at> wrote:
> Hi
>
> On Mon, Jul 30, 2007 at 10:33:16PM -0400, Marc Hoffman wrote:
> [...]
> > I will do it if it makes sense right now I don't see it being the most
> > efficient. Lets get through the basic acceptance and when you and I
> > decide to move forward we can talk about more efficient mechanisms for
> > general machines.  (I'm not going anywhere so even if you accept this
> > we can change it in the future).
> >
> > The split radix is not the most efficient way to do things on the
> > BlackFin machine.  It has to do with all the extra pointer stuff you
> > need to maintain.  On other machines this is more efficient.  Not to
> > go into this too much anyways I agreed earlier to implement this for
> > us (ffmpeg-devel) and I will just not right now.  I really want to see
> > if I/we can get one audio codec to work in fixEDpoint and achieve high
> > quality I think this is what you/we really care about anyways.
>
> iam not sure what extra pointers you are talking about
> if you are thinking of a recursive implementation which gets 3 pointers
> to the 3 input parts this is unacceptable the current cooley tukey FFT
> also isnt written by passing pointers recursively around
>
>

Just to clarify... In terms of pure number of operations required,
split radix is more efficient than radix-4. But implemetation on a
specific platform may not be. For example, BlackFin 16x16 radix-4 FFT
kernel has 6 MAC cycles (to do 3 complex multiplies) and 4
add/subtract cycles. Split radix in best case improves multiplies by
about 10% and add/subtracts by a very negligible amount. Thus, total
theoretical improvement is .6 cycles (from MACs), i.e. 10 kernel
cycles become 9.4 cycles. Thus, the total improvement is only 6%.
Combine this with first two stages (that should be done separately
since they have no multiplies) and improvement is even less.  Now the
overhead of split radix (such as addressing) negates the cycle
improvements altogether.

Finding the most efficient algorithm is not as simple as finding the
algorithm with the lowest number of operations. One usually has to
consider the peculiarities of the architecture in question.