Michael Niedermayer michaelni
Sat Aug 9 19:16:31 CEST 2008

```
On Thu, Aug 07, 2008 at 08:22:35PM -0600, Loren Merritt wrote:
> On Thu, 7 Aug 2008, Michael Niedermayer wrote:
> >
> > I am not sure if it's worth it to simplify this, but I think if we don't attempt
> > to mask off the high bits inside the function then the following might work:
> >
> > if(!(i & m))          return  split_radix_permutation(i, m, inverse)<<1;
> > m >>= 1;
> > if(inverse == !(i&m)) return (split_radix_permutation(i, m, inverse)<<2) + 1;
> > else                  return (split_radix_permutation(i, m, inverse)<<2) - 1;
>
> done
>
> > s->revtab[(-split_radix_permutation(i, n, s->inverse)) & (n-1)] = i;
>
> done
>
> > It would be nice if the forced duplication could be limited to
> > #ifndef CONFIG_SMALL, unless it's significantly slower that way
>
> I tried several combinations of recursive fft##n and/or re-rolling
> pass{,_big} and/or re-rolling fft16 and/or removing pass or pass_big.
> I can make it smaller and retain speed on core2 or prescott, but not both
> cpus at once.
> k8 is equally happy with any version.
>
> 2^4  2^5  2^6  2^7   2^8   2^9  2^10   2^11   2^12  variant  code_size
> penryn:
> 142  417 1120 2837  6589 14935 33433  74609 164273  fft.00        4070
> 142  418 1132 2863  6662 15108 33844  74712 165418  fft.11        3189
> 142  417 1120 2838  6590 14938 46809 114069 282947  fft.10        3133
> 142  462 1231 3011  6982 15769 35297  78270 170920  fft.05        2572
> 142  462 1194 2997  6947 15780 48557 117461 289381  fft.01        2516
> 175  516 1396 3338  7673 17166 51432 123494 301169  fft.03        1652
> 180  542 1411 3414  7853 17452 51895 124489 304666  fft.04        1175
>
> prescott:
> 423 1122 2854 7044 16366 37274 84451 187963 418948  fft.10        2414
> 423 1120 2855 7056 16390 37437 87674 196322 442723  fft.00        3176
> 420 1162 2972 7082 16693 38034 85973 189885 421885  fft.01        1745
> 466 1235 3149 7451 17410 39395 89301 202842 447159  fft.03        1162
> 472 1209 3130 7543 17438 40310 91024 206670 456248  fft.04         830
> 425 1227 3217 8032 18968 43605 98880 219511 487624  fft.11        2532
> 421 1286 3316 8082 19250 44563 99940 223647 495350  fft.05        1872
>
> .00 is the previous patch, all compiled with -Os
> fft.10 (that's removing pass_big) might be a decent compromise if you
> don't care about a huge speed regression in cases that aren't currently
> used by any audio codec.

Pick what you like best; speed on x86 probably does not matter too much
for the CONFIG_SMALL case. It's more useful for devices with ARM
and rather little storage.
The non-CONFIG_SMALL case wouldn't be affected by any changes anyway, if
I understand correctly ...

>
> >> +    int n = 1<<s->nbits;
> >> +    int i;
> >> +    ff_fft_dispatch_3dn2(z, s->nbits);
> >>      asm volatile("femms");
> >> +    for(i=0; i<n; i+=2)
> >> +        FFSWAP(FFTSample, z[i].im, z[i+1].re);
> >>  }
> >
> > could you elaborate on why this FFSWAP pass is needed?
>
> Intermediate results are not arrays of complex numbers, but rather group
> reals and imaginaries into blocks according to the simd register size. I
> suppose I could merge the swap pass into the last fft pass, like I did for
> sse.

If the swapping could be done (nearly) for free in the last pass, that would
be great. If, OTOH, it would slow down the IMDCT, it would probably be better
to leave it as is, as we don't really need an FFT anyway. But for others who
might want to borrow our fft, it surely would be nicer if no extra swapping
were needed.

[...]

```
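
For readers who want to try the split_radix_permutation() suggestion quoted at the top of the thread, here is a minimal standalone sketch in C. It is not taken from the patch. The base case (n <= 2) and the "m = n >> 1" line are assumptions filled in around the quoted fragment, build_revtab() and revtab_is_permutation() are hypothetical helpers added only for illustration, and the shifts are written as multiplications because the intermediate result can be negative and left-shifting a negative int is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Position of input index i in the split-radix ordering of an n-point FFT.
 * The result is deliberately NOT masked here: it may be negative or >= n,
 * and the caller reduces it modulo n. 'inverse' must be 0 or 1.
 * Assumption: the base case and "m = n >> 1" are not in the quoted mail. */
static int split_radix_permutation(int i, int n, int inverse)
{
    int m;
    if (n <= 2)
        return i & 1;
    m = n >> 1;
    if (!(i & m))
        return split_radix_permutation(i, m, inverse) * 2;
    m >>= 1;
    if (inverse == !(i & m))
        return split_radix_permutation(i, m, inverse) * 4 + 1;
    else
        return split_radix_permutation(i, m, inverse) * 4 - 1;
}

/* Hypothetical helper: fill revtab[0 .. n-1] for n = 1 << nbits, mirroring
 *   s->revtab[(-split_radix_permutation(i, n, s->inverse)) & (n-1)] = i;  */
static void build_revtab(uint16_t *revtab, int nbits, int inverse)
{
    int n, i;
    assert(nbits >= 2 && nbits <= 16);
    n = 1 << nbits;
    for (i = 0; i < n; i++) {
        /* Negate in unsigned arithmetic so that wrap-around is well defined;
         * the mask then keeps only the low nbits bits. */
        unsigned p = 0u - (unsigned)split_radix_permutation(i, n, inverse);
        revtab[p & (unsigned)(n - 1)] = (uint16_t)i;
    }
}

/* Hypothetical check: returns 1 if revtab holds each of 0 .. n-1 exactly once. */
static int revtab_is_permutation(const uint16_t *revtab, int n)
{
    unsigned char seen[1 << 16] = {0};
    int i;
    for (i = 0; i < n; i++) {
        if (revtab[i] >= n || seen[revtab[i]])
            return 0;
        seen[revtab[i]] = 1;
    }
    return 1;
}
```

Calling build_revtab(tab, 2, 0) on a 4-entry table gives tab = {0, 2, 1, 3}. Masking only at the call site is what lets the recursive function stay this short: the "- 1" branch can return a negative number, and the final "& (n-1)" folds it back into range.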