[FFmpeg-devel] [PATCH] split-radix FFT

Tue Aug 5 21:05:07 CEST 2008

On Tue, Aug 5, 2008 at 9:10 PM, Måns Rullgård wrote:
> matthieu castet <castet.matthieu at free.fr> writes:
>
>> Siarhei Siamashka wrote:
>>> On Mon, Aug 4, 2008 at 10:27 PM, matthieu castet
>>> <castet.matthieu at free.fr> wrote:
>>>> Måns Rullgård wrote:
>>>>> Loren Merritt <lorenm at u.washington.edu> writes:
>>>>>
>>>
>>> Modern ARM cores have a hardware FPU that is reasonably fast, so it
>>> is quite questionable whether a fixed-point decoder would be better
>>> for such cores. The same happened on x86 in the past, and
>>> floating-point audio decoders are now better for modern x86 cores.
>> Do you have some numbers for fixed-point vs. FPU?
>> I believed that ARM FPUs were quite slow, but maybe newer hardware is
>> better.
>
> Cortex-A8 can do two single-precision floating-point ops per cycle.
> Double-precision operations are not pipelined, and take mostly 9-17
> cycles.  Cortex-A9 supposedly has a fully pipelined FPU.

The FPU in ARM11 is fully pipelined; in this sense Cortex-A8 is a step
back. ARM11 can do one single-precision floating-point arithmetic
operation per cycle (multiply-accumulate counts as a single operation)
and, in parallel, two single-precision loads or stores. Double-precision
arithmetic operations are half as fast and need 2 cycles per operation.

An example of code that is quite close to the theoretical limit is the
'vector_fmul_vfp' function. Its data processing throughput is ~1.7
cycles per element (two single-precision loads, one multiplication,
one store). Something less trivial may be more difficult to optimize.
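
For reference, the per-element work counted above can be sketched in
plain C. This is only a scalar illustration that assumes an in-place
element-wise multiply; it is not the VFP assembly itself, and the name
'vector_fmul_ref' is made up for this example. As a rough estimate from
the ARM11 figures above, three memory operations per element at two per
cycle would give a floor of about 1.5 cycles per element.

```c
#include <stddef.h>

/*
 * Hypothetical scalar reference (not the VFP implementation).
 * Each iteration performs two single-precision loads (dst[i], src[i]),
 * one multiplication, and one store (dst[i]).
 */
void vector_fmul_ref(float *dst, const float *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] *= src[i];
}
```

An optimized version would mainly try to keep the load/store pipeline
and the multiplier busy at the same time, which a simple loop like this
leaves up to the compiler.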

Generally, the rumors about ARM11 floating-point slowness come from
blog posts like this one:
http://etrunko.blogspot.com/2008/05/ogg-support-on-canola2.html

They typically compare libvorbis performance with tremor, get much
better results with tremor, and conclude that the difference is caused
by slow floating-point hardware. But they don't take into account that
libvorbis itself is slow (on x86 too). I benchmarked ffvorbis vs.
tremor in MPlayer on ARM11, and even in its current state ffvorbis was
slightly faster (that was before the recent ffvorbis performance
optimizations done by Loren Merritt). A poor floating-point
implementation can be slower than a fixed-point implementation and
vice versa; that is up to the coding and optimization skills of the
people implementing the decoders. Both ffvorbis and tremor can still
be optimized better for ARM cores. Anyway, even theoretically,
fixed-point decoders can hardly be faster, because long
multiplications (32x32->64) take more than one cycle. Less precise
integer multiplications (16x32->low 32 bits of the result) take one
cycle at best. Fixed-point decoders may have an advantage in power
consumption, though this needs to be verified in long-running
battery-drain tests.
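
To make the multiplication variants above concrete, here is a minimal C
sketch of the three operations being compared. The function names are
hypothetical and chosen only for this example. Which instructions a
compiler actually emits for them, and how many cycles they take, depends
on the compiler and the core, so this shows only the arithmetic, not
measured timings.

```c
#include <stdint.h>

/*
 * Long multiplication: 32x32 -> 64 bits, keeping the high word.
 * With Q31 inputs the result is in Q30 format.
 * Note: right-shifting a negative int64_t is implementation-defined in C,
 * but is an arithmetic shift on common compilers.
 */
int32_t mul_32x32_hi(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/*
 * Less precise variant: 16x32, keeping only the low 32 bits of the result.
 * The product is computed in 64 bits and then truncated, so signed
 * overflow is avoided; the final conversion wraps on common compilers.
 */
int32_t mul_16x32_lo(int16_t a, int32_t b)
{
    return (int32_t)(uint32_t)((int64_t)a * b);
}

/* Floating-point counterpart: a single single-precision multiply. */
float mul_float(float a, float b)
{
    return a * b;
}
```

For example, 0x40000000 is 0.5 in Q31, and mul_32x32_hi(0x40000000,
0x40000000) yields 0x10000000, which is 0.25 in Q30; the floating-point
version gets the same value from mul_float(0.5f, 0.5f) with no format
bookkeeping.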