[FFmpeg-devel] [PATCH 11/11] lavu/x86: add FFT assembly

Lynne dev at lynne.ee
Tue Apr 20 01:57:29 EEST 2021


Apr 19, 2021, 22:27 by dev at lynne.ee:

> This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx. 
> The design of this pure assembly FFT is pretty unconventional.
>

Oh, I forgot to mention _why_ on the majority of transforms it's slower than FFTW.
It's simple - we don't hardcode anything. In fact, the biggest bottleneck by far is
the lookup done during loading of input data. Easily more than 40% of the entire
time is spent just doing lookups.
FFTW hardcodes the addresses in the transforms themselves (they call them
codelets). And in the process duplicate herculean amounts of code, amplified by
their just-as-inefficient Split-Radix refactoring which needs 4 versions of each codelet.
The 5+MB stripped binary can't get fat by itself, after all.
Also FFTW only supports a single extension (e.g. fma, avx, avx2) at once during
compile time, so that size figure looks even worse.

For an in-place pre-permuted transform where all the lookups are gone and replaced
with simple loads, we're actually consistently faster than FFTW. Thus, once
non-power-of-two transforms are implemented (where this situation happens),
we'll be faster than FFTW by quite a margin. The old non-power-of-two FFT
code is faster than FFTW by a bit, and this code's C version is quite a bit faster
than that code's C version.
I'd have implemented this as well before I sent the patch, but there are only so many
hours in a day, and ways to ignore the outside world, and the 5.0 release demands
my urgent attention, so I'll write the non-power-of-two part up when I feel like it in
hopefully a week or two.

An easy way to gauge optimization is the perf graph:
 39.58%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.32pt
 21.17%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.64pt
   8.21%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_deinterleave
   3.90%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_32768
   3.15%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_8192
   3.08%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_16384
   3.00%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_65536
   2.97%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_512
   2.96%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_4096
   2.96%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_2048
   2.95%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.256pt
   2.91%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.128pt
   2.89%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_1024
   0.04%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.512pt
   0.01%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.1024pt
   0.01%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.4096pt
   0.01%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.2048pt
   0.00%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.131072pt
   0.00%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.8192pt

Only the 32-point and 64-point transforms read data from the input,
and that's where 60% of the overhead is.

A fast vgatherdpd will go a long way to speed this up, as well as
AVX512. This code should be quite a bit faster on Ice Lake and Tiger Lake
systems, where vgatherdpd was significantly improved over Skylake.
Maybe someone can run checkasm on one of those systems.


More information about the ffmpeg-devel mailing list