[FFmpeg-devel] [PATCH] swr/resample: use fma when it is faster

Mon Dec 14 15:11:09 CET 2015

On Sun, Dec 13, 2015 at 10:25 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> Hi,
>
> On Sun, Dec 13, 2015 at 7:29 PM, Ganesh Ajjanagadde <gajjanag at mit.edu>
> wrote:
>
>> The worst part is that it is a bad idea to do runtime dispatch on the
>> fma() itself, as the function call overhead will be non-negligible, and
>> so one can't create a helper API in avutil or elsewhere. Thus, it can
>> only be used when a function is in a critical hotspot, where the
>> duplication of code and maintenance burden can be justified for the
>> performance benefits. I might be missing something here though.
>
>
> You would DSP'ize the loop, not the single fma instruction, right?
> Depending on the size of the array (i.e. the size variable), it may be ok.

That is a general problem: fma is useful in a variety of contexts,
some of which do not map naturally onto, e.g., a level-one BLAS
a'*b + c. Thus, in an ideal world (say, if I were developing only for
my own machine), I would simply use fma wherever there is an
x += y * z and reap a cheap performance gain. I was planning
demonstrations for vsrc_mandelbrot and avutil/lls (the Cholesky code),
but as you pointed out originally, this cheap method is not something
FFmpeg can accept.
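
As a minimal sketch of the x += y * z case, assuming plain C99 and
nothing FFmpeg-specific: dot_accumulate() below is a hypothetical helper
name, not an existing avutil function. It uses fma() only when the C
library advertises FP_FAST_FMA, i.e. a compile-time choice that only
suits code built for one known machine.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not an FFmpeg API): returns acc + sum(y[i] * z[i]).
 * FP_FAST_FMA is the C99 macro that <math.h> may define when fma() on
 * doubles is about as fast as a separate multiply and add. */
static double dot_accumulate(double acc, const double *y, const double *z,
                             size_t n)
{
    assert(y && z);
    for (size_t i = 0; i < n; i++) {
#ifdef FP_FAST_FMA
        acc = fma(y[i], z[i], acc);  /* fused: one rounding per step */
#else
        acc += y[i] * z[i];          /* plain multiply followed by add */
#endif
    }
    return acc;
}
```

Because the choice is made at compile time, a generic distribution
build cannot pick the fma path based on the CPU it ends up running on.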

This lack of generality, i.e. the inability to create a generic,
easy-to-use fma wrapper across FFmpeg, is what I was referring to here,
not this particular case.

To address your question more concretely: I avoided this because
keeping the polynomial evaluation inline can potentially offer a smart
compiler greater room for optimization in bessel here. For instance (I
am not an asm person, but this is based on what I know of SIMD), the
numerator and denominator polynomials can be evaluated in parallel, at
least until the point where their degrees match. This depends on:
1. The compiler unrolling all of these loops, which it can in
principle as it knows the sizes of the arrays, and they are quite
small (max 15).
2. The compiler being able to auto-vectorize the relevant computations.
3. Any alignment or other relevant hackery being done, of which I know
nothing.
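
The parallel-evaluation idea can be sketched as below. This is only an
illustration under stated assumptions: eval_rational() and MULADD are
hypothetical names, the coefficient layout (highest degree first) is my
choice, and this is not the bessel code from the patch. It uses Horner's
scheme, first consuming the surplus leading terms of the longer
polynomial, then advancing both in lockstep.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Use fma() only where the C library says it is fast (C99 FP_FAST_FMA). */
#ifdef FP_FAST_FMA
#define MULADD(a, b, c) fma((a), (b), (c))
#else
#define MULADD(a, b, c) ((a) * (b) + (c))
#endif

/* Hypothetical helper: evaluates num(x) / den(x) by Horner's scheme.
 * Coefficients are stored highest degree first. */
static double eval_rational(const double *num, size_t num_len,
                            const double *den, size_t den_len, double x)
{
    double p = 0.0, q = 0.0;
    size_t i = 0, j = 0;

    assert(num && den && num_len > 0 && den_len > 0);

    /* Consume the surplus leading terms of the longer polynomial alone. */
    while (num_len - i > den_len - j)
        p = MULADD(p, x, num[i++]);
    while (den_len - j > num_len - i)
        q = MULADD(q, x, den[j++]);

    /* Equal number of terms left: two independent dependency chains
     * that a compiler is free to interleave or vectorize. */
    while (i < num_len) {
        p = MULADD(p, x, num[i++]);
        q = MULADD(q, x, den[j++]);
    }
    return p / q;
}
```

For example, with num = {1, 2, 3} (x^2 + 2x + 3) and den = {1, 1}
(x + 1), the first loop handles the x^2 term alone and the final loop
then advances both polynomials together. Whether a compiler actually
unrolls and vectorizes this depends on the three points above.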

Anyway, the short summary is: I like keeping code as generic as
possible. I won't write asm for this particular case; anyone
interested is free to create an optimized bessel routine. For someone
with the know-how, it should be trivial. Of course, it is not used in
speed-critical code, which is why I am not inclined to do it myself.
Note: avcodec/kbdwin also uses a bessel that is inferior to the current
code, so maybe there is some utility.

>
> Ronald