[FFmpeg-devel] [PATCH] avfilter, swresample, swscale: use fabs, fabsf instead of FFABS
gajjanag at mit.edu
Mon Oct 12 23:27:32 CEST 2015
On Mon, Oct 12, 2015 at 4:57 PM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
> On Mon, Oct 12, 2015 at 7:59 AM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
>> On Mon, Oct 12, 2015 at 7:46 AM, Carl Eugen Hoyos <cehoyos at ag.or.at> wrote:
>>> Ganesh Ajjanagadde <gajjanag <at> mit.edu> writes:
>>>> It is well known that fabs and fabsf are at least as fast and usually
>>>> faster than the FFABS macro, at least on the gcc+glibc combination.
>>> I wasn't aware of this.
>>> And I believe we support other compilers and other
>>> libc implementations.
>> Indeed, which is why performance comparisons are welcome. I argue
>> below why any sane configuration should not regress performance wise.
>> This is also "relevant information" in my view.
>>>> For instance, see the reference:
>>>> This was a patch to glibc in order to remove their usages. Given their
>>>> general performance obsession (more than FFmpeg in many cases), they
>>>> have ensured that fabs and fabsf never perform worse than FFABS.
>>> Ok but is this really related?
>> The reference is; the comment may not be. I was slightly annoyed at
>> FFABS usage when libc provides fabs/fabsf on all our platforms, and
>> wanted a justification that would appeal to the FFmpeg crowd, namely
>> performance, to move away from the macro.
>>>> I have tested on x86-64 Haswell with GCC 5.2 - even with no strict IEEE
>>>> mode enabled, and just the standard -O3 optimizations, there is a
>>>> performance benefit.
>>> This is the only relevant information imo.
>>> Please provide (very, very short) information
>>> on what you tested.
>> Random integers, same style as before. I have not posted numbers,
>> since my numbers are anyway meaningless: I lack non
>> x86-64+(gcc/clang)+glibc configurations.
>> As for that being the only relevant message, I do intend to shorten
>> the message. The long stuff was simply my own personal motivation to
>> make people understand why I did this stuff. Otherwise, I would have
>> sent a separate message anyway in the patch thread, let me know what
>> style you prefer.
>>> Since you mention libc so often: Does the patch
>>> work on win*, aix and other strange platforms?
>> Why not, any standard, conformant fabs/fabsf should. Again, I lack the
>> configurations and am just a university student with a single laptop.
>> fabs and fabsf are already being used elsewhere. If anything, they
>> are far better specified on IEEE 754 than FFABS - behavior with NaN,
>> Inf, etc.
> Bench from libavfilter/astats on a 15 min clip. Of course the
> difference is slight, but nonetheless it exists. The best case is the
> same, but look at the difference in the worst cases (as was mentioned
> in the glibc link I gave, I suspect some trickery for subnormal
> floats/Inf/0.0). By the way, I can show results skewing even more
> heavily in favor of fabs by using "random" floating point numbers,
> random in the sense of being a random 64 bit pattern (same style as my
> old crude bench - fill a large array, and test). There, believe it or
> not, I was getting a nearly 1.5-2x improvement.
> Anyway, here it is:
> 4230 decicycles in abs, 1 runs, 0 skips
> 2520 decicycles in abs, 2 runs, 0 skips
> 1635 decicycles in abs, 4 runs, 0 skips
> 967 decicycles in abs, 8 runs, 0 skips
> 635 decicycles in abs, 16 runs, 0 skips
> 473 decicycles in abs, 32 runs, 0 skips
> 389 decicycles in abs, 64 runs, 0 skips
> 350 decicycles in abs, 128 runs, 0 skips
> 331 decicycles in abs, 256 runs, 0 skips
> 321 decicycles in abs, 512 runs, 0 skips
> 319 decicycles in abs, 1024 runs, 0 skips
> 318 decicycles in abs, 2048 runs, 0 skips
> 315 decicycles in abs, 4096 runs, 0 skips
> 317 decicycles in abs, 8192 runs, 0 skips
> 335 decicycles in abs, 16384 runs, 0 skips
> 335 decicycles in abs, 32768 runs, 0 skips
> 333 decicycles in abs, 65536 runs, 0 skips
> 342 decicycles in abs, 131072 runs, 0 skips
> 340 decicycles in abs, 262144 runs, 0 skips
> 345 decicycles in abs, 524285 runs, 3 skips
> 348 decicycles in abs, 1048565 runs, 11 skips
> 351 decicycles in abs, 2097129 runs, 23 skips
> 352 decicycles in abs, 4194252 runs, 52 skips
> 350 decicycles in abs, 8388498 runs, 110 skips
> 351 decicycles in abs,16776993 runs, 223 skips
> 352 decicycles in abs,33553999 runs, 433 skips
> 351 decicycles in abs,67108036 runs, 828 skips
> 3540 decicycles in abs, 1 runs, 0 skips
> 2160 decicycles in abs, 2 runs, 0 skips
> 1447 decicycles in abs, 4 runs, 0 skips
> 881 decicycles in abs, 8 runs, 0 skips
> 594 decicycles in abs, 16 runs, 0 skips
> 455 decicycles in abs, 32 runs, 0 skips
> 382 decicycles in abs, 64 runs, 0 skips
> 361 decicycles in abs, 128 runs, 0 skips
> 356 decicycles in abs, 256 runs, 0 skips
> 334 decicycles in abs, 512 runs, 0 skips
> 322 decicycles in abs, 1024 runs, 0 skips
> 317 decicycles in abs, 2048 runs, 0 skips
> 315 decicycles in abs, 4096 runs, 0 skips
> 341 decicycles in abs, 8192 runs, 0 skips
> 363 decicycles in abs, 16383 runs, 1 skips
> 342 decicycles in abs, 32767 runs, 1 skips
> 354 decicycles in abs, 65532 runs, 4 skips
> 348 decicycles in abs, 131068 runs, 4 skips
> 354 decicycles in abs, 262138 runs, 6 skips
> 356 decicycles in abs, 524277 runs, 11 skips
> 356 decicycles in abs, 1048560 runs, 16 skips
> 354 decicycles in abs, 2097120 runs, 32 skips
> 354 decicycles in abs, 4194251 runs, 53 skips
> 353 decicycles in abs, 8388504 runs, 104 skips
> 353 decicycles in abs,16777006 runs, 210 skips
> 353 decicycles in abs,33553993 runs, 439 skips
> 352 decicycles in abs,67107951 runs, 913 skips
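For anyone who wants to repeat the crude "random 64 bit pattern" bench
on another configuration, here is a rough sketch of what I mean. It is
not the code that produced the numbers above: the PRNG (a plain
xorshift64), all function names, and the use of clock() are stand-ins
chosen for illustration, and the FFABS definition is assumed to match
the one in libavutil/common.h.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Assumed to match the macro in libavutil/common.h. */
#define FFABS(a) ((a) >= 0 ? (a) : (-(a)))

/* Stand-in PRNG (xorshift64); any source of 64-bit patterns would do. */
static uint64_t next_bits(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fill buf with doubles whose 64 bits are random, so NaNs, subnormals
 * and huge magnitudes can occur, unlike with "nice" sample values. */
void fill_random_bits(double *buf, size_t n, uint64_t seed)
{
    uint64_t state = seed ? seed : 1; /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++) {
        uint64_t bits = next_bits(&state);
        memcpy(&buf[i], &bits, sizeof bits);
    }
}

/* Absolute value of every element via the macro. */
void abs_macro(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = FFABS(in[i]);
}

/* Absolute value of every element via the C library. */
void abs_libm(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = fabs(in[i]);
}

/* CPU seconds spent in one pass of fn over the array. */
double time_pass(void (*fn)(const double *, double *, size_t),
                 const double *in, double *out, size_t n)
{
    clock_t start = clock();
    fn(in, out, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call fill_random_bits() once on a large array, then compare
time_pass(abs_macro, ...) against time_pass(abs_libm, ...) over several
repetitions. clock() is coarse, so the array needs to be large for the
difference to be visible at all.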
Assuming people are on board with this, and assuming fmin, fmax etc.
are available on all our platforms (C99 feature), I strongly suspect
usage of them will improve performance for similar reasons.
Basically, I think that for all floating point stuff, FFMAX,
FFMIN, and FFABS should not be used. For integers, performance is
identical on any decent config (no magic with FPU stuff needs to be
done), and a macro could be useful in places with a large number of
FFABS calls: the compiler may not always inline due to code size
concerns, but we may deliberately want it. Nonetheless, such usage
should be kept to a minimum (it should be rare), and IMHO such usage
should be justified with a performance benchmark.
Let me put it another way (and this is what bugged me): I want to flip
things around. Currently, I have to provide (like above)
justifications for using the C standard library functions whenever
available, while FFABS, FFMIN, FFMAX are used willy-nilly (even in
recent code!) even when they are suboptimal as demonstrated above. I
want to flip it so that usage of the macro instead of the standard
library must be justified with a performance benchmark.
Such a benchmark should not be on some ancient config on which many
things are gimped (e.g. many asm optimizations), since more common
platforms should not be made to suffer for the sake of obscure ones.
If a regression occurs on a reasonably common platform, then we should
explore workarounds, like a configure check (fabs on sane platforms,
macro on IMHO broken ones).
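A rough sketch of what such a workaround could look like follows.
HAVE_FAST_FABS, FF_FABS, FF_FABSF and sum_abs are all hypothetical
names invented here; no such configure symbol exists as far as I know,
and the FFABS definition is assumed to match libavutil/common.h.

```c
#include <math.h>

/* HYPOTHETICAL: pretend configure defines this to 1 on platforms where
 * fabs/fabsf are known to be at least as fast as the macro. */
#ifndef HAVE_FAST_FABS
#define HAVE_FAST_FABS 1
#endif

/* Assumed to match the macro in libavutil/common.h. */
#define FFABS(a) ((a) >= 0 ? (a) : (-(a)))

#if HAVE_FAST_FABS
#define FF_FABS(x)  fabs(x)
#define FF_FABSF(x) fabsf(x)
#else
/* Fallback for platforms where the library call measurably regresses. */
#define FF_FABS(x)  FFABS(x)
#define FF_FABSF(x) FFABS(x)
#endif

/* Example hot loop: sum of absolute values over float audio samples. */
float sum_abs(const float *samples, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += FF_FABSF(samples[i]);
    return acc;
}
```

Call sites would then use FF_FABS/FF_FABSF unconditionally, and only
the configure result decides which implementation they get. The cost is
one more layer of indirection, so this only makes sense if a real
regression shows up somewhere.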
You may ask: why care? There are legitimate uses of floating point
arithmetic in FFmpeg. libavcodec not so much, libswresample definitely
yes, libavfilter often yes. The above is a rough analysis; there may
be nuances. And floating point often occurs in the heavy computations,
like looping over audio samples, not just in init routines. Even a
minuscule benefit from such a simple, readable change that also
improves code "cleanliness" is IMHO worth having.
Anyway, I just wanted to lay out my perspective at the moment, and am
interested in comments on this.
>>> Carl Eugen