[FFmpeg-devel] [RFC] Disable compile-time tablegen for cbrt if total cycle count < 200000
gajjanag at mit.edu
Sat Jan 2 22:56:16 CET 2016
On Sat, Jan 2, 2016 at 1:10 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
> On Sat, Jan 2, 2016 at 4:08 PM, Ganesh Ajjanagadde <gajjanag at mit.edu> wrote:
>> On Sat, Jan 2, 2016 at 1:02 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>> > Hi,
>> > On Sat, Jan 2, 2016 at 3:23 PM, Ganesh Ajjanagadde <gajjanag at mit.edu>
>> > wrote:
>> >> On Fri, Jan 1, 2016 at 8:07 AM, Ganesh Ajjanagadde <gajjanag at mit.edu>
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > Motivated by a remark by Ronald:
>> >> > https://ffmpeg.org/pipermail/ffmpeg-devel/2016-January/186200.html,
>> >> > this is a request for comment on disabling compile time tablegen for
>> >> > cbrt if the total cycle count < 200000. Note that cbrt tables are
>> >> > only used in aacdec.
>> >> To start some effort towards a more principled understanding of the
>> >> costs of runtime table initialization, I did some benchmarks.
>> >> Note: I am not familiar with avcodec, so I don't know if this reflects
>> >> correctly the static vs dynamic cost.
>> >> file: ~/samples/aac/al04_44.mp4
>> >> stream_loop: 100
>> >> number of calls of avcodec_decode_audio4: 35956
>> >> cost per call (avcodec_decode_audio4):
>> >> 834030 decicycles in decode_audio4, 1 runs, 0 skips
>> >> 556200 decicycles in decode_audio4, 2 runs, 0 skips
>> >> [...]
>> >> 177365 decicycles in decode_audio4, 16384 runs, 0 skips
>> >> 177059 decicycles in decode_audio4, 32768 runs, 0 skips
>> >> decoding cost: 17706*35956 = 636,636,936 cycles
>> >> duration: 832.55 seconds
>> >> cost per second of audio: 764,683 cycles
>> >> cost of table init: 200,000 cycles
>> >> fraction: 0.26
>> >> So in a clip of n seconds duration, the relative overhead of dynamic
>> >> initialization of these cbrt tables is 0.26/n. For a more concrete
>> >> number, say a clip is of 180 seconds duration, then the overhead is
>> >> 0.26/180 = 0.15%.
>> > What if I only want to play the first 3 seconds of 1000 clips by calling
>> > ffmpeg.exe in a shell script? E.g. for fingerprinting. The number of use
>> > cases you cover needs to be more than just playback, ffmpeg can do much
>> > more than just that.
>> Two remarks:
>> 1. As I said, this was only a start of the discussion; and the general
>> c/t decay holds; constant c should be close to what I obtained. So
>> yes, if you have such a thing, it will be slower.
>> 2. I thought ffmpeg had the ability to handle multiple input files in
>> a single invocation? Thus, someone doing such a thing is IMHO doing it
> ffmpeg has a lot of abilities. But most of our users are not Harvard (or MIT
> :-) ) PhDs, so they're unlikely to do it in the most optimal way, and very
> likely to do it in the easiest way.
I don't deny that, but isn't it also true that most non-power users
will not discover or use --enable-hardcoded-tables? Now that I have
added performance build notes to the wiki
(https://trac.ffmpeg.org/wiki/CompilationGuide), some readers may
try it, but I can't say.
There is also the overhead of program invocation itself: OS-level
work plus FFmpeg's own internal initialization, which is separate from
this particular table and from the other table generation steps. It
may turn out that the 200,000 cycles are a small fraction of the total
startup cost. I deliberately avoided benchmarking this to keep the
focus narrow, but if you think it would be useful for perspective on
this thread, I can add it.
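
To make the 0.26/n estimate quoted earlier in this message easy to recheck, here is a minimal Python sketch of the same arithmetic. It uses only the figures from the benchmark above. The 200,000 cycle init cost is the assumed figure from the subject line, not something measured by this script. The function and constant names are hypothetical, chosen for illustration. "Overhead" here means table init cost divided by decoding cost, as in the quoted estimate.

```python
# Figures taken from the benchmark quoted in this thread.
CYCLES_PER_CALL = 17706       # 177059 decicycles per call, rounded to cycles
NUM_CALLS = 35956             # calls to avcodec_decode_audio4
DURATION_S = 832.55           # total decoded audio, in seconds
TABLE_INIT_CYCLES = 200_000   # assumed one-time cost of cbrt table init


def cycles_per_second_of_audio():
    """Decoding cost per second of audio for the benchmarked file."""
    return CYCLES_PER_CALL * NUM_CALLS / DURATION_S


def init_overhead(clip_seconds):
    """Table init cost relative to the decoding cost of a clip (the c/t ratio)."""
    return TABLE_INIT_CYCLES / (cycles_per_second_of_audio() * clip_seconds)


if __name__ == "__main__":
    # 180 s is the example from the thread; 3 s is the short-clip scenario.
    for seconds in (3, 180):
        print(f"{seconds:>4} s clip: {100 * init_overhead(seconds):.2f}% overhead")
```

Under these assumptions the ratio is on the order of 9% for a 3 second clip and about 0.15% for a 180 second clip. This counts decoding cost only and ignores the process startup cost discussed above.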