[FFmpeg-devel] [PATCH] Extra build options for ALS (and others)

Wed Dec 2 12:52:47 CET 2009

Thilo Borgmann wrote:
> Michael Niedermayer wrote:
>> On Mon, Nov 30, 2009 at 04:09:23PM +0100, Thilo Borgmann wrote:
>>> Thilo Borgmann wrote:
>>>> Måns Rullgård wrote:
>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>
>>>>>> Måns Rullgård wrote:
>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>
>>>>>>>> Måns Rullgård wrote:
>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>
>>>>>>>>>> Måns Rullgård wrote:
>>>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> recently the need for an extra build option for the ALS decoder arose.
>>>>>>>>>>> Is it impossible to achieve the desired outcome with some combination
>>>>>>>>>>> of always_inline, noinline, and flatten attributes?
>>>>>>>>>> No. See [PATCH] Split reading and decoding of blocks in ALS.
>>>>>>>>>>
>>>>>>>>>> Although I've managed to have the functions from alsdec.c inlined
>>>>>>>>>> manually according to the grep'ed output of the assembler code, it seems
>>>>>>>>>> like it is not enough to manually inline functions from within that .c
>>>>>>>>>> file using only this technique.
>>>>>>>>> I'm confused.  Can it be done in the C code only or not?  This kind of
>>>>>>>>> issue should really not be solved in the makefile.
>>>>>>>> The issue is the big slowdown. The patch that causes this splits a big
>>>>>>>> function into two, which are then called successively.
>>>>>>>>
>>>>>>>> To overcome the slowdown issue, I inspected the functions being inlined
>>>>>>>> with and without the -finline-limit option. I can use av_always_inline
>>>>>>>> for many functions within alsdec.c to have the same functions inlined
>>>>>>>> as -finline-limit does.
>>>>>>>>
>>>>>>>> Unfortunately, using -finline-limit removes the slowdown introduced by
>>>>>>>> the patch while using av_always_inline does not.
>>>>>>> So it's not doing the same thing.  What is it doing differently?
>>>>>>> Where did you get the limit number from?
>>>>>>>
>>>>>> All function calls within alsdec.s when using -finline-limit=4096:
>>>>>>    1 	call	L1102
>>>>>>    1 	call	L138
>>>>>>    1 	call	L456
>>>>>>    2 	call	L___udivdi3$stub
>>>>>>   10 	call	L_av_freep$stub
>>>>>>    1 	call	L_av_get_bits_per_sample_format$stub
>>>>>>   12 	call	L_av_log$stub
>>>>>>    5 	call	L_av_log_missing_feature$stub
>>>>>>    8 	call	L_av_malloc$stub
>>>>>>    2 	call	L_av_mallocz$stub
>>>>>>    1 	call	L_ff_mpeg4audio_get_config$stub
>>>>>>    6 	call	L_memcpy$stub
>>>>>>    2 	call	L_memmove$stub
>>>>>>    1 	call	L_memset$stub
>>>>>>    2 	call	_decode_blocks_ind
>>>>>>    4 	call	_decode_end
>>>>>>   36 	call	_decode_rice
>>>>>>   10 	call	_get_bits_long
>>>>>>   11 	call	_parse_bs_info
>>>>>>    2 	call	_zero_remaining
>>>>>>
>>>>>> All function calls within alsdec.s when using many av_always_inline's.
>>>>>> This is designed to inline the same functions from alsdec.c like the
>>>>>> unpatched alsdec.c would yield without any extra build option:
>>>>>>    1 	call	L1561
>>>>>>    1 	call	L176
>>>>>>    1 	call	L21
>>>>>>    2 	call	L___udivdi3$stub
>>>>>>   10 	call	L_av_freep$stub
>>>>>>    1 	call	L_av_get_bits_per_sample_format$stub
>>>>>>   13 	call	L_av_log$stub
>>>>>>    5 	call	L_av_log_missing_feature$stub
>>>>>>    8 	call	L_av_malloc$stub
>>>>>>    2 	call	L_av_mallocz$stub
>>>>>>    1 	call	L_ff_mpeg4audio_get_config$stub
>>>>>>    1 	call	L_memcpy$stub
>>>>>>    1 	call	L_memmove$stub
>>>>>>    2 	call	L_memset$stub
>>>>>>    8 	call	___inline_memcpy_chk
>>>>>>    2 	call	___inline_memmove_chk
>>>>>>    6 	call	_align_get_bits
>>>>>>    5 	call	_av_ceil_log2
>>>>>>    4 	call	_av_clip
>>>>>>    4 	call	_decode_end
>>>>>>   47 	call	_get_bits
>>>>>>   90 	call	_get_bits1
>>>>>>    3 	call	_get_bits_count
>>>>>>   61 	call	_get_bits_left
>>>>>>   39 	call	_get_bits_long
>>>>>>    4 	call	_get_sbits_long
>>>>>>   60 	call	_get_unary
>>>>>>    2 	call	_init_get_bits
>>>>>>    3 	call	_parse_bs_info
>>>>>>    3 	call	_read_time
>>>>>>    7 	call	_skip_bits
>>>>>>    2 	call	_skip_bits1
>>>>>>    5 	call	_skip_bits_long
>>>>> Not inlining those get_bits etc will certainly slow things down,
>>>>> that's for sure.
>>>>>
>>>>>> So -finline-limit can inline many functions in the object file which are
>>>>>> not part of alsdec.c. Which might be the reason for the performance
>>>>>> difference.
>>>>>>
>>>>>> But using -finline-limit does not yield a speed gain for the unpatched
>>>>>> file! So there might be something else that I don't see.
>>>>>>
>>>>>> The value of 4096 has been chosen randomly. As long as I don't know
>>>>>> exactly why -finline-limit removes the slowdown and that it cannot be
>>>>>> replaced by another approach, there is no need to figure out a more
>>>>>> optimal value...
>>>>> We should do some benchmarks using that flag globally and see what
>>>>> happens.  Maybe we'd gain from using it everywhere.
>>>> Like Michael said, this would be a big test for different platforms and
>>>> compilers which I cannot offer alone so several people would have to do
>>>> this - if a benchmark would indicate that it might be worth testing.
>>>>
>>>> Also, I'm lacking a good idea of how to test this efficiently without
>>>> having other factors like hard drives playing a predominant role, as
>>>> they would if we simply measured the execution time of ffmpeg.
>>> I played around a little with the regression tests and audio decoders.
>>> For most of my tests -finline-limit=4096 makes it a little faster, e.g.
>>>
>>> g726: 47001535 dezicycles -> 41628457 dezicycles (12%)
>>> alac: 12855244 dezicycles -> 12849127 dezicycles ( 0%)
>>> flac:   842020 dezicycles ->   786226 dezicycles ( 7%)
>>> wma:   3663166 dezicycles ->  3197273 dezicycles (14%)
>>>
>>> which is not surprising. Inlining comes at a price: the ffmpeg executable
>>> grew from 5.4 MB to 6.1 MB.
>>> The value used for -finline-limit is 4096; the default is 600 for gcc-4.0.
>> what about video codecs? h264, mpeg4, mpeg2 h263 ?
> 
> Can do tomorrow.

h.261: 34067354 dezicycles -> 33048969 dezicycles ( 3%)
h.263: 32138793 dezicycles -> 30895187 dezicycles ( 4%)
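
For anyone recomputing the percentages: the figures quoted in this thread
are consistent with the gain being (before - after) / after, truncated to a
whole percent. That formula is my reconstruction from the numbers, and the
helper name gain_percent below is hypothetical (it is not part of FFmpeg).
A minimal sketch:

```c
#include <stdint.h>

/* Hypothetical helper, not an FFmpeg function.
 * Returns the gain in whole percent, computed as
 * (before - after) / after and truncated toward zero.
 * A negative result means the "after" build needed more cycles. */
static int gain_percent(uint64_t before, uint64_t after)
{
    if (after == 0)
        return 0; /* avoid division by zero; no meaningful gain */

    int64_t diff = (int64_t)before - (int64_t)after;
    return (int)(diff * 100 / (int64_t)after);
}
```

Calling gain_percent(34067354, 33048969) reproduces the h.261 figure, and
gain_percent(47001535, 41628457) the g726 one.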

For h.264 we are using external libraries and there seems not to be a
regression test on these? (I set timers in libx264.c and h264.c and got no
measurements.)

Anyway, the video regression tests yield dezicycle measurements for only
around 32 runs, which are not really stable...
I tested h263 with a longer video and ended up at 512 runs with 1% more
dezicycles needed - so slightly worse in fact.

So I got the impression that the video decoders do not profit from that
compiler option the way the audio decoders do. Why that is the case is
another question, though.
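
For reference, a minimal sketch of the source-level alternative discussed
earlier in the thread: forcing a small bit-reading helper to be inlined with
a GCC attribute instead of raising -finline-limit. The macro
my_always_inline, the BitReader struct and both functions are hypothetical
stand-ins for illustration only; they are not FFmpeg's av_always_inline or
its get_bits API.

```c
#include <stdint.h>

/* Hypothetical macro: forces inlining on GCC-compatible compilers and
 * falls back to a plain inline hint elsewhere. */
#if defined(__GNUC__)
#    define my_always_inline __attribute__((always_inline)) inline
#else
#    define my_always_inline inline
#endif

/* Hypothetical, simplified bit reader (MSB first). */
typedef struct BitReader {
    const uint8_t *buf;       /* input bytes */
    unsigned       pos;       /* current position in bits */
    unsigned       size_bits; /* total number of readable bits */
} BitReader;

/* Small hot helper: the attribute asks the compiler to inline it at
 * every call site, independent of its size heuristics. */
static my_always_inline unsigned read_bit(BitReader *br)
{
    unsigned bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
    br->pos++;
    return bit;
}

/* Counts consecutive 1-bits, stopping at the first 0-bit (which is
 * consumed), at max, or at the end of the buffer. */
static unsigned read_unary(BitReader *br, unsigned max)
{
    unsigned n = 0;
    while (n < max && br->pos < br->size_bits && read_bit(br))
        n++;
    return n;
}
```

The attribute only affects functions that carry it, whereas -finline-limit
changes the compiler's size threshold for every function in the translation
unit, which may be one reason the two approaches behave differently here.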

-Thilo