[FFmpeg-devel] [PATCH] Extra build options for ALS (and others)

Fri Nov 27 22:23:20 CET 2009

Måns Rullgård wrote:
> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
> 
>> Måns Rullgård wrote:
>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>
>>>> Måns Rullgård wrote:
>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>
>>>>>> Måns Rullgård wrote:
>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> recently the need for an extra build option for the ALS decoder arose.
>>>>>>> Is it impossible to achieve the desired outcome with some combination
>>>>>>> of always_inline, noinline, and flatten attributes?
>>>>>> No. See [PATCH] Split reading and decoding of blocks in ALS.
>>>>>>
>>>>>> Although I've managed to have the functions from alsdec.c inlined
>>>>>> manually, according to the grep'ed output of the assembler code, it
>>>>>> seems that manually inlining only the functions from within that .c
>>>>>> file is not enough with this technique.
>>>>> I'm confused.  Can it be done in the C code only or not?  This kind of
>>>>> issue should really not be solved in the makefile.
>>>> The issue is the big slowdown. The patch that causes this splits a big
>>>> function into two, which are then called successively.
>>>>
>>>> To overcome the slowdown issue, I inspected which functions get inlined
>>>> with and without the -finline-limit option. I can use av_always_inline
>>>> for many functions within alsdec.c to have the same functions inlined
>>>> as -finline-limit does.
>>>>
>>>> Unfortunately, using -finline-limit removes the slowdown introduced by
>>>> the patch while using av_always_inline does not.
>>> So it's not doing the same thing.  What is it doing differently?
>>> Where did you get the limit number from?
>>>
>> All function calls within alsdec.s when using -finline-limit=4096:
>>    1 	call	L1102
>>    1 	call	L138
>>    1 	call	L456
>>    2 	call	L___udivdi3$stub
>>   10 	call	L_av_freep$stub
>>    1 	call	L_av_get_bits_per_sample_format$stub
>>   12 	call	L_av_log$stub
>>    5 	call	L_av_log_missing_feature$stub
>>    8 	call	L_av_malloc$stub
>>    2 	call	L_av_mallocz$stub
>>    1 	call	L_ff_mpeg4audio_get_config$stub
>>    6 	call	L_memcpy$stub
>>    2 	call	L_memmove$stub
>>    1 	call	L_memset$stub
>>    2 	call	_decode_blocks_ind
>>    4 	call	_decode_end
>>   36 	call	_decode_rice
>>   10 	call	_get_bits_long
>>   11 	call	_parse_bs_info
>>    2 	call	_zero_remaining
>>
>> All function calls within alsdec.s when using many av_always_inline's.
>> This is designed to inline the same functions from alsdec.c as the
>> unpatched alsdec.c would yield without any extra build option:
>>    1 	call	L1561
>>    1 	call	L176
>>    1 	call	L21
>>    2 	call	L___udivdi3$stub
>>   10 	call	L_av_freep$stub
>>    1 	call	L_av_get_bits_per_sample_format$stub
>>   13 	call	L_av_log$stub
>>    5 	call	L_av_log_missing_feature$stub
>>    8 	call	L_av_malloc$stub
>>    2 	call	L_av_mallocz$stub
>>    1 	call	L_ff_mpeg4audio_get_config$stub
>>    1 	call	L_memcpy$stub
>>    1 	call	L_memmove$stub
>>    2 	call	L_memset$stub
>>    8 	call	___inline_memcpy_chk
>>    2 	call	___inline_memmove_chk
>>    6 	call	_align_get_bits
>>    5 	call	_av_ceil_log2
>>    4 	call	_av_clip
>>    4 	call	_decode_end
>>   47 	call	_get_bits
>>   90 	call	_get_bits1
>>    3 	call	_get_bits_count
>>   61 	call	_get_bits_left
>>   39 	call	_get_bits_long
>>    4 	call	_get_sbits_long
>>   60 	call	_get_unary
>>    2 	call	_init_get_bits
>>    3 	call	_parse_bs_info
>>    3 	call	_read_time
>>    7 	call	_skip_bits
>>    2 	call	_skip_bits1
>>    5 	call	_skip_bits_long
> 
> Not inlining those get_bits etc will certainly slow things down,
> that's for sure.
> 
>> So -finline-limit can inline many functions in the object file which are
>> not part of alsdec.c, which might be the reason for the performance
>> difference.
>>
>> But using -finline-limit does not yield a speed gain for the unpatched
>> file! So there might be something else that I don't see.
>>
>> The value of 4096 has been chosen randomly. As long as I don't know
>> exactly why -finline-limit removes the slowdown and whether it can be
>> replaced by another approach, there is no need to figure out a more
>> optimal value...
> 
> We should do some benchmarks using that flag globally and see what
> happens.  Maybe we'd gain from using it everywhere.

Like Michael said, this would be a big test across different platforms and
compilers, which I cannot do alone, so several people would have to take
part - if a first benchmark indicates that it might be worth testing.

Also, I lack a good idea of how to test this efficiently, since it means
measuring the execution time of ffmpeg, where other factors like hard
drives can play a predominant role.
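
One way to keep disk I/O out of the measurement might be to load the input
into memory first and time only the decoding call. A minimal sketch of that
idea follows. It assumes a hypothetical dummy_decode() as a stand-in for the
real decoding work (it is not FFmpeg code), and best_cpu_time() is likewise
a made-up helper name. It uses clock(), which reports CPU time rather than
wall-clock time, and keeps the fastest of several runs to reduce noise:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the decoder's hot loop; NOT FFmpeg code.
 * It only touches an in-memory buffer, so no disk access is timed. */
static uint32_t dummy_decode(const uint8_t *buf, size_t size)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < size; i++)
        acc = acc * 31 + buf[i];
    return acc;
}

typedef uint32_t (*work_fn)(const uint8_t *buf, size_t size);

/* Call fn() `runs` times on the same in-memory buffer and return the
 * smallest CPU time of a single call, in seconds. The result of the
 * last call is stored in *result (if non-NULL) so that the compiler
 * cannot discard the work as unused. */
static double best_cpu_time(work_fn fn, const uint8_t *buf, size_t size,
                            int runs, uint32_t *result)
{
    double best = -1.0;

    assert(runs > 0);
    for (int i = 0; i < runs; i++) {
        clock_t start = clock();
        uint32_t r    = fn(buf, size);
        clock_t end   = clock();
        double  t     = (double)(end - start) / CLOCKS_PER_SEC;

        if (best < 0.0 || t < best)
            best = t;
        if (result)
            *result = r;
    }
    return best;
}
```

Building this once with and once without -finline-limit and comparing the
value returned by best_cpu_time(dummy_decode, buf, size, runs, &r) would
then compare the code generation only, not the storage speed.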

But would a general gain from this option make it a good candidate to be
added globally? If yes, could we add it specifically for als for the
time being instead of holding back als decoder development completely?
Benchmarking and testing will surely take a lot of time...

-Thilo