[FFmpeg-devel] [PATCH] Extra build options for ALS (and others)

Thu Dec 10 20:44:05 CET 2009

On 10.12.09 20:33, Alex Converse wrote:
> On Thu, Dec 10, 2009 at 2:24 PM, Thilo Borgmann
> <thilo.borgmann at googlemail.com> wrote:
>> On 02.12.09 12:52, Thilo Borgmann wrote:
>>> Thilo Borgmann wrote:
>>>> Michael Niedermayer wrote:
>>>>> On Mon, Nov 30, 2009 at 04:09:23PM +0100, Thilo Borgmann wrote:
>>>>>> Thilo Borgmann wrote:
>>>>>>> Måns Rullgård wrote:
>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>
>>>>>>>>> Måns Rullgård wrote:
>>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>>
>>>>>>>>>>> Måns Rullgård wrote:
>>>>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>>>>
>>>>>>>>>>>>> Måns Rullgård wrote:
>>>>>>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> recently the need for an extra build option for the ALS decoder arose.
>>>>>>>>>>>>>> Is it impossible to achieve the desired outcome with some combination
>>>>>>>>>>>>>> of always_inline, noinline, and flatten attributes?
>>>>>>>>>>>>> No. See [PATCH] Split reading and decoding of blocks in ALS.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Although I've managed to get the functions from alsdec.c inlined
>>>>>>>>>>>>> manually, as confirmed by grepping the assembler output, it seems
>>>>>>>>>>>>> that manually inlining only the functions from within that .c
>>>>>>>>>>>>> file with this technique is not enough.
>>>>>>>>>>>> I'm confused.  Can it be done in the C code only or not?  This kind of
>>>>>>>>>>>> issue should really not be solved in the makefile.
>>>>>>>>>>> The issue is the big slowdown. The patch that causes this splits a big
>>>>>>>>>>> function into two, which are then called successively.
>>>>>>>>>>>
>>>>>>>>>>> To overcome the slowdown issue, I inspected the functions being inlined
>>>>>>>>>>> with and without the -finline-limit option. I can use av_always_inline
>>>>>>>>>>> for many functions within alsdec.c to get the same functions inlined
>>>>>>>>>>> as -finline-limit does.
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, using -finline-limit removes the slowdown introduced by
>>>>>>>>>>> the patch while using av_always_inline does not.
>>>>>>>>>> So it's not doing the same thing.  What is it doing differently?
>>>>>>>>>> Where did you get the limit number from?
>>>>>>>>>>
>>>>>>>>> All function calls within alsdec.s when using -finline-limit=4096:
>>>>>>>>>    1    call    L1102
>>>>>>>>>    1    call    L138
>>>>>>>>>    1    call    L456
>>>>>>>>>    2    call    L___udivdi3$stub
>>>>>>>>>   10    call    L_av_freep$stub
>>>>>>>>>    1    call    L_av_get_bits_per_sample_format$stub
>>>>>>>>>   12    call    L_av_log$stub
>>>>>>>>>    5    call    L_av_log_missing_feature$stub
>>>>>>>>>    8    call    L_av_malloc$stub
>>>>>>>>>    2    call    L_av_mallocz$stub
>>>>>>>>>    1    call    L_ff_mpeg4audio_get_config$stub
>>>>>>>>>    6    call    L_memcpy$stub
>>>>>>>>>    2    call    L_memmove$stub
>>>>>>>>>    1    call    L_memset$stub
>>>>>>>>>    2    call    _decode_blocks_ind
>>>>>>>>>    4    call    _decode_end
>>>>>>>>>   36    call    _decode_rice
>>>>>>>>>   10    call    _get_bits_long
>>>>>>>>>   11    call    _parse_bs_info
>>>>>>>>>    2    call    _zero_remaining
>>>>>>>>>
>>>>>>>>> All function calls within alsdec.s when using many av_always_inline's.
>>>>>>>>> This is designed to inline the same functions from alsdec.c as the
>>>>>>>>> unpatched alsdec.c would yield without any extra build option:
>>>>>>>>>    1    call    L1561
>>>>>>>>>    1    call    L176
>>>>>>>>>    1    call    L21
>>>>>>>>>    2    call    L___udivdi3$stub
>>>>>>>>>   10    call    L_av_freep$stub
>>>>>>>>>    1    call    L_av_get_bits_per_sample_format$stub
>>>>>>>>>   13    call    L_av_log$stub
>>>>>>>>>    5    call    L_av_log_missing_feature$stub
>>>>>>>>>    8    call    L_av_malloc$stub
>>>>>>>>>    2    call    L_av_mallocz$stub
>>>>>>>>>    1    call    L_ff_mpeg4audio_get_config$stub
>>>>>>>>>    1    call    L_memcpy$stub
>>>>>>>>>    1    call    L_memmove$stub
>>>>>>>>>    2    call    L_memset$stub
>>>>>>>>>    8    call    ___inline_memcpy_chk
>>>>>>>>>    2    call    ___inline_memmove_chk
>>>>>>>>>    6    call    _align_get_bits
>>>>>>>>>    5    call    _av_ceil_log2
>>>>>>>>>    4    call    _av_clip
>>>>>>>>>    4    call    _decode_end
>>>>>>>>>   47    call    _get_bits
>>>>>>>>>   90    call    _get_bits1
>>>>>>>>>    3    call    _get_bits_count
>>>>>>>>>   61    call    _get_bits_left
>>>>>>>>>   39    call    _get_bits_long
>>>>>>>>>    4    call    _get_sbits_long
>>>>>>>>>   60    call    _get_unary
>>>>>>>>>    2    call    _init_get_bits
>>>>>>>>>    3    call    _parse_bs_info
>>>>>>>>>    3    call    _read_time
>>>>>>>>>    7    call    _skip_bits
>>>>>>>>>    2    call    _skip_bits1
>>>>>>>>>    5    call    _skip_bits_long
>>>>>>>> Not inlining those get_bits etc will certainly slow things down,
>>>>>>>> that's for sure.
>>>>>>>>
>>>>>>>>> So -finline-limit can inline many functions in the object file which are
>>>>>>>>> not part of alsdec.c, which might be the reason for the performance
>>>>>>>>> difference.
>>>>>>>>>
>>>>>>>>> But using -finline-limit does not yield a speed gain for the unpatched
>>>>>>>>> file! So there might be something else that I don't see.
>>>>>>>>>
>>>>>>>>> The value of 4096 has been chosen arbitrarily. As long as I don't know
>>>>>>>>> exactly why -finline-limit removes the slowdown and whether it can be
>>>>>>>>> replaced by another approach, there is no need to figure out a more
>>>>>>>>> optimal value...
>>>>>>>> We should do some benchmarks using that flag globally and see what
>>>>>>>> happens.  Maybe we'd gain from using it everywhere.
>>>>>>> As Michael said, this would be a big test across different platforms and
>>>>>>> compilers, which I cannot provide alone, so several people would have to
>>>>>>> do this, if a benchmark indicates that it might be worth testing.
>>>>>>>
>>>>>>> Also, I lack a good idea of how to test this efficiently without other
>>>>>>> factors such as hard drives playing a predominant role, which is what
>>>>>>> happens when measuring the execution time of ffmpeg as a whole.
>>>>>> I played around a little with the regression tests and audio decoders.
>>>>>> For most of my tests -finline-limit=4096 makes it a little faster, e.g.
>>>>>>
>>>>>> g726: 47001535 dezicycles -> 41628457 dezicycles (12%)
>>>>>> alac: 12855244 dezicycles -> 12849127 dezicycles ( 0%)
>>>>>> flac:   842020 dezicycles ->   786226 dezicycles ( 7%)
>>>>>> wma:   3663166 dezicycles ->  3197273 dezicycles (14%)
>>>>>>
>>>>>> which is not surprising. Inlining comes at a price: the ffmpeg executable
>>>>>> grew from 5.4 MB to 6.1 MB.
>>>>>> The value used for -finline-limit is 4096; the default is 600 for gcc-4.0.
>>>>> What about video codecs? h264, mpeg4, mpeg2, h263?
>>>>
>>>> Can do tomorrow.
>>>
>>> h.261: 34067354 dezicycles -> 33048969 dezicycles ( 3%)
>>> h.263: 32138793 dezicycles -> 30895187 dezicycles ( 4%)
>>>
>>> For h.264 we are using external libraries and there seems to be no
>>> regression test for these (I set timers in libx264.c and h264.c and got
>>> no measurements).
>>>
>>> Anyway, the video regression tests yield dezicycle measures for around
>>> 32 runs which are not really stable...
>>> I tested h263 with a longer video and ended at 512 runs with 1% more
>>> dezicycles needed - so slightly worse in fact.
>>>
>>> So I got the impression that the video decoders do not profit from that
>>> compiler option in the way the audio decoders do. Why that is the case is
>>> another question, though.
>>
>> If no one is looking at this anymore, should I assume this patch is
>> rejected?
>>
> 
> Have you considered doing a gcc version check and using #pragma
> optimize in alsdec.c?

Not yet. But this might be a better idea than changing the makefile if
it remains unavoidable for ALS. For most of the others, this seems
to lack the necessary support.
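
For illustration, a guarded pragma could look roughly like the sketch
below. It is only a minimal sketch under assumptions: ALS_GCC_AT_LEAST
and sum_block are hypothetical names that do not exist in alsdec.c, the
4.4 threshold is the GCC release that introduced #pragma GCC optimize,
and "O3" is merely a placeholder option string. Which strings the pragma
actually honors, and whether the inline limit can be raised this way,
would have to be checked against the GCC documentation.

```c
/* Sketch only: ALS_GCC_AT_LEAST and sum_block are hypothetical names,
 * not taken from alsdec.c. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Evaluates to true only for GCC of at least version x.y. clang also
 * defines __GNUC__, so it is excluded explicitly and uses the fallback. */
#if defined(__GNUC__) && !defined(__clang__)
#    define ALS_GCC_AT_LEAST(x, y) \
         (__GNUC__ > (x) || (__GNUC__ == (x) && __GNUC_MINOR__ >= (y)))
#else
#    define ALS_GCC_AT_LEAST(x, y) 0
#endif

/* Applies to all functions defined after this point in the file.
 * "O3" is a placeholder; older or other compilers skip the pragma
 * and build the file with the global flags only. */
#if ALS_GCC_AT_LEAST(4, 4)
#    pragma GCC optimize ("O3")
#endif

/* Stand-in for a decoder helper: sums n samples into a 64-bit total. */
static int64_t sum_block(const int32_t *samples, size_t n)
{
    int64_t sum = 0;
    size_t i;

    assert(samples != NULL || n == 0);
    for (i = 0; i < n; i++)
        sum += samples[i];
    return sum;
}
```

The point of the version check is that the file still builds unchanged
where the pragma is unavailable, so nothing in the makefile has to be
touched.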

-Thilo