[FFmpeg-devel] [PATCH] Extra build options for ALS (and others)

Alex Converse alex.converse
Thu Dec 10 20:33:00 CET 2009


On Thu, Dec 10, 2009 at 2:24 PM, Thilo Borgmann
<thilo.borgmann at googlemail.com> wrote:
> On 02.12.09 12:52, Thilo Borgmann wrote:
>> Thilo Borgmann wrote:
>>> Michael Niedermayer wrote:
>>>> On Mon, Nov 30, 2009 at 04:09:23PM +0100, Thilo Borgmann wrote:
>>>>> Thilo Borgmann wrote:
>>>>>> Måns Rullgård wrote:
>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>
>>>>>>>> Måns Rullgård wrote:
>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>
>>>>>>>>>> Måns Rullgård wrote:
>>>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>>>
>>>>>>>>>>>> Måns Rullgård wrote:
>>>>>>>>>>>>> Thilo Borgmann <thilo.borgmann at googlemail.com> writes:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> recently the need for an extra build option for the ALS decoder arose.
>>>>>>>>>>>>> Is it impossible to achieve the desired outcome with some combination
>>>>>>>>>>>>> of always_inline, noinline, and flatten attributes?
>>>>>>>>>>>> No. See [PATCH] Split reading and decoding of blocks in ALS.
>>>>>>>>>>>>
>>>>>>>>>>>> Although I've managed to have the functions from alsdec.c inlined
>>>>>>>>>>>> manually, according to the grep'ed output of the assembler code, it seems
>>>>>>>>>>>> that manually inlining functions from within that .c file alone is not
>>>>>>>>>>>> enough when using this technique.
>>>>>>>>>>> I'm confused.  Can it be done in the C code only or not?  This kind of
>>>>>>>>>>> issue should really not be solved in the makefile.
>>>>>>>>>> The issue is the big slowdown. The patch that causes this splits a big
>>>>>>>>>> function into two, which are then called successively.
>>>>>>>>>>
>>>>>>>>>> To overcome the slowdown issue, I inspected the functions being inlined
>>>>>>>>>> with and without the -finline-limit option. I can use av_always_inline
>>>>>>>>>> for many functions within alsdec.c to have the same functions inlined
>>>>>>>>>> as -finline-limit does.
>>>>>>>>>>
>>>>>>>>>> Unfortunately, using -finline-limit removes the slowdown introduced by
>>>>>>>>>> the patch while using av_always_inline does not.
>>>>>>>>> So it's not doing the same thing.  What is it doing differently?
>>>>>>>>> Where did you get the limit number from?
>>>>>>>>>
>>>>>>>> All function calls within alsdec.s when using -finline-limit=4096:
>>>>>>>>    1    call    L1102
>>>>>>>>    1    call    L138
>>>>>>>>    1    call    L456
>>>>>>>>    2    call    L___udivdi3$stub
>>>>>>>>   10    call    L_av_freep$stub
>>>>>>>>    1    call    L_av_get_bits_per_sample_format$stub
>>>>>>>>   12    call    L_av_log$stub
>>>>>>>>    5    call    L_av_log_missing_feature$stub
>>>>>>>>    8    call    L_av_malloc$stub
>>>>>>>>    2    call    L_av_mallocz$stub
>>>>>>>>    1    call    L_ff_mpeg4audio_get_config$stub
>>>>>>>>    6    call    L_memcpy$stub
>>>>>>>>    2    call    L_memmove$stub
>>>>>>>>    1    call    L_memset$stub
>>>>>>>>    2    call    _decode_blocks_ind
>>>>>>>>    4    call    _decode_end
>>>>>>>>   36    call    _decode_rice
>>>>>>>>   10    call    _get_bits_long
>>>>>>>>   11    call    _parse_bs_info
>>>>>>>>    2    call    _zero_remaining
>>>>>>>>
>>>>>>>> All function calls within alsdec.s when using many av_always_inline's.
>>>>>>>> This is designed to inline the same functions from alsdec.c as the
>>>>>>>> unpatched alsdec.c would without any extra build option:
>>>>>>>>    1    call    L1561
>>>>>>>>    1    call    L176
>>>>>>>>    1    call    L21
>>>>>>>>    2    call    L___udivdi3$stub
>>>>>>>>   10    call    L_av_freep$stub
>>>>>>>>    1    call    L_av_get_bits_per_sample_format$stub
>>>>>>>>   13    call    L_av_log$stub
>>>>>>>>    5    call    L_av_log_missing_feature$stub
>>>>>>>>    8    call    L_av_malloc$stub
>>>>>>>>    2    call    L_av_mallocz$stub
>>>>>>>>    1    call    L_ff_mpeg4audio_get_config$stub
>>>>>>>>    1    call    L_memcpy$stub
>>>>>>>>    1    call    L_memmove$stub
>>>>>>>>    2    call    L_memset$stub
>>>>>>>>    8    call    ___inline_memcpy_chk
>>>>>>>>    2    call    ___inline_memmove_chk
>>>>>>>>    6    call    _align_get_bits
>>>>>>>>    5    call    _av_ceil_log2
>>>>>>>>    4    call    _av_clip
>>>>>>>>    4    call    _decode_end
>>>>>>>>   47    call    _get_bits
>>>>>>>>   90    call    _get_bits1
>>>>>>>>    3    call    _get_bits_count
>>>>>>>>   61    call    _get_bits_left
>>>>>>>>   39    call    _get_bits_long
>>>>>>>>    4    call    _get_sbits_long
>>>>>>>>   60    call    _get_unary
>>>>>>>>    2    call    _init_get_bits
>>>>>>>>    3    call    _parse_bs_info
>>>>>>>>    3    call    _read_time
>>>>>>>>    7    call    _skip_bits
>>>>>>>>    2    call    _skip_bits1
>>>>>>>>    5    call    _skip_bits_long
>>>>>>> Not inlining those get_bits etc will certainly slow things down,
>>>>>>> that's for sure.
>>>>>>>
>>>>>>>> So -finline-limit can inline many functions in the object file which are
>>>>>>>> not part of alsdec.c, which might be the reason for the performance
>>>>>>>> difference.
>>>>>>>>
>>>>>>>> But using -finline-limit does not yield a speed gain for the unpatched
>>>>>>>> file! So there might be something else that I don't see.
>>>>>>>>
>>>>>>>> The value of 4096 was chosen arbitrarily. As long as I don't know
>>>>>>>> exactly why -finline-limit removes the slowdown, and whether it can be
>>>>>>>> replaced by another approach, there is no need to figure out a more
>>>>>>>> optimal value...
>>>>>>> We should do some benchmarks using that flag globally and see what
>>>>>>> happens.  Maybe we'd gain from using it everywhere.
>>>>>> Like Michael said, this would be a big test for different platforms and
>>>>>> compilers which I cannot offer alone so several people would have to do
>>>>>> this - if a benchmark would indicate that it might be worth testing.
>>>>>>
>>>>>> Also, I lack a good idea of how to test this efficiently. Measuring the
>>>>>> execution time of ffmpeg lets other factors, like hard drives, play a
>>>>>> predominant role.
>>>>> I played around a little with the regression tests and audio decoders.
>>>>> For most of my tests -finline-limit=4096 makes it a little faster, e.g.
>>>>>
>>>>> g726: 47001535 dezicycles -> 41628457 dezicycles (12%)
>>>>> alac: 12855244 dezicycles -> 12849127 dezicycles ( 0%)
>>>>> flac:   842020 dezicycles ->   786226 dezicycles ( 7%)
>>>>> wma:   3663166 dezicycles ->  3197273 dezicycles (14%)
>>>>>
>>>>> which is not surprising. Inlining comes at a price: the ffmpeg executable
>>>>> grew from 5.4 MB to 6.1 MB.
>>>>> The value used for -finline-limit is 4096; the default is 600 for gcc-4.0.
>>>> What about video codecs? h264, mpeg4, mpeg2, h263?
>>>
>>> Can do tomorrow.
>>
>> h.261: 34067354 dezicycles -> 33048969 dezicycles ( 3%)
>> h.263: 32138793 dezicycles -> 30895187 dezicycles ( 4%)
>>
>> For h.264 we are using external libraries and there seems not to be a
>> regression test on these? (set timer in libx264.c and h264.c and got no
>> measurements)
>>
>> Anyway, the video regression tests yield dezicycle measures for around
>> 32 runs which are not really stable...
>> I tested h263 with a longer video and ended at 512 runs with 1% more
>> dezicycles needed - so slightly worse in fact.
>>
>> So I got the impression that the video decoders do not profit from that
>> compiler option in a way the audio decoders do. Why that is the case, is
>> another question though.
>
> If no one is looking at this anymore, should I assume this patch is
> rejected?
>

Have you considered doing a gcc version check and using #pragma GCC
optimize in alsdec.c?
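
A minimal sketch of what I mean, assuming GCC 4.4 or later, where the
pragma is spelled `#pragma GCC optimize` and can be scoped with
push_options/pop_options. The macro HAVE_PRAGMA_OPTIMIZE and the function
sum_residuals are hypothetical names for illustration only, and "O3" is
just a placeholder. I have not checked whether an option such as
"inline-limit=4096" is accepted through this pragma, nor whether it would
remove the slowdown:

```c
/* Enable the pragma only on real GCC >= 4.4; clang also defines
 * __GNUC__ but does not implement this pragma, so exclude it. */
#if defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 4))
#define HAVE_PRAGMA_OPTIMIZE 1
#else
#define HAVE_PRAGMA_OPTIMIZE 0
#endif

#if HAVE_PRAGMA_OPTIMIZE
#pragma GCC push_options
#pragma GCC optimize ("O3")  /* placeholder option, applies until pop_options */
#endif

/* Hypothetical hot function standing in for the decoder's inner loops. */
int sum_residuals(const int *buf, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

#if HAVE_PRAGMA_OPTIMIZE
#pragma GCC pop_options      /* restore the command-line options */
#endif
```

That would keep the special casing inside the C file rather than the
makefile, and other compilers or older gcc versions would just build the
file with the normal flags.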


