[FFmpeg-devel] [PATCH] Split reading and decoding of blocks in ALS

Thilo Borgmann
Thu Dec 10 20:33:27 CET 2009


On 26.11.09 18:00, Michael Niedermayer wrote:
> On Thu, Nov 26, 2009 at 10:40:56AM +0100, Thilo Borgmann wrote:
>> Michael Niedermayer wrote:
>>> On Wed, Nov 25, 2009 at 01:26:57PM +0100, Thilo Borgmann wrote:
>>>> Michael Niedermayer wrote:
>>>>> On Mon, Nov 23, 2009 at 06:24:00PM -0500, Justin Ruggles wrote:
>>>>>> Thilo Borgmann wrote:
>>>>>>
>>>>>>> Michael Niedermayer wrote:
>>>>>>>> On Mon, Nov 23, 2009 at 11:28:35AM +0100, Thilo Borgmann wrote:
>>>>>>>>> Michael Niedermayer wrote:
>>>>>>>>>> On Sat, Nov 14, 2009 at 02:15:30PM +0100, Thilo Borgmann wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> In order to support multi-channel correlation, reading and decoding of a
>>>>>>>>>>> block have to be separated. This patch introduces the ALSBlockData struct
>>>>>>>>>>> for that purpose.
>>>>>>>>>> [...]
>>>>>>>>>>> +/** Decodes the block data for a constant block
>>>>>>>>>>> + */
>>>>>>>>>>> +static void decode_const_block_data(ALSDecContext *ctx, ALSBlockData *bd)
>>>>>>>>>>> +{
>>>>>>>>>>> +    int smp;
>>>>>>>>>>> +
>>>>>>>>>>>      // write raw samples into buffer
>>>>>>>>>>> -    for (k = 0; k < block_length; k++)
>>>>>>>>>>> -        raw_samples[k] = const_val;
>>>>>>>>>>> +    for (smp = 0; smp < bd->block_length; smp++)
>>>>>>>>>>> +        bd->raw_samples[smp] = bd->const_val;
>>>>>>>>>>>  }
>>>>>>>>>> memset32(dst, val, len){
>>>>>>>>>> }
>>>>>>>>>> would possibly be faster due to not doing bd-> in the loop
>>>>>>>>> Ok will do...
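For the record, such a helper could look roughly like this (a sketch only;
memset32() does not exist yet and the member types are guessed from the
surrounding code):

static void memset32(int32_t *dst, int32_t val, int len)
{
    // fill len 32-bit samples with a constant value; dst, val and len
    // stay in locals/registers, so there are no bd-> loads in the loop
    while (len--)
        *dst++ = val;
}

static void decode_const_block_data(ALSDecContext *ctx, ALSBlockData *bd)
{
    // write raw samples into buffer
    memset32(bd->raw_samples, bd->const_val, bd->block_length);
}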
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> [...]
>>>>>>>>>>> @@ -553,115 +579,132 @@
>>>>>>>>>>>      }
>>>>>>>>>>>  
>>>>>>>>>>>      if (get_bits1(gb))
>>>>>>>>>>> -        *shift_lsbs = get_bits(gb, 4) + 1;
>>>>>>>>>>> +        bd->shift_lsbs = get_bits(gb, 4) + 1;
>>>>>>>>>>>  
>>>>>>>>>>> -    store_prev_samples = (*js_blocks && raw_other) || *shift_lsbs;
>>>>>>>>>>> +    bd->store_prev_samples = (bd->js_blocks && bd->raw_other) || bd->shift_lsbs;
>>>>>>>>>>>  
>>>>>>>>>>> -
>>>>>>>>>>>      if (!sconf->rlslms) {
>>>>>>>>>>>          if (sconf->adapt_order) {
>>>>>>>>>>> -            int opt_order_length = av_ceil_log2(av_clip((block_length >> 3) - 1,
>>>>>>>>>>> +            int opt_order_length = av_ceil_log2(av_clip((bd->block_length >> 3) - 1,
>>>>>>>>>>>                                                  2, sconf->max_order + 1));
>>>>>>>>>>> -            opt_order            = get_bits(gb, opt_order_length);
>>>>>>>>>>> +            bd->opt_order        = get_bits(gb, opt_order_length);
>>>>>>>>>>>          } else {
>>>>>>>>>>> -            opt_order = sconf->max_order;
>>>>>>>>>>> +            bd->opt_order = sconf->max_order;
>>>>>>>>>>>          }
>>>>>>>>>>>  
>>>>>>>>>>> -        if (opt_order) {
>>>>>>>>>>> +        if (bd->opt_order) {
>>>>>>>>>>>              int add_base;
>>>>>>>>>>>  
>>>>>>>>>>>              if (sconf->coef_table == 3) {
>>>>>>>>>>>                  add_base = 0x7F;
>>>>>>>>>>>  
>>>>>>>>>>>                  // read coefficient 0
>>>>>>>>>>> -                quant_cof[0] = 32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>>>>>> +                bd->quant_cof[0] = 32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>>>>>>  
>>>>>>>>>>>                  // read coefficient 1
>>>>>>>>>>> -                if (opt_order > 1)
>>>>>>>>>>> -                    quant_cof[1] = -32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>>>>>> +                if (bd->opt_order > 1)
>>>>>>>>>>> +                    bd->quant_cof[1] = -32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>>>>>>  
>>>>>>>>>>>                  // read coefficients 2 to opt_order
>>>>>>>>>>> -                for (k = 2; k < opt_order; k++)
>>>>>>>>>>> -                    quant_cof[k] = get_bits(gb, 7);
>>>>>>>>>>> +                for (k = 2; k < bd->opt_order; k++)
>>>>>>>>>>> +                    bd->quant_cof[k] = get_bits(gb, 7);
>>>>>>>>>>>              } else {
>>>>>>>>>>>                  int k_max;
>>>>>>>>>>>                  add_base = 1;
>>>>>>>>>>>  
>>>>>>>>>>>                  // read coefficient 0 to 19
>>>>>>>>>>> -                k_max = FFMIN(opt_order, 20);
>>>>>>>>>>> +                k_max = FFMIN(bd->opt_order, 20);
>>>>>>>>>>>                  for (k = 0; k < k_max; k++) {
>>>>>>>>>>>                      int rice_param = parcor_rice_table[sconf->coef_table][k][1];
>>>>>>>>>>>                      int offset     = parcor_rice_table[sconf->coef_table][k][0];
>>>>>>>>>>> -                    quant_cof[k] = decode_rice(gb, rice_param) + offset;
>>>>>>>>>>> +                    bd->quant_cof[k] = decode_rice(gb, rice_param) + offset;
>>>>>>>>>>>                  }
>>>>>>>>>>>  
>>>>>>>>>>>                  // read coefficients 20 to 126
>>>>>>>>>>> -                k_max = FFMIN(opt_order, 127);
>>>>>>>>>>> +                k_max = FFMIN(bd->opt_order, 127);
>>>>>>>>>>>                  for (; k < k_max; k++)
>>>>>>>>>>> -                    quant_cof[k] = decode_rice(gb, 2) + (k & 1);
>>>>>>>>>>> +                    bd->quant_cof[k] = decode_rice(gb, 2) + (k & 1);
>>>>>>>>>>>  
>>>>>>>>>>>                  // read coefficients 127 to opt_order
>>>>>>>>>>> -                for (; k < opt_order; k++)
>>>>>>>>>>> -                    quant_cof[k] = decode_rice(gb, 1);
>>>>>>>>>>> +                for (; k < bd->opt_order; k++)
>>>>>>>>>>> +                    bd->quant_cof[k] = decode_rice(gb, 1);
>>>>>>>>>>>  
>>>>>>>>>>> -                quant_cof[0] = 32 * parcor_scaled_values[quant_cof[0] + 64];
>>>>>>>>>>> +                bd->quant_cof[0] = 32 * parcor_scaled_values[bd->quant_cof[0] + 64];
>>>>>>>>>>>  
>>>>>>>>>>> -                if (opt_order > 1)
>>>>>>>>>>> -                    quant_cof[1] = -32 * parcor_scaled_values[quant_cof[1] + 64];
>>>>>>>>>>> +                if (bd->opt_order > 1)
>>>>>>>>>>> +                    bd->quant_cof[1] = -32 * parcor_scaled_values[bd->quant_cof[1] + 64];
>>>>>>>>>>>              }
>>>>>>>>>>>  
>>>>>>>>>>> -            for (k = 2; k < opt_order; k++)
>>>>>>>>>>> -                quant_cof[k] = (quant_cof[k] << 14) + (add_base << 13);
>>>>>>>>>>> +            for (k = 2; k < bd->opt_order; k++)
>>>>>>>>>>> +                bd->quant_cof[k] = (bd->quant_cof[k] << 14) + (add_base << 13);
>>>>>>>>>> it might make sense to have commonly used variables like opt_order on the
>>>>>>>>>> stack instead of just accessible through a pointer
>>>>>>>>> Yes.
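For illustration, caching it in read_var_block_data() could look roughly
like this (a sketch only, based on the hunk above):

int opt_order;   // local copy so gcc can keep it in a register
                 // instead of reloading bd->opt_order in every loop

if (sconf->adapt_order) {
    int opt_order_length = av_ceil_log2(av_clip((bd->block_length >> 3) - 1,
                                                2, sconf->max_order + 1));
    opt_order = get_bits(gb, opt_order_length);
} else {
    opt_order = sconf->max_order;
}
bd->opt_order = opt_order;   // written back once for later stages

// ... coefficient reading as in the patch, but testing the local ...
for (k = 2; k < opt_order; k++)
    bd->quant_cof[k] = (bd->quant_cof[k] << 14) + (add_base << 13);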
>>>>>>>>>
>>>>>>>>>> besides does this patch lead to any slowdown?
>>>>>>>>> The benchmark results showed a little slowdown, but it is hardly
>>>>>>>>> noticeable on the command line...
>>>>>>>>> We already discussed this issue on the SoC list. It ended with me providing
>>>>>>>>> some assembler-file grep results to check the inlining done by the compiler.
>>>>>>>> was the slowdown fixed or not?
>>>>>>>> "not" is bad
>>>>>>> It was not. You blamed the many bd-> dereferences as a reason for the
>>>>>>> slowdown and requested a check for wrong inlining before considering
>>>>>>> other possibilities to make this faster. I'm not able to tell much about
>>>>>>> the inlining from the assembler files myself, so I provided the grep
>>>>>>> result you asked for.
>>>>>>>
>>>>>>> See:
>>>>>>> https://lists.mplayerhq.hu/pipermail/ffmpeg-soc/2009-October/008364.html
>>>>>>>
>>>>>>> I'm ready to fix that slowdown issue, all I need is a little advice on
>>>>>>> how to do it... The obvious approach of using local variables to store
>>>>>>> bd->something, removing many dereferences that way, did not yield a
>>>>>>> significant reduction of the slowdown.
>>>>>> grepping for "call" and removing duplicates gives:
>>>>>>
>>>>>> combined:
>>>>>>         call    _parse_bs_info
>>>>>>         call    _decode_end
>>>>>>         call    _read_block_data
>>>>>>         call    _zero_remaining
>>>>>>
>>>>>> separate:
>>>>>>         call    _parse_bs_info
>>>>>>         call    _decode_end
>>>>>>         call    _decode_var_block_data
>>>>>>         call    _zero_remaining
>>>>>>         call    _read_var_block_data
>>>>>>         call    _decode_blocks_ind
>>>>>>
>>>>>> It seems that decode_blocks_ind is inlined in the combined version but
>>>>>> not in the separate version.
>>>> Thanks for that!
>>>>
>>>>> so, does adding always_inline help?
>>>>> (note, expect that adding this to one function will make gcc stop
>>>>>  inlining other unrelated functions so you might need many always_inline
>>>>>  to force gcc not to be silly)
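To make clear what that means in code: the change is just the attribute on
the split functions, e.g. (signature taken from the patch, body omitted):

static av_always_inline void decode_const_block_data(ALSDecContext *ctx,
                                                     ALSBlockData *bd)
{
    // body unchanged; av_always_inline (libavutil/attributes.h) maps to
    // __attribute__((always_inline)) on gcc and overrides its inlining
    // size heuristics
}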
>>>> I've redone all the benchmarking and attached the results in a text
>>>> file. The benchmark measures the time of the call to read_frame_data()
>>>> in decode_frame(). I've listed the functions marked av_always_inline,
>>>> the benchmark results and the grep results for calls in the assembler
>>>> code. All done using gcc 4.0.
>>>>
>>>> Unfortunately, using av_always_inline for many functions (so that the
>>>> same calls within alsdec.c remain) causes many other functions not to
>>>> be inlined anymore and therefore does not seem to reduce the slowdown.
>>>> And as you said, just av_always_inline'ing decode_blocks_ind stops many
>>>> other functions from being inlined as well...
>>>>
>>>> Anything else I can test?
>>>
>>> There are also various command line switches for gcc that control
>>> inlining.
>>> You could report this to the gcc developers; not sure what that would
>>> lead to, but I guess it might be worth a try.
>>>
>>> What is certain, though, is that a code split / factorization that
>>> reduces speed by 10% is rejected.
>>
>> Compilation of alsdec.c is done using -O3, which enables the common
>> inlining optimizations. One option untouched by this is -finline-limit.
>> I could not figure out what its default value is, but I tested the
>> separate version (pure patch) without using av_always_inline anywhere.
>> Setting it to an arbitrary 4096, the separate version ran as fast as the
>> combined one.
>>
>> 8028640 dezicycles in sep, 1 runs, 0 skips
>> 9019255 dezicycles in sep, 2 runs, 0 skips
>> 12334895 dezicycles in sep, 4 runs, 0 skips
>> 15384680 dezicycles in sep, 8 runs, 0 skips
>> 15415286 dezicycles in sep, 16 runs, 0 skips
>> 15224882 dezicycles in sep, 32 runs, 0 skips
>> 15370329 dezicycles in sep, 64 runs, 0 skips
>>
>>    2 	call	_decode_blocks_ind
>>    4 	call	_decode_end
>>   36 	call	_decode_rice
>>   10 	call	_get_bits_long
>>   11 	call	_parse_bs_info
>>    2 	call	_zero_remaining
>>
>> The assembler output shows that decode_blocks_ind() is still not
>> inlined without av_always_inline, but {read,decode}_var_block_data() are
>> (compare with the quoted output above).
>>
>> Using -finline-limit with the combined version does not yield any speed
>> gain.
>>
> 
>> So, is using -finline-limit an option? If so, someone with more
>> experience with gcc might suggest a better value than 4096 for it.
> 
> It's an option if you can limit it to the file that needs it; global changes
> would need a lot of testing (nothing wrong with a global change if it is
> globally faster and people do the testing across many codecs on all CPUs and
> compilers, but that's a big task ...)

As the other thread on this topic revealed, playing around with the inlining
options doesn't seem to have enough support to be added to the build...

So I started looking for code-based alternatives, and luckily the first
trivial approach looks promising to me.
To the patched file, I added a wrapper function read_decode_frame()
which successively calls read_frame() and decode_frame(). Also, I
changed the occurrences of if(read() || decode()) into if(read_decode()).

In that case, at least my gcc 4.0 seems to be able to inline the functions
again without adapting any inlining parameters.
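
In code, the wrapper is as trivial as it sounds (sketch; the real parameter
lists are those of read_frame()/decode_frame(), ctx and bd are only used
here for brevity):

// single call site for both halves again, so gcc is willing to inline them
static int read_decode_frame(ALSDecContext *ctx, ALSBlockData *bd)
{
    // keeps the short-circuit semantics of the previous
    // if (read_frame(...) || decode_frame(...)) checks
    return read_frame(ctx, bd) || decode_frame(ctx, bd);
}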

So minimal overhead and no slowdown anymore:

current trunk:
27230413 dezicycles in comb, 64 runs, 0 skips
27580758 dezicycles in comb, 64 runs, 0 skips
27221314 dezicycles in comb, 64 runs, 0 skips

old proposed patch als_splitreaddecode.patch:
30033126 dezicycles in separate, 64 runs, 0 skips
30331049 dezicycles in separate, 64 runs, 0 skips
31350367 dezicycles in separate, 64 runs, 0 skips

new proposed approach:
27530408 dezicycles in read_decode, 64 runs, 0 skips
27394679 dezicycles in read_decode, 64 runs, 0 skips
27122675 dezicycles in read_decode, 64 runs, 0 skips


I added a patch for clarification, but it still does not address the
other issues brought up here, so it is not meant to be applied yet.

Would this approach be ok?

-Thilo


