[FFmpeg-devel] [PATCH] Split reading and decoding of blocks in ALS
Thilo Borgmann
thilo.borgmann
Wed Nov 25 13:26:57 CET 2009
Michael Niedermayer schrieb:
> On Mon, Nov 23, 2009 at 06:24:00PM -0500, Justin Ruggles wrote:
>> Thilo Borgmann wrote:
>>
>>> Michael Niedermayer schrieb:
>>>> On Mon, Nov 23, 2009 at 11:28:35AM +0100, Thilo Borgmann wrote:
>>>>> Michael Niedermayer schrieb:
>>>>>> On Sat, Nov 14, 2009 at 02:15:30PM +0100, Thilo Borgmann wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> in order to support multi-channel correlation, reading and decoding of a
>>>>>>> block has to be separated. This patch introduces ALSBockData struct for
>>>>>>> that purpose.
>>>>>> [...]
>>>>>>> +/** Decodes the block data for a constant block
>>>>>>> + */
>>>>>>> +static void decode_const_block_data(ALSDecContext *ctx, ALSBlockData *bd)
>>>>>>> +{
>>>>>>> + int smp;
>>>>>>> +
>>>>>>> // write raw samples into buffer
>>>>>>> - for (k = 0; k < block_length; k++)
>>>>>>> - raw_samples[k] = const_val;
>>>>>>> + for (smp = 0; smp < bd->block_length; smp++)
>>>>>>> + bd->raw_samples[smp] = bd->const_val;
>>>>>>> }
>>>>>> memset32(dst, val, len){
>>>>>> }
>>>>>> would possibly be faster due to not doing bd-> in the loop
>>>>> Ok will do...
>>>>>
>>>>>
>>>>>> [...]
>>>>>>> @@ -553,115 +579,132 @@
>>>>>>> }
>>>>>>>
>>>>>>> if (get_bits1(gb))
>>>>>>> - *shift_lsbs = get_bits(gb, 4) + 1;
>>>>>>> + bd->shift_lsbs = get_bits(gb, 4) + 1;
>>>>>>>
>>>>>>> - store_prev_samples = (*js_blocks && raw_other) || *shift_lsbs;
>>>>>>> + bd->store_prev_samples = (bd->js_blocks && bd->raw_other) || bd->shift_lsbs;
>>>>>>>
>>>>>>> -
>>>>>>> if (!sconf->rlslms) {
>>>>>>> if (sconf->adapt_order) {
>>>>>>> - int opt_order_length = av_ceil_log2(av_clip((block_length >> 3) - 1,
>>>>>>> + int opt_order_length = av_ceil_log2(av_clip((bd->block_length >> 3) - 1,
>>>>>>> 2, sconf->max_order + 1));
>>>>>>> - opt_order = get_bits(gb, opt_order_length);
>>>>>>> + bd->opt_order = get_bits(gb, opt_order_length);
>>>>>>> } else {
>>>>>>> - opt_order = sconf->max_order;
>>>>>>> + bd->opt_order = sconf->max_order;
>>>>>>> }
>>>>>>>
>>>>>>> - if (opt_order) {
>>>>>>> + if (bd->opt_order) {
>>>>>>> int add_base;
>>>>>>>
>>>>>>> if (sconf->coef_table == 3) {
>>>>>>> add_base = 0x7F;
>>>>>>>
>>>>>>> // read coefficient 0
>>>>>>> - quant_cof[0] = 32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>> + bd->quant_cof[0] = 32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>>
>>>>>>> // read coefficient 1
>>>>>>> - if (opt_order > 1)
>>>>>>> - quant_cof[1] = -32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>> + if (bd->opt_order > 1)
>>>>>>> + bd->quant_cof[1] = -32 * parcor_scaled_values[get_bits(gb, 7)];
>>>>>>>
>>>>>>> // read coefficients 2 to opt_order
>>>>>>> - for (k = 2; k < opt_order; k++)
>>>>>>> - quant_cof[k] = get_bits(gb, 7);
>>>>>>> + for (k = 2; k < bd->opt_order; k++)
>>>>>>> + bd->quant_cof[k] = get_bits(gb, 7);
>>>>>>> } else {
>>>>>>> int k_max;
>>>>>>> add_base = 1;
>>>>>>>
>>>>>>> // read coefficient 0 to 19
>>>>>>> - k_max = FFMIN(opt_order, 20);
>>>>>>> + k_max = FFMIN(bd->opt_order, 20);
>>>>>>> for (k = 0; k < k_max; k++) {
>>>>>>> int rice_param = parcor_rice_table[sconf->coef_table][k][1];
>>>>>>> int offset = parcor_rice_table[sconf->coef_table][k][0];
>>>>>>> - quant_cof[k] = decode_rice(gb, rice_param) + offset;
>>>>>>> + bd->quant_cof[k] = decode_rice(gb, rice_param) + offset;
>>>>>>> }
>>>>>>>
>>>>>>> // read coefficients 20 to 126
>>>>>>> - k_max = FFMIN(opt_order, 127);
>>>>>>> + k_max = FFMIN(bd->opt_order, 127);
>>>>>>> for (; k < k_max; k++)
>>>>>>> - quant_cof[k] = decode_rice(gb, 2) + (k & 1);
>>>>>>> + bd->quant_cof[k] = decode_rice(gb, 2) + (k & 1);
>>>>>>>
>>>>>>> // read coefficients 127 to opt_order
>>>>>>> - for (; k < opt_order; k++)
>>>>>>> - quant_cof[k] = decode_rice(gb, 1);
>>>>>>> + for (; k < bd->opt_order; k++)
>>>>>>> + bd->quant_cof[k] = decode_rice(gb, 1);
>>>>>>>
>>>>>>> - quant_cof[0] = 32 * parcor_scaled_values[quant_cof[0] + 64];
>>>>>>> + bd->quant_cof[0] = 32 * parcor_scaled_values[bd->quant_cof[0] + 64];
>>>>>>>
>>>>>>> - if (opt_order > 1)
>>>>>>> - quant_cof[1] = -32 * parcor_scaled_values[quant_cof[1] + 64];
>>>>>>> + if (bd->opt_order > 1)
>>>>>>> + bd->quant_cof[1] = -32 * parcor_scaled_values[bd->quant_cof[1] + 64];
>>>>>>> }
>>>>>>>
>>>>>>> - for (k = 2; k < opt_order; k++)
>>>>>>> - quant_cof[k] = (quant_cof[k] << 14) + (add_base << 13);
>>>>>>> + for (k = 2; k < bd->opt_order; k++)
>>>>>>> + bd->quant_cof[k] = (bd->quant_cof[k] << 14) + (add_base << 13);
>>>>>> it might make sense to have commonly used variables like opt_order on the
>>>>>> stack intead of just accessable through a pointer
>>>>> Yes.
>>>>>
>>>>>> besides does this patch lead to any slowdown?
>>>>> The benchmark results showed a little slowdown but at least hardly
>>>>> noticable on the command line...
>>>>> We had this issue at the soc list. Ended up with me providing some
>>>>> assembler file grep results for checking inlining done by the compiler.
>>>> was the slowdown fixed or not?
>>>> "not" is bad
>>> It was not. You blamed the many bd-> to be a reason for the slowdown and
>>> requested a check for wrong inlining before considering other
>>> possibilities to make this faster. I'm not able to see anything about
>>> the inlining stuff in assembler files so I provided a grep result you
>>> asked for.
>>>
>>> See:
>>> https://lists.mplayerhq.hu/pipermail/ffmpeg-soc/2009-October/008364.html
>>>
>>> I'm ready to fix that slowdown issue, all I need is a little advise
>>> about how to... the obvious way to use local variables to store
>>> bd->something and removing many dereferences that way did not yield a
>>> significant reduction of the slowdown.
>> grepping for "call" and removing duplicates gives:
>>
>> combined:
>> call _parse_bs_info
>> call _decode_end
>> call _read_block_data
>> call _zero_remaining
>>
>> separate:
>> call _parse_bs_info
>> call _decode_end
>> call _decode_var_block_data
>> call _zero_remaining
>> call _read_var_block_data
>> call _decode_blocks_ind
>>
>> It seems that decode_blocks_ind is inlined in the combined version but
>> not in the separate version.
Thanks for that!
>
> so, does adding always_inline help?
> (note, expect that adding this to one function will make gcc stop
> inlining other unrelated functions so you might need many always_inline
> to force gcc not to be silly)
I've redone all the benchmarking and attached the results in a text
file. The benchmarked stops time for the call to read_frame_data() in
decode_frame(). I've printed the functions with av_always_inline, the
benchmark results and the grep result for calls in the assembler code.
All done using gcc 4.0.
Unfortunately, using av_always_inline for many functions (so that the
same calls within alsdec.c remain) reveals many other functions not
inlined anymore and therefore seems not to reduce the slowdown. And as
you said, just av_always_inline'ing decode_blocks_ind lets many other
functione not being inlined any more...
Anything else I can test?
-Thilo
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: benchmark.txt
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20091125/efde23de/attachment.txt>
More information about the ffmpeg-devel
mailing list