[FFmpeg-devel] [PATCH/RFC] Add some dsputil functions useful for AAC decoder

Sun Sep 20 16:02:37 CEST 2009

Robert Swain <robert.swain at gmail.com> writes:

> Hello,
>
> 2009/9/20 Måns Rullgård <mans at mansr.com>:
>> Michael Niedermayer <michaelni at gmx.at> writes:
>>> On Fri, Sep 18, 2009 at 11:11:55PM +0100, Mans Rullgard wrote:
>>>> This patch adds a few dsputil functions that can be used in the AAC
>>>> decoder.
>>>>
>>>> With trivial NEON versions of these functions, the AAC decoder gets
>>>> ~1.6x faster on Cortex-A8, and better NEON code will push that even
>>>> further.
>>>>
>>>> I will readily admit that some of the names in this patch are rubbish,
>>>> so please suggest something better.  Other enhancements are obviously
>>>> welcome too.
>>> [...]
>>>
>>>> diff --git a/libavcodec/dsputil.h b/libavcodec/dsputil.h
>>>> index d9d7d16..61252f5 100644
>>>> --- a/libavcodec/dsputil.h
>>>> +++ b/libavcodec/dsputil.h
>>>> @@ -397,6 +397,14 @@ typedef struct DSPContext {
>>>>      /* assume len is a multiple of 8, and arrays are 16-byte aligned */
>>>>      void (*int32_to_float_fmul_scalar)(float *dst, const int *src, float mul, int len);
>>>>      void (*vector_clipf)(float *dst /* align 16 */, const float *src /* align 16 */, float min, float max, int len /* align 16 */);
>>>> +    void (*vector_fmul_scalar)(float *dst, const float *src, float mul,
>>>> +                               int len);
>>>> +    void (*vector_fmul_scalar_vp[2])(float *dst, const float *src,
>>>> +                                     const float **vp, float mul, int len);
>>>> +    void (*vp_fmul_scalar[2])(float *dst, const float **vp,
>>>> +                              float mul, int len);
>
> vp means vector pair? How common are these operations?

I've no idea what it means.  That's why I solicited suggestions for
better names.

>>>> +    float (*scalarproduct_float)(const float *v1, const float *v2, int len);
>>>> +    void (*butterflies_float)(float *v1, float *v2, int len);
>
> [...]
>
>>> also, without seeing how these all are used i do have the feeling that
>>> they maybe are too small primitives and that bigger chunks of aac code
>>> should be optimized to increase flexibility and reduce call overhead ...
>
> Why would optimising a larger chunk of code increase flexibility?
>
>> See attached patch.
>
> len can be calculated just inside the for () loop over i.

That's a minor detail.  Does the overall approach make sense to you?

>>> and i would suggest to only optimize code when it matters speedwise and
>>> not when the code just makes up <1% of the cpu time; Alex's reply made
>>> me think that this may apply to some code in there ...
>>
>> 1.6x speedup matters to me.
>
> +1. But, what effect on performance does each function (or function
> type) permit?

I guess that depends on how the stream was encoded.  Here's oprofile
output for one file on Cortex-A8 using the C version of these
functions:

samples  %        symbol name
1274     31.8261  decode_ics
676      16.8873  butterflies_float_c                   !!!
493      12.3158  vector_fmul_scalar_vp_2_c             !!!
203       5.0712  fft_pass_neon
176       4.3967  ff_imdct_half_neon
169       4.2218  ff_vector_fmul_window_neon
150       3.7472  aac_decode_frame
138       3.4474  vector_fmul_scalar_c                  !!!
106       2.6480  vector_fmul_scalar_vp_4_c             !!!
85        2.1234  fft16_neon
76        1.8986  ff_float_to_int16_interleave_neon
64        1.5988  vp_fmul_scalar_2_c                    !!!
41        1.0242  imdct_and_windowing
35        0.8743  output_packet
30        0.7494  fft8_neon
22        0.5496  av_rescale_rnd
22        0.5496  vp_fmul_scalar_4_c                    !!!

And here for another one:

samples  %        symbol name
940      24.7173  butterflies_float_c                   !!!
847      22.2719  decode_ics
344       9.0455  vector_fmul_scalar_vp_4_c             !!!
288       7.5730  fft_pass_neon
221       5.8112  ff_imdct_half_neon
201       5.2853  ff_vector_fmul_window_neon
99        2.6032  vector_fmul_scalar_vp_2_c             !!!
98        2.5769  ff_float_to_int16_interleave_neon
98        2.5769  fft16_neon
91        2.3928  aac_decode_frame
89        2.3403  vp_fmul_scalar_4_c                    !!!
60        1.5777  vp_fmul_scalar_2_c                    !!!
46        1.2096  fft8_neon
40        1.0518  av_encode
36        0.9466  imdct_and_windowing
30        0.7889  output_packet
19        0.4996  __divdi3
17        0.4470  __udivsi3
16        0.4207  vector_fmul_scalar_c                  !!!

As you can see, the relative time spent in these functions varies a
lot depending on the sample.
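
For readers without the patch at hand, here is a minimal sketch of what
plain C versions of three of the simpler functions might look like.  The
signatures are the ones from the dsputil.h hunk quoted above; the loop
bodies are an assumption inferred from the function names, not copied
from the patch, and the _c suffix simply follows the symbol names in the
profiles.

```c
/* Hypothetical C reference versions; the bodies are assumptions
 * inferred from the names and signatures, not taken from the patch. */

/* Assumed semantics: in-place butterfly,
 * v1[i] becomes v1[i] + v2[i] and v2[i] becomes v1[i] - v2[i]. */
static void butterflies_float_c(float *v1, float *v2, int len)
{
    int i;
    for (i = 0; i < len; i++) {
        float t = v1[i] - v2[i];  /* difference, using the old v1[i] */
        v1[i] += v2[i];           /* sum */
        v2[i] = t;
    }
}

/* Assumed semantics: dst[i] = src[i] * mul. */
static void vector_fmul_scalar_c(float *dst, const float *src, float mul,
                                 int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = src[i] * mul;
}

/* Assumed semantics: dot product of v1 and v2. */
static float scalarproduct_float_c(const float *v1, const float *v2, int len)
{
    float p = 0.0f;
    int i;
    for (i = 0; i < len; i++)
        p += v1[i] * v2[i];
    return p;
}
```

Loops of this shape do very little arithmetic per element, which is
presumably why even trivial SIMD versions help as much as they do.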

It is my opinion that everything which can be optimised should be
optimised.

-- 
Måns Rullgård
mans at mansr.com