[FFmpeg-devel] [PATCH/RFC] Add some dsputil functions useful for AAC decoder
Måns Rullgård
mans
Sun Sep 20 16:02:37 CEST 2009
Robert Swain <robert.swain at gmail.com> writes:
> Hello,
>
> 2009/9/20 Måns Rullgård <mans at mansr.com>:
>> Michael Niedermayer <michaelni at gmx.at> writes:
>>> On Fri, Sep 18, 2009 at 11:11:55PM +0100, Mans Rullgard wrote:
>>>> This patch adds a few dsputil functions that can be used in the AAC
>>>> decoder.
>>>>
>>>> With trivial NEON versions of these functions, the AAC decoder gets
>>>> ~1.6x faster on Cortex-A8, and better NEON code will push that even
>>>> further.
>>>>
>>>> I will readily admit that some of the names in this patch are rubbish,
>>>> so please suggest something better.  Other enhancements are obviously
>>>> welcome too.
>>> [...]
>>>
>>>> diff --git a/libavcodec/dsputil.h b/libavcodec/dsputil.h
>>>> index d9d7d16..61252f5 100644
>>>> --- a/libavcodec/dsputil.h
>>>> +++ b/libavcodec/dsputil.h
>>>> @@ -397,6 +397,14 @@ typedef struct DSPContext {
>>>>      /* assume len is a multiple of 8, and arrays are 16-byte aligned */
>>>>      void (*int32_to_float_fmul_scalar)(float *dst, const int *src, float mul, int len);
>>>>      void (*vector_clipf)(float *dst /* align 16 */, const float *src /* align 16 */, float min, float max, int len /* align 16 */);
>>>> +    void (*vector_fmul_scalar)(float *dst, const float *src, float mul,
>>>> +                               int len);
>>>> +    void (*vector_fmul_scalar_vp[2])(float *dst, const float *src,
>>>> +                                     const float **vp, float mul, int len);
>>>> +    void (*vp_fmul_scalar[2])(float *dst, const float **vp,
>>>> +                              float mul, int len);
>
> vp means vector pair? How common are these operations?
I've no idea what it means. That's why I solicited suggestions for
better names.
>>>> +    float (*scalarproduct_float)(const float *v1, const float *v2, int len);
>>>> +    void (*butterflies_float)(float *v1, float *v2, int len);
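For readers without the patch at hand, here is a rough C sketch of what three of these primitives could look like. The patch body is not quoted in this mail, so the semantics below are assumptions inferred from the names and signatures: a scale by a constant, a dot product, and an in-place sum/difference butterfly. The name scalarproduct_float_c is hypothetical; the other two names match symbols in the profiles further down. The vp variants are left out because their exact behaviour cannot be read off the prototypes alone.

```c
/* Assumed semantics: dst[i] = src[i] * mul for i in [0, len). */
void vector_fmul_scalar_c(float *dst, const float *src, float mul, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = src[i] * mul;
}

/* Assumed semantics: return the sum of v1[i] * v2[i] (dot product). */
float scalarproduct_float_c(const float *v1, const float *v2, int len)
{
    float p = 0.0f;
    for (int i = 0; i < len; i++)
        p += v1[i] * v2[i];
    return p;
}

/* Assumed semantics: in place, v1[i] becomes the sum and
 * v2[i] becomes the difference (old v1[i] minus old v2[i]). */
void butterflies_float_c(float *v1, float *v2, int len)
{
    for (int i = 0; i < len; i++) {
        float t = v1[i] - v2[i];
        v1[i] += v2[i];
        v2[i] = t;
    }
}
```

Loops of this shape have no dependency between iterations (apart from the accumulator in the dot product), which is what makes them easy targets for SIMD versions.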
>
> [...]
>
>>> also, without seeing how these all are used i do have the feeling that
>>> they maybe are too small primitives and that bigger chunks of aac code
>>> should be optimized to increase flexibility and reduce call overhead ...
>
> Why would optimising a larger chunk of code increase flexibility?
>
>> See attached patch.
>
> len can be calculated just inside the for () loop over i.
That's a minor detail. Does the overall approach make sense to you?
>>> and i would suggest to only optimize code when it matters speedwise and
>>> not when the code just makes up <1% of the cpu time, alex reply made
>>> me think that this may apply to some code in there ...
>>
>> 1.6x speedup matters to me.
>
> +1. But, what effect on performance does each function (or function
> type) permit?
I guess that depends on how the stream was encoded. Here's oprofile
output for one file on Cortex-A8 using the C version of these
functions:
samples  %        symbol name
1274     31.8261  decode_ics
676      16.8873  butterflies_float_c                !!!
493      12.3158  vector_fmul_scalar_vp_2_c          !!!
203       5.0712  fft_pass_neon
176       4.3967  ff_imdct_half_neon
169       4.2218  ff_vector_fmul_window_neon
150       3.7472  aac_decode_frame
138       3.4474  vector_fmul_scalar_c               !!!
106       2.6480  vector_fmul_scalar_vp_4_c          !!!
85        2.1234  fft16_neon
76        1.8986  ff_float_to_int16_interleave_neon
64        1.5988  vp_fmul_scalar_2_c                 !!!
41        1.0242  imdct_and_windowing
35        0.8743  output_packet
30        0.7494  fft8_neon
22        0.5496  av_rescale_rnd
22        0.5496  vp_fmul_scalar_4_c                 !!!
And here for another one:
samples  %        symbol name
940      24.7173  butterflies_float_c                !!!
847      22.2719  decode_ics
344       9.0455  vector_fmul_scalar_vp_4_c          !!!
288       7.5730  fft_pass_neon
221       5.8112  ff_imdct_half_neon
201       5.2853  ff_vector_fmul_window_neon
99        2.6032  vector_fmul_scalar_vp_2_c          !!!
98        2.5769  ff_float_to_int16_interleave_neon
98        2.5769  fft16_neon
91        2.3928  aac_decode_frame
89        2.3403  vp_fmul_scalar_4_c                 !!!
60        1.5777  vp_fmul_scalar_2_c                 !!!
46        1.2096  fft8_neon
40        1.0518  av_encode
36        0.9466  imdct_and_windowing
30        0.7889  output_packet
19        0.4996  __divdi3
17        0.4470  __udivsi3
16        0.4207  vector_fmul_scalar_c               !!!
As you can see, the relative time spent in these functions varies a
lot depending on the sample.
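To put a number on that, the percentages of the symbols marked "!!!" can be added up per file and fed into Amdahl's law. The sketch below does this in C. The helper names (marked_share, overall_speedup, print_report) and the local speedup factor of 4 are made up for illustration; only the percentages come from the two listings above. Since the listings show just the top symbols, treat the result as a rough estimate rather than a measurement.

```c
#include <stdio.h>

/* "%" values of the symbols marked "!!!" in the two listings above. */
const double file1_marked[] = {16.8873, 12.3158, 3.4474, 2.6480, 1.5988, 0.5496};
const double file2_marked[] = {24.7173, 9.0455, 2.6032, 2.3403, 1.5777, 0.4207};

/* Add up n percentages and return the total as a fraction of run time. */
double marked_share(const double *percent, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += percent[i];
    return sum / 100.0;
}

/* Amdahl's law: overall speedup when a fraction p of the run time
 * is made s times faster and the rest is left unchanged. */
double overall_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Print the marked share and an estimated overall speedup for both files.
 * The local factor s = 4 is a hypothetical value, not a measured one. */
void print_report(void)
{
    const double s = 4.0;
    double p1 = marked_share(file1_marked, 6);
    double p2 = marked_share(file2_marked, 6);
    printf("file 1: share %.4f, overall speedup at s=%.0f: %.2fx\n",
           p1, s, overall_speedup(p1, s));
    printf("file 2: share %.4f, overall speedup at s=%.0f: %.2fx\n",
           p2, s, overall_speedup(p2, s));
}
```

Calling print_report() from a main() prints both estimates. The marked functions come to roughly 37% of the samples in the first file and roughly 41% in the second.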
It is my opinion that everything which can be optimised should be
optimised.
--
Måns Rullgård
mans at mansr.com