[FFmpeg-devel] [PATCHv2] lavc/cbrt_tablegen: speed up tablegen

Ganesh Ajjanagadde gajjanag at mit.edu
Fri Jan 8 02:20:55 CET 2016


On Thu, Jan 7, 2016 at 4:48 PM, Michael Niedermayer
<michael at niedermayer.cc> wrote:
> On Mon, Jan 04, 2016 at 06:33:59PM -0800, Ganesh Ajjanagadde wrote:
>> This exploits an approach based on the sieve of Eratosthenes, a popular
>> method for generating prime numbers.
>>
>> Tables are identical to previous ones.
>>
>> Tested with FATE with/without --enable-hardcoded-tables.
>>
>> Sample benchmark (Haswell, GNU/Linux+gcc):
>> prev:
>> 7860100 decicycles in cbrt_tableinit,       1 runs,      0 skips
>> 7777490 decicycles in cbrt_tableinit,       2 runs,      0 skips
>> [...]
>> 7582339 decicycles in cbrt_tableinit,     256 runs,      0 skips
>> 7563556 decicycles in cbrt_tableinit,     512 runs,      0 skips
>>
>> new:
>> 2099480 decicycles in cbrt_tableinit,       1 runs,      0 skips
>> 2044470 decicycles in cbrt_tableinit,       2 runs,      0 skips
>> [...]
>> 1796544 decicycles in cbrt_tableinit,     256 runs,      0 skips
>> 1791631 decicycles in cbrt_tableinit,     512 runs,      0 skips
>>
>> Both small and large run count given as this is called once so small run
>> count may give a better picture, small numbers are fairly consistent,
>> and there is a consistent downward trend from small to large runs,
>> at which point it stabilizes to a new value.
>>
>> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
>> ---
>>  libavcodec/aacdec_fixed.c           |  4 +--
>>  libavcodec/aacdec_template.c        |  2 +-
>>  libavcodec/cbrt_tablegen.h          | 53 ++++++++++++++++++++++++++-----------
>>  libavcodec/cbrt_tablegen_template.c | 12 ++++++++-
>>  4 files changed, 51 insertions(+), 20 deletions(-)
>>
>> diff --git a/libavcodec/aacdec_fixed.c b/libavcodec/aacdec_fixed.c
>> index 396a874..f7b882b 100644
>> --- a/libavcodec/aacdec_fixed.c
>> +++ b/libavcodec/aacdec_fixed.c
>> @@ -155,9 +155,9 @@ static void vector_pow43(int *coefs, int len)
>>      for (i=0; i<len; i++) {
>>          coef = coefs[i];
>>          if (coef < 0)
>> -            coef = -(int)cbrt_tab[-coef];
>> +            coef = -(int)cbrt_tab[-coef].i;
>>          else
>> -            coef = (int)cbrt_tab[coef];
>> +            coef = (int)cbrt_tab[coef].i;
>>          coefs[i] = coef;
>>      }
>>  }
>> diff --git a/libavcodec/aacdec_template.c b/libavcodec/aacdec_template.c
>> index d819958..1380510 100644
>> --- a/libavcodec/aacdec_template.c
>> +++ b/libavcodec/aacdec_template.c
>> @@ -1791,7 +1791,7 @@ static int decode_spectrum_and_dequant(AACContext *ac, INTFLOAT coef[1024],
>>                                          v = -v;
>>                                      *icf++ = v;
>>  #else
>> -                                    *icf++ = cbrt_tab[n] | (bits & 1U<<31);
>> +                                    *icf++ = cbrt_tab[n].i | (bits & 1U<<31);
>>  #endif /* USE_FIXED */
>>                                      bits <<= 1;
>>                                  } else {
>> diff --git a/libavcodec/cbrt_tablegen.h b/libavcodec/cbrt_tablegen.h
>> index 59b5a1d..e3d6634 100644
>> --- a/libavcodec/cbrt_tablegen.h
>> +++ b/libavcodec/cbrt_tablegen.h
>> @@ -26,14 +26,13 @@
>>  #include <stdint.h>
>>  #include <math.h>
>>  #include "libavutil/attributes.h"
>> +#include "libavutil/intfloat.h"
>>  #include "libavcodec/aac_defines.h"
>>
>> -#if USE_FIXED
>> -#define CBRT(x) lrint((x).f * 8192)
>> -#else
>> -#define CBRT(x) x.i
>> -#endif
>> -
>
>> +union ff_int32float64 {
>> +    uint32_t i;
>> +    double   f;
>> +};
>>  #if CONFIG_HARDCODED_TABLES
>>  #if USE_FIXED
>>  #define cbrt_tableinit_fixed()
>> @@ -43,20 +42,42 @@
>>  #include "libavcodec/cbrt_tables.h"
>>  #endif
>>  #else
>> -static uint32_t cbrt_tab[1 << 13];
>> +static union ff_int32float64 cbrt_tab[1 << 13];
>
> this doubles the size of the cpu cache needed at runtime to store
> the same number of elements

Yes, it does; that was a tradeoff I made and forgot to list. One
could of course use floats instead, but that loses a significant
amount of accuracy.

So one could malloc and free a double-precision array (for temporary
storage) at the cost of some code complexity, possible heap
fragmentation, and the possibility of allocation failure (which may be
OK, since aac_decode_init is not guaranteed to succeed anyway; it
allocates memory for the dsp context). Malloc/free costs, AFAIK, on the
order of hundreds of cycles, which is dwarfed by the table generation
cost.

The problem is that it is impossible to say precisely what impact the
larger table has on decoding/encoding performance, and the results of
course vary with hardware. This is the same problem that plagues
static/dynamic table performance analysis.

I see no measurable performance regression in AAC decoding on my
machine because of this. But then, my Haswell setup is not exactly
representative.

>
> [...]
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> it is not once nor twice but times without number that the same ideas make
> their appearance in the world. -- Aristotle
>

