[FFmpeg-devel] [RFC] AAC encoder optimizations

Sun Sep 14 02:45:02 CEST 2008

On Sat, Sep 13, 2008 at 10:29:27AM +0300, Kostya wrote:
> Here's the next experimental version.
> I've replaced quantization and codebook determining functions
> and reworked scalefactor search code.
> Now it's ~30 time slower than realtime on Core2.

what about quality? 
Does it beat faac?
Does it reach the quality from that paper?

> 
> Please comment on those functions and give your suggestions.

Your code looks cleaner but i suspect its untested and more buggy
than before. I hope iam wrong, and appologize for the flaming comment if
i am, but if not then we arent moving any closer to svn, rather we move
in circles.
Its not funny to spend time thinking about and suggesting some optimizations
to correct code to see it replaced by totally different and wrong code.
So my main comment is
* total lack of communication, i really would appreciate to know what
  you are planing to change, how, why, if it has been tested, and what
  the results of the tests where.

[...]
> /**
>  * Calculate rate distortion cost for quantizing with given codebook
>  * 
>  * @return quantization distortion
>  */
> static float quantize_band_cost(const float *in, int size, int scale_idx, int cb,
>                                  const float lambda, const float uplim)
> {
>     const float Q = ff_aac_pow2sf_tab[200 + scale_idx - SCALE_ONE_POS + SCALE_DIV_512];
>     int i, j, k;
>     float cost = 0;
>     const int dim = cb < FIRST_PAIR_BT ? 4 : 2;
>     
> START_TIMER
>     if(!cb){
>         for(i = 0; i < size; i++)
>             cost += in[i]*in[i]*lambda;
>         return cost;
>     }
>     for(i = 0; i < size; i += dim){
>         float mincost = INFINITY;
>         int minidx = 0;
>         int minbits = 0;
>         const float *vec = ff_aac_codebook_vectors[cb-1];
>         for(j = 0; j < ff_aac_spectral_sizes[cb-1]; j++, vec += dim){
>             float rd = 0.0f;
>             int curbits = ff_aac_spectral_bits[cb-1][minidx];
>             if(IS_CODEBOOK_UNSIGNED(cb)){
>                 for(k = 0; k < dim; k++){
>                     float t = fabsf(in[i+k]);
>                     float di;
>                     //do not code with escape sequence small values
>                     if(vec[k] == 64.0f && t < 39.0f*Q){
>                         rd = INFINITY;
>                         break;
>                     }
>                     if(vec[k] == 64.0f){//FIXME: slow
>                         if(t >= 165140.0f*Q){ // clipped value
>                             di = t - 165140.0f;
>                             curbits += 21;
>                         }else{
>                             int c = quant(t, 1.0/Q);
>                             di = t - c*cbrt(c)*Q;
>                             curbits += av_log2(c)*2 - 4 + 1;
>                         }
>                     }else{
>                         di = t - vec[k]*Q;
>                     }
>                     if(vec[k] != 0.0f)
>                         curbits++;
>                     rd += di*di*lambda;
>                 }
>             }else{
>                 for(k = 0; k < dim; k++){
>                     float di = in[i+k] - vec[k]*Q;
>                     rd += di*di*lambda;
>                 }            
>             }
>             rd += curbits;
>             if(rd < mincost){
>                 mincost = rd;
>                 minidx = j;
>                 minbits = curbits;
>             }
>         }
>         cost += mincost;
>         if(cost >= uplim)
>             return uplim;

>         minbits = 0;
>         if(IS_CODEBOOK_UNSIGNED(cb))
>             for(j = 0; j < dim; j++)
>                 if(ff_aac_codebook_vectors[cb-1][minidx*2+j] != 0.0f)
>                     minbits++;
>         cost += minbits;

what is this doing? This looks strange
cost already includes the bits.

>     }
> STOP_TIMER("quantize_band")

I assume this code is eating alot of cpu time currently, so lets optimize
it, but please leave the bruteforce code for comparission so we can always
cross check that the optimized code is working correctly.

If we assume lambda and cb where fixed, which they actually are from the
point of view of this function then picking the best vector becomes rather
simple. You just need a look up table, the idea is

instead of
for(j = 0; j < ff_aac_spectral_sizes[cb-1]; j++, vec += dim){
    compare each
    pick best

it would be for dim=2
a= lrintf(in[i  ] * (C/Q));
b= lrintf(in[i+1] * (C/Q));
if(a<-T || a>T){
    rd += best_esc_rd[i  ];
    a=T+1;
}
if(b<-T || b>T){
    rd += best_esc_rd[i+1];
    b=T+1;
}
p= lut[a+T][b+T];
for(j = 0; j < p->count; j++, vec += dim){
    compare each
    pick best

and p->count should generally be a small number like 1-4.

best_esc_rd[] would contain the best rd for each coefficient assuming escape
coding.
One important fact here is that the best escape coded value does not
depend on the code book nor the vector within the code book, thus there
is no need to calculate it more than once for each coeff.
It could be calculated like you do currently possibly with trying
a value or 2 closer to 0 to see if they maybe have a lower RD

whats left is how to calculate the content of the lut and how to
make a small finite number of such tables for lambda, after all
each different lambda would seem to need its own table.

Well each entry in the look up table applies to a square 
(or hypercube for dim=4) of input coeff pairs (or quads) and would also
apply to a range of lambdas. Its content would be a list of cb vectors that
are RD optimal for at least one coeff pair/quad and lambda to which the
entry applies.

Iam not sure how simple it would be to calculate the content of the lut
exactly, though approximating it would be easy, just take a large number
of random coeff pairs/quads and lamdas, calculate the best vector by
the code you already have and add that best value to the corresponding
lut entry when its not already in there.

> 
>     return cost;
> }
> 
> /**
>  * Prepare coefficients for encoding.
>  * 
>  * @return sum of coefficient absolute values
>  */
> static void quantize_and_encode_band(PutBitContext *pb, const float *in, int size,
>                                       int scale_idx, int cb, const float lambda)
> {
>     const float Q  = ff_aac_pow2sf_tab[200 + scale_idx - SCALE_ONE_POS + SCALE_DIV_512];
>     const float IQ = ff_aac_pow2sf_tab[200 - scale_idx + SCALE_ONE_POS - SCALE_DIV_512];
>     int i, j, k;
>     const int dim = cb < FIRST_PAIR_BT ? 4 : 2;
>     if(!cb)
>         return;
> 
>     for(i = 0; i < size; i += dim){
>         float mincost = INFINITY;
>         int minidx = 0;
>         int minbits = 0;
>         const float *vec = ff_aac_codebook_vectors[cb-1];
>         for(j = 0; j < ff_aac_spectral_sizes[cb-1]; j++, vec += dim){
>             float rd = 0.0f;
>             int curbits = ff_aac_spectral_bits[cb-1][minidx];
>             if(IS_CODEBOOK_UNSIGNED(cb)){
>                 for(k = 0; k < dim; k++){
>                     float t = fabsf(in[i+k]);
>                     float di;
>                     if(vec[k] != 0.0f)
>                         curbits++;
>                     //do not code with escape sequence small values
>                     if(vec[k] == 64.0f && t < 39.0f*Q){
>                         rd = INFINITY;
>                         break;
>                     }
>                     if(vec[k] == 64.0f){//FIXME: slow
>                         if(t*IQ >= 165140.0f){ // clipped value
>                             di = t - 165140.0f;
>                             curbits += 21;
>                         }else{
>                             int c = quant(t, IQ);
>                             di = t - c*cbrt(c)*Q;
>                             curbits += av_log2(c)*2 - 4 + 1;
>                         }
>                     }else{
>                         di = t - vec[k]*Q;
>                     }
>                     rd += di*di*lambda;
>                 }
>             }else{
>                 for(k = 0; k < dim; k++){
>                     float di = in[i+k] - vec[k]*Q;
>                     rd += di*di*lambda;
>                 }            
>             }
>             rd += curbits;
>             if(rd < mincost){
>                 mincost = rd;
>                 minidx = j;
>                 minbits = curbits;
>             }
>         }
>         put_bits(pb, ff_aac_spectral_bits[cb-1][minidx], ff_aac_spectral_codes[cb-1][minidx]);
>         if(IS_CODEBOOK_UNSIGNED(cb))
>             for(j = 0; j < dim; j++)
>                 if(ff_aac_codebook_vectors[cb-1][minidx*dim+j] != 0.0f)
>                     put_bits(pb, 1, in[i+j] < 0.0f);
>         if(cb == ESC_BT){
>             for(j = 0; j < 2; j++){
>                 if(ff_aac_codebook_vectors[cb-1][minidx*2+j] == 64.0f){
>                     int coef = quant(in[i+j], IQ);
>                     int len = av_log2(coef);
> 
>                     put_bits(pb, len - 4 + 1, (1 << (len - 4 + 1)) - 2);
>                     put_bits(pb, len, coef & ((1 << len) - 1));
>                 }
>             }
>         }
>     }
> }

this looks a little duplicated from the previous function

> 
> /**
>  * structure used in optimal codebook search
>  */
> typedef struct BandCodingPath {
>     int prev_idx; ///< pointer to the previous path point
>     int codebook; ///< codebook for coding band run
>     float cost;   ///< path cost
>     int run;
> } BandCodingPath;
> 
> /**
>  * Encode band info for single window group bands.
>  */
> static void encode_window_bands_info(AACEncContext *s, SingleChannelElement *sce,
>                                      int win, int group_len, const float lambda)
> {
>     BandCodingPath path[120][12];
>     int w, swb, cb, start, start2, size;
>     int i, j;
>     const int max_sfb = sce->ics.max_sfb;
>     const int run_bits = sce->ics.num_windows == 1 ? 5 : 3;
>     const int run_esc = (1 << run_bits) - 1;
>     int idx, ppos, count;
>     int stackrun[120], stackcb[120], stack_len;
> 
>     start = win*128;
>     for(cb = 0; cb < 12; cb++){
>         path[0][cb].cost = 0.0f;
>         path[0][cb].prev_idx = -1;
>         path[0][cb].run = 0;
>     }
>     for(swb = 0; swb < max_sfb; swb++){
>         start2 = start;
>         size = sce->ics.swb_sizes[swb];
>         if(sce->zeroes[win*16 + swb]){
>             for(cb = 0; cb < 12; cb++){
>                 path[swb+1][cb].prev_idx = cb;
>                 path[swb+1][cb].cost = path[swb][cb].cost;
>                 path[swb+1][cb].run = path[swb][cb].run + 1;
>             }
>         }else{
>             float minrd = INFINITY;
>             int mincb = 0;
>             for(cb = 0; cb < 12; cb++){
>                 float rd = 0.0f;
>                 for(w = 0; w < group_len; w++){
>                     rd += quantize_band_cost(sce->coeffs + start + w*128, size, sce->sf_idx[(win+w)*16+swb], cb, lambda, minrd);
>                 }
>                 path[swb+1][cb].prev_idx = cb;
>                 path[swb+1][cb].cost = path[swb][cb].cost + rd;
>                 path[swb+1][cb].run = path[swb][cb].run + 1;
>                 if(rd < minrd){
>                     minrd = rd;
>                     mincb = cb;
>                 }
>             }
>             for(cb = 0; cb < 12; cb++){
>                 float cost = path[swb][cb].cost + minrd + run_bits + 4;
>                 if(cost < path[swb+1][cb].cost){
>                     path[swb+1][cb].prev_idx = mincb;
>                     path[swb+1][cb].cost = cost;
>                     path[swb+1][cb].run = 1;
>                 }
>             }
>         }
>         start += sce->ics.swb_sizes[swb];
>     }

Have you tested this code against the previous cb viterbi search?
do they both find the global optimum? Iam asking because this code looks
wrong in several ways, but i may be missing something. Also the paper looked
wrong from what i remember though not in that many ways, the algo described
there for choosing cbs did seem to ignore the run escapes. This may or may not
be significant, but requires tests to understand what effect ignoring
run escapes has, the previous code and what i suggested did not ignore them.

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Everything should be made as simple as possible, but not simpler.
-- Albert Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080914/baaa92a5/attachment.pgp>