[FFmpeg-devel] [RFC] AAC encoder optimizations
Kostya
kostya.shishkov
Sun Sep 14 07:34:11 CEST 2008
On Sun, Sep 14, 2008 at 02:45:02AM +0200, Michael Niedermayer wrote:
> On Sat, Sep 13, 2008 at 10:29:27AM +0300, Kostya wrote:
> > Here's the next experimental version.
> > I've replaced the quantization and codebook-determining functions
> > and reworked the scalefactor search code.
> > Now it's ~30 times slower than realtime on a Core2.
>
> what about quality?
> Does it beat faac?
> Does it reach the quality from that paper?
Actually, its quality is IMO worse than before.
> >
> > Please comment on those functions and give your suggestions.
>
> Your code looks cleaner, but I suspect it's untested and more buggy
> than before. I hope I am wrong, and apologize for the flaming comment if
> I am, but if not, then we aren't moving any closer to svn; rather, we move
> in circles.
That's true, except that I do try to run it on short samples and test that
it actually works.
> It's not funny to spend time thinking about and suggesting optimizations
> to correct code, only to see it replaced by totally different and wrong code.
> So my main comment is:
> * total lack of communication; I really would appreciate knowing what
> you are planning to change, how, why, whether it has been tested, and
> what the results of the tests were.
Since I can work on this mostly in spare time during work, I would be grateful
if you suggested some quality tests. Oh, and the headphones here are among the
cheapest available.
For now I'm trying to produce a slow but optimal encoder that can later serve
as a reference when producing a faster but worse encoder, to control the
tradeoffs.
> [...]
> > /**
> >  * Calculate the rate-distortion cost of quantizing a band with the given codebook.
> >  *
> >  * @return rate-distortion cost
> >  */
> > static float quantize_band_cost(const float *in, int size, int scale_idx, int cb,
> >                                 const float lambda, const float uplim)
> > {
> >     const float Q = ff_aac_pow2sf_tab[200 + scale_idx - SCALE_ONE_POS + SCALE_DIV_512];
> >     int i, j, k;
> >     float cost = 0;
> >     const int dim = cb < FIRST_PAIR_BT ? 4 : 2;
> >
> >     START_TIMER
> >     if(!cb){
> >         for(i = 0; i < size; i++)
> >             cost += in[i]*in[i]*lambda;
> >         return cost;
> >     }
> >     for(i = 0; i < size; i += dim){
> >         float mincost = INFINITY;
> >         int minidx = 0;
> >         int minbits = 0;
> >         const float *vec = ff_aac_codebook_vectors[cb-1];
> >         for(j = 0; j < ff_aac_spectral_sizes[cb-1]; j++, vec += dim){
> >             float rd = 0.0f;
> >             int curbits = ff_aac_spectral_bits[cb-1][j];
> >             if(IS_CODEBOOK_UNSIGNED(cb)){
> >                 for(k = 0; k < dim; k++){
> >                     float t = fabsf(in[i+k]);
> >                     float di;
> >                     //do not code small values with the escape sequence
> >                     if(vec[k] == 64.0f && t < 39.0f*Q){
> >                         rd = INFINITY;
> >                         break;
> >                     }
> >                     if(vec[k] == 64.0f){//FIXME: slow
> >                         if(t >= 165140.0f*Q){ // clipped value
> >                             di = t - 165140.0f;
> >                             curbits += 21;
> >                         }else{
> >                             int c = quant(t, 1.0/Q);
> >                             di = t - c*cbrt(c)*Q;
> >                             curbits += av_log2(c)*2 - 4 + 1;
> >                         }
> >                     }else{
> >                         di = t - vec[k]*Q;
> >                     }
> >                     if(vec[k] != 0.0f)
> >                         curbits++;
> >                     rd += di*di*lambda;
> >                 }
> >             }else{
> >                 for(k = 0; k < dim; k++){
> >                     float di = in[i+k] - vec[k]*Q;
> >                     rd += di*di*lambda;
> >                 }
> >             }
> >             rd += curbits;
> >             if(rd < mincost){
> >                 mincost = rd;
> >                 minidx = j;
> >                 minbits = curbits;
> >             }
> >         }
> >         cost += mincost;
> >         if(cost >= uplim)
> >             return uplim;
>
> >         minbits = 0;
> >         if(IS_CODEBOOK_UNSIGNED(cb))
> >             for(j = 0; j < dim; j++)
> >                 if(ff_aac_codebook_vectors[cb-1][minidx*dim+j] != 0.0f)
> >                     minbits++;
> >         cost += minbits;
>
> What is this doing? This looks strange;
> cost already includes the bits.
Those are the sign bits.
> >     }
> >     STOP_TIMER("quantize_band")
>
> I assume this code is eating a lot of CPU time currently, so let's optimize
> it, but please leave the brute-force code for comparison so we can always
> cross-check that the optimized code is working correctly.
>
> If we assume lambda and cb were fixed, which they actually are from the
> point of view of this function, then picking the best vector becomes rather
> simple. You just need a lookup table; the idea is
>
> instead of
>     for(j = 0; j < ff_aac_spectral_sizes[cb-1]; j++, vec += dim){
>         compare each
>         pick best
>
> it would be, for dim=2:
>     a = lrintf(in[i  ] * (C/Q));
>     b = lrintf(in[i+1] * (C/Q));
>     if(a<-T || a>T){
>         rd += best_esc_rd[i  ];
>         a = T+1;
>     }
>     if(b<-T || b>T){
>         rd += best_esc_rd[i+1];
>         b = T+1;
>     }
>     p = lut[a+T][b+T];
>     for(j = 0; j < p->count; j++, vec += dim){
>         compare each
>         pick best
>
> and p->count should generally be a small number like 1-4.
>
> best_esc_rd[] would contain the best RD for each coefficient assuming escape
> coding.
> One important fact here is that the best escape-coded value depends neither
> on the codebook nor on the vector within the codebook, thus there
> is no need to calculate it more than once for each coefficient.
> It could be calculated like you do currently, possibly trying
> a value or two closer to 0 to see if they maybe have a lower RD.
>
> What's left is how to calculate the content of the LUT and how to
> make a small finite number of such tables for lambda; after all,
> each different lambda would seem to need its own table.
>
> Well, each entry in the lookup table applies to a square
> (or hypercube for dim=4) of input coeff pairs (or quads) and would also
> apply to a range of lambdas. Its content would be a list of cb vectors that
> are RD-optimal for at least one coeff pair/quad and lambda to which the
> entry applies.
>
> I am not sure how simple it would be to calculate the content of the LUT
> exactly, though approximating it would be easy: just take a large number
> of random coeff pairs/quads and lambdas, calculate the best vector with
> the code you already have, and add that best value to the corresponding
> LUT entry when it's not already in there.
No problem.
> >
> > return cost;
> > }
> >
> > /**
> >  * Quantize and encode one band's coefficients.
> >  */
> > static void quantize_and_encode_band(PutBitContext *pb, const float *in, int size,
> >                                      int scale_idx, int cb, const float lambda)
> > {
> >     const float  Q = ff_aac_pow2sf_tab[200 + scale_idx - SCALE_ONE_POS + SCALE_DIV_512];
> >     const float IQ = ff_aac_pow2sf_tab[200 - scale_idx + SCALE_ONE_POS - SCALE_DIV_512];
> >     int i, j, k;
> >     const int dim = cb < FIRST_PAIR_BT ? 4 : 2;
> >     if(!cb)
> >         return;
> >
> >     for(i = 0; i < size; i += dim){
> >         float mincost = INFINITY;
> >         int minidx = 0;
> >         int minbits = 0;
> >         const float *vec = ff_aac_codebook_vectors[cb-1];
> >         for(j = 0; j < ff_aac_spectral_sizes[cb-1]; j++, vec += dim){
> >             float rd = 0.0f;
> >             int curbits = ff_aac_spectral_bits[cb-1][j];
> >             if(IS_CODEBOOK_UNSIGNED(cb)){
> >                 for(k = 0; k < dim; k++){
> >                     float t = fabsf(in[i+k]);
> >                     float di;
> >                     if(vec[k] != 0.0f)
> >                         curbits++;
> >                     //do not code small values with the escape sequence
> >                     if(vec[k] == 64.0f && t < 39.0f*Q){
> >                         rd = INFINITY;
> >                         break;
> >                     }
> >                     if(vec[k] == 64.0f){//FIXME: slow
> >                         if(t*IQ >= 165140.0f){ // clipped value
> >                             di = t - 165140.0f;
> >                             curbits += 21;
> >                         }else{
> >                             int c = quant(t, IQ);
> >                             di = t - c*cbrt(c)*Q;
> >                             curbits += av_log2(c)*2 - 4 + 1;
> >                         }
> >                     }else{
> >                         di = t - vec[k]*Q;
> >                     }
> >                     rd += di*di*lambda;
> >                 }
> >             }else{
> >                 for(k = 0; k < dim; k++){
> >                     float di = in[i+k] - vec[k]*Q;
> >                     rd += di*di*lambda;
> >                 }
> >             }
> >             rd += curbits;
> >             if(rd < mincost){
> >                 mincost = rd;
> >                 minidx = j;
> >                 minbits = curbits;
> >             }
> >         }
> >         put_bits(pb, ff_aac_spectral_bits[cb-1][minidx], ff_aac_spectral_codes[cb-1][minidx]);
> >         if(IS_CODEBOOK_UNSIGNED(cb))
> >             for(j = 0; j < dim; j++)
> >                 if(ff_aac_codebook_vectors[cb-1][minidx*dim+j] != 0.0f)
> >                     put_bits(pb, 1, in[i+j] < 0.0f);
> >         if(cb == ESC_BT){
> >             for(j = 0; j < 2; j++){
> >                 if(ff_aac_codebook_vectors[cb-1][minidx*2+j] == 64.0f){
> >                     int coef = quant(fabsf(in[i+j]), IQ);
> >                     int len = av_log2(coef);
> >
> >                     put_bits(pb, len - 4 + 1, (1 << (len - 4 + 1)) - 2);
> >                     put_bits(pb, len, coef & ((1 << len) - 1));
> >                 }
> >             }
> >         }
> >     }
> > }
>
> This looks a little duplicated from the previous function
It is, but it's used for the final encoding. I should cache the values instead.
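The caching might be as simple as a memo keyed on the search parameters, so
the encode pass can reuse what the cost function already found (a sketch; all
names are hypothetical):

```c
// Hypothetical memo entry for the codebook vector search: the cost function
// stores what it found, and the encode pass reuses it when the key matches.
typedef struct {
    int   scale_idx, cb, size;   // key: parameters of the search
    int   minidx;                // value: winning vector index
    float cost;                  // value: RD cost of that vector
    int   valid;
} QuantSearchMemo;

// Check whether a stored result matches the requested search parameters.
int memo_lookup(const QuantSearchMemo *m, int scale_idx, int cb, int size)
{
    return m->valid && m->scale_idx == scale_idx &&
           m->cb == cb && m->size == size;
}

// Record a search result for later reuse.
void memo_store(QuantSearchMemo *m, int scale_idx, int cb, int size,
                int minidx, float cost)
{
    m->scale_idx = scale_idx;
    m->cb        = cb;
    m->size      = size;
    m->minidx    = minidx;
    m->cost      = cost;
    m->valid     = 1;
}
```

One such entry per (band, window) would avoid redoing the whole vector search
in the encode pass for the codebook/scalefactor combination that won.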
> >
> > /**
> > * structure used in optimal codebook search
> > */
> > typedef struct BandCodingPath {
> >     int prev_idx; ///< pointer to the previous path point
> >     int codebook; ///< codebook for coding band run
> >     float cost;   ///< path cost
> >     int run;
> > } BandCodingPath;
> >
> > /**
> > * Encode band info for single window group bands.
> > */
> > static void encode_window_bands_info(AACEncContext *s, SingleChannelElement *sce,
> >                                      int win, int group_len, const float lambda)
> > {
> >     BandCodingPath path[120][12];
> >     int w, swb, cb, start, start2, size;
> >     int i, j;
> >     const int max_sfb = sce->ics.max_sfb;
> >     const int run_bits = sce->ics.num_windows == 1 ? 5 : 3;
> >     const int run_esc = (1 << run_bits) - 1;
> >     int idx, ppos, count;
> >     int stackrun[120], stackcb[120], stack_len;
> >
> >     start = win*128;
> >     for(cb = 0; cb < 12; cb++){
> >         path[0][cb].cost = 0.0f;
> >         path[0][cb].prev_idx = -1;
> >         path[0][cb].run = 0;
> >     }
> >     for(swb = 0; swb < max_sfb; swb++){
> >         start2 = start;
> >         size = sce->ics.swb_sizes[swb];
> >         if(sce->zeroes[win*16 + swb]){
> >             for(cb = 0; cb < 12; cb++){
> >                 path[swb+1][cb].prev_idx = cb;
> >                 path[swb+1][cb].cost = path[swb][cb].cost;
> >                 path[swb+1][cb].run = path[swb][cb].run + 1;
> >             }
> >         }else{
> >             float minrd = INFINITY;
> >             int mincb = 0;
> >             for(cb = 0; cb < 12; cb++){
> >                 float rd = 0.0f;
> >                 for(w = 0; w < group_len; w++){
> >                     rd += quantize_band_cost(sce->coeffs + start + w*128, size,
> >                                              sce->sf_idx[(win+w)*16+swb], cb, lambda, minrd);
> >                 }
> >                 path[swb+1][cb].prev_idx = cb;
> >                 path[swb+1][cb].cost = path[swb][cb].cost + rd;
> >                 path[swb+1][cb].run = path[swb][cb].run + 1;
> >                 if(rd < minrd){
> >                     minrd = rd;
> >                     mincb = cb;
> >                 }
> >             }
> >             for(cb = 0; cb < 12; cb++){
> >                 float cost = path[swb][cb].cost + minrd + run_bits + 4;
> >                 if(cost < path[swb+1][cb].cost){
> >                     path[swb+1][cb].prev_idx = mincb;
> >                     path[swb+1][cb].cost = cost;
> >                     path[swb+1][cb].run = 1;
> >                 }
> >             }
> >         }
> >         start += sce->ics.swb_sizes[swb];
> >     }
>
> Have you tested this code against the previous cb viterbi search?
> Do they both find the global optimum? I am asking because this code looks
> wrong in several ways, but I may be missing something. Also, the paper looked
> wrong from what I remember, though not in that many ways; the algo described
> there for choosing cbs did seem to ignore the run escapes. This may or may not
> be significant, but it requires tests to understand what effect ignoring
> run escapes has; the previous code and what I suggested did not ignore them.
Well, a run escape can happen only for run lengths >= 31 (with 51 max)
for long windows and >= 7 (of 15 bands) for short windows. I will add a check
for that and recheck the code.
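For reference, the run-length cost with escapes follows the usual AAC section
coding scheme, where run_esc (all ones) is emitted repeatedly before the
remainder. A small counting helper (the function name is mine) makes the
thresholds explicit:

```c
// Number of bits needed to code a section run length in AAC section data:
// the escape value (all ones in run_bits bits) is emitted once per time the
// run crosses run_esc, followed by the remainder.
int run_length_bits(int run, int run_bits)
{
    int run_esc = (1 << run_bits) - 1;
    int bits = run_bits;          // the final (non-escape) value
    while (run >= run_esc) {
        bits += run_bits;         // one escape emitted
        run  -= run_esc;
    }
    return bits;
}
```

So for long windows (run_bits = 5) a run of 30 costs 5 bits but a run of 31
costs 10, which is the >= 31 threshold above; for short windows (run_bits = 3)
the threshold is 7.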
> [...]
> --
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> Everything should be made as simple as possible, but not simpler.
> -- Albert Einstein