[FFmpeg-devel] [RFC] AAC Encoder

Tue Aug 12 19:09:36 CEST 2008

On Tue, Aug 12, 2008 at 02:14:20PM +0200, Michael Niedermayer wrote:
> On Tue, Aug 12, 2008 at 08:33:15AM +0300, Kostya wrote:
> > On Mon, Aug 11, 2008 at 08:03:49PM +0200, Michael Niedermayer wrote:
> > > On Mon, Aug 11, 2008 at 12:18:03PM +0300, Kostya wrote:
> > > > Here's my implementation of AAC encoder.
> > > > 
> > > > Since AAC decoder is not fully in SVN yet, I'm omitting data structures
> > > > and declarations used from it, I'll sort it after decoder commit.
> > > > The main difference is that psy model fills data structures and encoder
> > > > writes bitstream with them, so several structures have additional members.
> > > > 
> > > > I've also stripped out model based on 3GPP 26.403 to make review easier.
> > > 
> [...]
> > [...]
> > > > +            }while(w < cpe->ch[channel].ics.num_windows && cpe->ch[channel].ics.group_len[w]);
> > > > +        }
> > > > +        if(score < best){
> > > > +            best = score;
> > > > +            bestcb = cb;
> > > > +        }
> > > > +    }
> > > > +    return bestcb;
> > > > +}
> > > 
> > > This code does not seem to consider the distortion
> > > please use rate distortion not just the number of bits!
> >  
> > Umm, why?
> > By this point psy model produced coefficients to encode and encoder
> > tries to write them in a minimum number of bits.
> > It's lossless operation.
> 
> We have a problem here, because this isnt optimal
> It seems we agree that each bit counts the same no matter what psy says.
> Maybe a example will best show the problem
> lets assume we have a coeff of 11.5, the psy model decides that a change
> to 10 would be ok for the given audio quality/bitrate and thus outputs 10
> let us assume that storing a coefficient of 10 and one of 11 both take
> 7 bit, the decission to store 10 clearly was bad. OTOH it could have
> been that storing 11 requires twice as many bits in which case the
> decission would have been good. One simply cannot quantize values optimally
> without considering the number of bits they need. This is even more true
> for vector quantization based codecs than it is for scalar quantization.
> it may very well be that psy thinks that both {-1,1} and {-2,0} are an
> equally good representation of the exact {-1.5,0.5} but its not until
> the encoding that it becomes known which of the two need fewer bits.
> 
> Id say the psy model should return an array of perceptual weights W[i]
> and the bitstream encode should choose the (global) minimum of
> bits[i] + distortion(W[i], coeff[i]-stored[i])
> where distortion is a appropriate function whos output matches how audible
> a change is, this may be a simple W[i]*(coeff[i]-stored[i])^2 but iam no
> psychoacoustic expert so there may be better choices.
> 
> And of course the suggested system above needs to be compared to what you
> have currenty so that we can be sure it really does sound better.

I understand what you mean but I suspect that is of complexity O("shaving piglets").

I followed 3GPP TS26.403 which relies on perceptual entropy which more
or less corresponds to the number of bits needed to code it since it's easier.
Anyway, it would be easy to implement psy model that will consider
real coding cost vs. distortion.

> [...]
> > > > +        }
> > > > +        if(count){
> > > > +            while(count >= esc){
> > > > +                put_bits(&s->pb, bits, esc);
> > > > +                count -= esc;
> > > > +            }
> > > > +            put_bits(&s->pb, bits, count);
> > > > +        }
> > > > +    }
> > > > +}
> > > 
> > > > +
> > > > +/**
> > > > + * Encode scalefactors.
> > > > + */
> > > > +static void encode_scale_factor_data(AVCodecContext *avctx, AACEncContext *s, ChannelElement *cpe, int channel)
> > > 
> > > s/_data/s/
> > > 
> > > 
> > > > +{
> > > > +    int off = cpe->ch[channel].mixing_gain, diff;
> > > > +    int i, w;
> > > > +
> > > > +    for(w = 0; w < cpe->ch[channel].ics.num_windows; w++){
> > > > +        if(cpe->ch[channel].ics.group_len[w]) continue;
> > > > +        for(i = 0; i < cpe->ch[channel].ics.max_sfb; i++){
> > > > +            if(!cpe->ch[channel].zeroes[w][i]){
> > > > +                diff = cpe->ch[channel].sf_idx[w][i] - off + SCALE_DIFF_ZERO;
> > > 
> > > > +                if(diff < 0 || diff > 120) av_log(avctx, AV_LOG_ERROR, "Scalefactor difference is too big to be coded\n");
> > > 
> > > assert() ?
> > 
> > Is it recommended to be used here?  
> 
> i do not know, question is what you mean by this check?
> is it that it cannot happen? -> assert()
> is it that it would break the bitstream -> bug that should be fixed
> is it that it is a inevitable shortcomming of the bitstream format
>    so that there is no way for an encoder to encode some kinds of
>    input -> it should probably be just WARNING and more clear

The second one. It's convenient to catch psy model errors.

> [...]
> > [...]
> > > > +static av_always_inline float lowpass_iir_filter(LPFilterState *s, const float *coeffs, float in)
> > > > +{
> > > > +    memmove(s->x, s->x + 1, sizeof(s->x) - sizeof(s->x[0]));
> > > > +    memmove(s->y, s->y + 1, sizeof(s->y) - sizeof(s->y[0]));
> > > 
> > > no we will not do memmoves in av_always_inline functions
> > 
> > loop then 
> 
> hmpf, thats not what i meant.
> write the filter so it does not need to move data around.
> i think unrolling the loop by 4 or something would allow fixed indexes
> to be used.
> 
> 
> >  
> > > > +    s->x[IIR_ORDER] = in * coeffs[1];
> > > > +    //FIXME: made only for 4th order filter
> > > > +    s->y[IIR_ORDER] = (s->x[0] + s->x[4])*1 + (s->x[1] + s->x[3])*4 + s->x[2]*6
> > > > +                    + coeffs[2]*s->y[0] + coeffs[3]*s->y[1] + coeffs[4]*s->y[2] + coeffs[5]*s->y[3];
> > > > +    return s->y[IIR_ORDER];
> > > > +}
> > > 
> > > the lowpass code belongs in a seperate file and patch and should probably
> > > be applied outside the codec. Unless of course it uses some feedback from the
> > > psy model ...
> > > besides lowpass, things should be run through a highpass filter to remove
> > > the DC component, its inaudible and expensive to represent after MDCT transform
> > 
> > ok, I will try to prepare such patch
> > and any pointers about how to remove DC? channel->coeffs[ch][0] = 0.0f seems enough to me 
> 
> ^_^;;
> no, channel->coeffs[ch][0] = 0.0f is not good.
> MDCT is BAD at storing DC, just try it, it will smear a flat 1.0 over all
> coefficients. coeff[0] should be the biggest but its not the only one.
> iam no IIR filter expert so i cannot recommand any specific one. Also this
> is not overly important, iam fine if you just add a FIXME so its not forgotten.
> If you do want to work on this, well (citeseer, google, wikipedia and maybe a
> quick grep in AC3/AAC specs for highpass would be places to find some filter)

ok, I will look 

> >  
> > > [...]
> > > 
> > > > +void ff_aac_psy_preprocess(AACPsyContext *ctx, int16_t *audio, int16_t *dest, int tag, int type)
> > > > +{
> > > > +    int chans = type == ID_CPE ? 2 : 1;
> > > > +    const int chstride = ctx->avctx->channels;
> > > > +    int i, ch;
> > > > +    float t[2];
> > > > +
> > > 
> > > > +    if(chans == 1 || (ctx->flags & PSY_MODEL_NO_PREPROC) == PSY_MODEL_NO_PREPROC){
> > > > +        for(ch = 0; ch < chans; ch++){
> > > > +            for(i = 0; i < 1024; i++){
> > > > +                dest[i * chstride + ch] = audio[i * chstride + ch];
> > > > +            }
> > > > +        }
> > > 
> > > memcpy()
> > > 
> > > 
> > > > +    }else{
> > > > +        for(i = 0; i < 1024; i++){
> > > > +            if(ctx->flags & PSY_MODEL_NO_ST_ATT){
> > > > +                for(ch = 0; ch < 2; ch++)
> > > > +                    t[ch] = audio[i * chstride + ch];
> > > 
> > > > +            }else{
> > > > +                t[0] = audio[i * chstride + 0] * (0.5 + ctx->stereo_att) + audio[i * chstride + 1] * (0.5 - ctx->stereo_att);
> > > > +                t[1] = audio[i * chstride + 0] * (0.5 - ctx->stereo_att) + audio[i * chstride + 1] * (0.5 + ctx->stereo_att);
> > > > +            }
> > > 
> > > this stereo attenuation looks like it should be in a filter before the codec
> > 
> > logically it is, as it's applied to raw audio before it's used by encoder
> > but I will gladly move that code to libavfilter when possible 
> 
> add a FIXME at least please,ideally at the top of the file so its easily found
> needs to be done when other prerequesties are available.

that's a good idea, I will move all FIXMEs to the top of the files

patch will be sent tomorrow

> [...]
> 
> -- 
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
> 
> The greatest way to live with honor in this world is to be what we pretend
> to be. -- Socrates