[FFmpeg-devel] [RFC] AAC Encoder

Kostya kostya.shishkov
Tue Aug 12 07:33:15 CEST 2008


On Mon, Aug 11, 2008 at 08:03:49PM +0200, Michael Niedermayer wrote:
> On Mon, Aug 11, 2008 at 12:18:03PM +0300, Kostya wrote:
> > Here's my implementation of AAC encoder.
> > 
> > Since the AAC decoder is not fully in SVN yet, I'm omitting the data structures
> > and declarations used from it; I'll sort that out after the decoder commit.
> > The main difference is that the psy model fills data structures and the encoder
> > writes the bitstream from them, so several structures have additional members.
> > 
> > I've also stripped out the model based on 3GPP 26.403 to make the review easier.
> 
[...]
> > +/** default channel configurations */
> > +static const uint8_t aac_chan_configs[6][5] = {
> > + {1, ID_SCE},                         // 1 channel  - single channel element
> > + {1, ID_CPE},                         // 2 channels - channel pair
> > + {2, ID_SCE, ID_CPE},                 // 3 channels - center + stereo
> > + {3, ID_SCE, ID_CPE, ID_SCE},         // 4 channels - front center + stereo + back center
> > + {3, ID_SCE, ID_CPE, ID_CPE},         // 5 channels - front center + stereo + back stereo
> > + {4, ID_SCE, ID_CPE, ID_CPE, ID_LFE}, // 6 channels - front center + stereo + back stereo + LFE
> > +};
> 
> can this maybe be used to simplify the decoder?
 
No, the decoder relies on a totally different model of representing channels
in order to accept custom channel configurations.
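For context, the table is just a lookup indexed by channel count: the first entry is the number of syntax elements, the rest are their types. A minimal sketch of consuming such a table, assuming illustrative ID values (the real ones live in FFmpeg's aac.h):

```c
#include <stdint.h>

/* Hypothetical element IDs mirroring the patch; actual values are defined
 * in the AAC decoder headers. */
enum { ID_SCE = 0, ID_CPE = 1, ID_LFE = 3 };

/* Default channel configurations as in the patch: first entry is the
 * number of syntax elements, the rest are the element types. */
static const uint8_t aac_chan_configs[6][5] = {
    {1, ID_SCE},
    {1, ID_CPE},
    {2, ID_SCE, ID_CPE},
    {3, ID_SCE, ID_CPE, ID_SCE},
    {3, ID_SCE, ID_CPE, ID_CPE},
    {4, ID_SCE, ID_CPE, ID_CPE, ID_LFE},
};

/* Count how many PCM channels a configuration covers: an SCE or LFE
 * carries one channel, a CPE carries two. */
static int config_channel_count(int channels)
{
    const uint8_t *cfg = aac_chan_configs[channels - 1];
    int i, count = 0;
    for (i = 1; i <= cfg[0]; i++)
        count += (cfg[i] == ID_CPE) ? 2 : 1;
    return count;
}
```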

[...]
> > +/**
> > + * Perform windowing and MDCT.
> > + */
> > +static void analyze(AVCodecContext *avctx, AACEncContext *s, ChannelElement *cpe, short *audio, int channel)
> > +{
> > +    int i, j, k;
> > +    const float * lwindow = cpe->ch[channel].ics.use_kb_window[0] ? ff_aac_kbd_long_1024 : ff_aac_sine_long_1024;
> > +    const float * swindow = cpe->ch[channel].ics.use_kb_window[0] ? ff_aac_kbd_short_128 : ff_aac_sine_short_128;
> > +    const float * pwindow = cpe->ch[channel].ics.use_kb_window[1] ? ff_aac_kbd_short_128 : ff_aac_sine_short_128;
> > +
> > +    if (cpe->ch[channel].ics.window_sequence[0] != EIGHT_SHORT_SEQUENCE) {
> > +        memcpy(s->output, cpe->ch[channel].saved, sizeof(float)*1024);
> > +        if(cpe->ch[channel].ics.window_sequence[0] == LONG_STOP_SEQUENCE){
> > +            memset(s->output, 0, sizeof(s->output[0]) * 448);
> > +            for(i = 448; i < 576; i++)
> > +                s->output[i] = cpe->ch[channel].saved[i] * pwindow[i - 448];
> > +            for(i = 576; i < 704; i++)
> > +                s->output[i] = cpe->ch[channel].saved[i];
> > +        }
> > +        if(cpe->ch[channel].ics.window_sequence[0] != LONG_START_SEQUENCE){
> > +            j = channel;
> > +            for (i = 0; i < 1024; i++, j += avctx->channels){
> > +                s->output[i+1024]         = audio[j] / 512.0 * lwindow[1024 - i - 1];
> > +                cpe->ch[channel].saved[i] = audio[j] / 512.0 * lwindow[i];
> > +            }
> > +        }else{
> > +            j = channel;
> > +            for(i = 0; i < 448; i++, j += avctx->channels)
> > +                s->output[i+1024]         = audio[j] / 512.0;
> > +            for(i = 448; i < 576; i++, j += avctx->channels)
> > +                s->output[i+1024]         = audio[j] / 512.0 * swindow[576 - i - 1];
> > +            memset(s->output+1024+576, 0, sizeof(s->output[0]) * 448);
> > +            j = channel;
> > +            for(i = 0; i < 1024; i++, j += avctx->channels)
> > +                cpe->ch[channel].saved[i] = audio[j] / 512.0;
> > +        }
> > +        ff_mdct_calc(&s->mdct1024, cpe->ch[channel].coeffs, s->output, s->tmp);
> > +    }else{
> > +        j = channel;
> > +        for (k = 0; k < 1024; k += 128) {
> > +            for(i = 448 + k; i < 448 + k + 256; i++)
> > +                s->output[i - 448 - k] = (i < 1024) ? cpe->ch[channel].saved[i] : audio[channel + (i-1024)*avctx->channels] / 512.0;
> > +            s->dsp.vector_fmul        (s->output,     k ?  swindow : pwindow, 128);
> > +            s->dsp.vector_fmul_reverse(s->output+128, s->output+128, swindow, 128);
> > +            ff_mdct_calc(&s->mdct128, cpe->ch[channel].coeffs + k, s->output, s->tmp);
> > +        }
> > +        j = channel;
> 
> > +        for(i = 0; i < 1024; i++, j += avctx->channels)
> > +            cpe->ch[channel].saved[i] = audio[j] / 512.0;
> 
> can't all this rescaling be done somewhere else? window / quant / dequant?
 
It's funny you should say that: the 3GPP reference code merges the rescaling into
quantization. I think I will do the same.

[...]
> > +            }while(w < cpe->ch[channel].ics.num_windows && cpe->ch[channel].ics.group_len[w]);
> > +        }
> > +        if(score < best){
> > +            best = score;
> > +            bestcb = cb;
> > +        }
> > +    }
> > +    return bestcb;
> > +}
> 
> This code does not seem to consider the distortion
> please use rate distortion not just the number of bits!
 
Umm, why?
By this point the psy model has produced the coefficients to encode, and the
encoder just tries to write them in the minimum number of bits.
It's a lossless operation.
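Since the quantized values are already fixed at this point, the codebook choice changes only the bit count, never the distortion, so the selection reduces to a plain minimum search. A trivial sketch, assuming a hypothetical per-codebook cost table computed from the Huffman tables:

```c
/* bits_per_cb[cb] = total bits needed to code the band's (already fixed)
 * quantized coefficients with codebook cb; infeasible codebooks would be
 * given a huge cost by the caller.  Distortion is identical for all of
 * them, so picking the cheapest is optimal. */
static int pick_codebook(const int *bits_per_cb, int ncb)
{
    int cb, best = 0;
    for (cb = 1; cb < ncb; cb++)
        if (bits_per_cb[cb] < bits_per_cb[best])
            best = cb;
    return best;
}
```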
 
> > +
> > +/**
> > + * Encode one scalefactor band with selected codebook.
> > + */
> > +static void encode_codebook(AACEncContext *s, ChannelElement *cpe, int channel, int start, int size, int cb)
> 
> hmmmmmmm
> encode_coeffs
> encode_spectral
> encode_codebook_indexes
> ...
> but not encode_codebook

I will rename it 
 
> > +{
> > +    const uint8_t *bits = ff_aac_spectral_bits[aac_cb_info[cb].cb_num];
> > +    const uint16_t *codes = ff_aac_spectral_codes[aac_cb_info[cb].cb_num];
> 
> vertical align
> 
> 
> > +    const int dim = (aac_cb_info[cb].flags & CB_PAIRS) ? 2 : 4;
> > +    int i, j, idx;
> > +
> 
> > +    if(!bits || !codes) return;
> 
> one of these is redundant i guess
> 
> 
> [...]
> > +/**
> > + * Encode information about codebooks used for scalefactor bands coding.
> > + */
> > +static void encode_section_data(AVCodecContext *avctx, AACEncContext *s, ChannelElement *cpe, int channel)
> 
> encode_band_types() i suspect ?

Yes, but it's called section_data in the standard.
I will pick a clearer name, though.
 
> > +{
> > +    int i, w;
> > +    int bits = cpe->ch[channel].ics.num_windows == 1 ? 5 : 3;
> > +    int esc = (1 << bits) - 1;
> > +    int count;
> > +
> > +    for(w = 0; w < cpe->ch[channel].ics.num_windows; w++){
> > +        if(cpe->ch[channel].ics.group_len[w]) continue;
> > +        count = 0;
> > +        for(i = 0; i < cpe->ch[channel].ics.max_sfb; i++){
> > +            if(!i || cpe->ch[channel].band_type[w][i] != cpe->ch[channel].band_type[w][i-1]){
> > +                if(count){
> > +                    while(count >= esc){
> > +                        put_bits(&s->pb, bits, esc);
> > +                        count -= esc;
> > +                    }
> > +                    put_bits(&s->pb, bits, count);
> > +                }
> > +                put_bits(&s->pb, 4, cpe->ch[channel].band_type[w][i]);
> > +                count = 1;
> > +            }else
> > +                count++;
> 
> The code that selects the band types should really already know the number
> of identical band types.
> And I think that should be determined with the Viterbi algorithm.
> It's pretty trivial to encode each band with several band types (that's
> already done as far as I can see) and then find the globally optimal path
> through the table using the Viterbi algorithm.

I will do that separately, as it would complicate the code.
It deserves a TODO comment for now.
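For the TODO, Michael's suggestion amounts to a shortest-path search over (band, codebook) states, where switching codebooks between adjacent bands costs one extra section header. A hedged sketch of that dynamic program (band_bits, hdr_bits and the array bounds are illustrative; the real per-band costs would come from the spectral Huffman tables):

```c
#include <limits.h>

#define MAX_SFB 64
#define MAX_CB  12

/* Viterbi-style band type selection: band_bits[i][cb] is the cost in bits
 * of coding band i with codebook cb.  Each run of a new codebook pays a
 * section header of hdr_bits.  Fills band_type[] and returns total bits. */
static int select_band_types(int band_bits[][MAX_CB], int nbands,
                             int ncb, int hdr_bits, int band_type[])
{
    int cost[MAX_SFB + 1][MAX_CB];
    int prev[MAX_SFB + 1][MAX_CB];
    int i, cb, pcb, bestcb, total;

    for (cb = 0; cb < ncb; cb++)
        cost[0][cb] = hdr_bits; /* the first section always pays a header */

    for (i = 0; i < nbands; i++) {
        for (cb = 0; cb < ncb; cb++) {
            int best = INT_MAX / 2, bestp = 0;
            for (pcb = 0; pcb < ncb; pcb++) {
                int c = cost[i][pcb] + (pcb == cb ? 0 : hdr_bits);
                if (c < best) { best = c; bestp = pcb; }
            }
            cost[i + 1][cb] = best + band_bits[i][cb];
            prev[i + 1][cb] = bestp;
        }
    }

    /* backtrack from the cheapest final state */
    bestcb = 0;
    for (cb = 1; cb < ncb; cb++)
        if (cost[nbands][cb] < cost[nbands][bestcb])
            bestcb = cb;
    total = cost[nbands][bestcb];
    for (i = nbands; i > 0; i--) {
        band_type[i - 1] = bestcb;
        bestcb = prev[i][bestcb];
    }
    return total;
}
```

With a cheap header the optimum switches codebooks per band; with an expensive one it prefers a single run, which is exactly the trade-off the run-length coder above exposes.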
 
> > +        }
> > +        if(count){
> > +            while(count >= esc){
> > +                put_bits(&s->pb, bits, esc);
> > +                count -= esc;
> > +            }
> > +            put_bits(&s->pb, bits, count);
> > +        }
> > +    }
> > +}
> 
> > +
> > +/**
> > + * Encode scalefactors.
> > + */
> > +static void encode_scale_factor_data(AVCodecContext *avctx, AACEncContext *s, ChannelElement *cpe, int channel)
> 
> s/_data/s/
> 
> 
> > +{
> > +    int off = cpe->ch[channel].mixing_gain, diff;
> > +    int i, w;
> > +
> > +    for(w = 0; w < cpe->ch[channel].ics.num_windows; w++){
> > +        if(cpe->ch[channel].ics.group_len[w]) continue;
> > +        for(i = 0; i < cpe->ch[channel].ics.max_sfb; i++){
> > +            if(!cpe->ch[channel].zeroes[w][i]){
> > +                diff = cpe->ch[channel].sf_idx[w][i] - off + SCALE_DIFF_ZERO;
> 
> > +                if(diff < 0 || diff > 120) av_log(avctx, AV_LOG_ERROR, "Scalefactor difference is too big to be coded\n");
> 
> assert() ?

Is assert() recommended here?
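For reference, the 0..120 range comes from the differential coding of scalefactors: each one is coded as a difference from the previous value, offset by SCALE_DIFF_ZERO = 60 so it indexes the scalefactor Huffman table. A small sketch of that running delta (the function name is illustrative; FFmpeg's tables are ff_aac_scalefactor_bits/codes):

```c
#define SCALE_DIFF_ZERO 60  /* offset so that "no change" maps to index 60 */

/* Delta-code one scalefactor against the previously coded value (the
 * chain starts from the global gain).  The result must land in 0..120
 * to be representable by the scalefactor codebook. */
static int scalefactor_code(int sf_idx, int *prev)
{
    int diff = sf_idx - *prev + SCALE_DIFF_ZERO;
    *prev = sf_idx;  /* the next band is coded against this one */
    return diff;
}
```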

[...] 
> > +    .capabilities = CODEC_CAP_SMALL_LAST_FRAME,
> > +};
> 
> I am not sure, but maybe this needs CODEC_CAP_DELAY as well

seems so 
 
[...]
> > +
> > +//borrowed from aac.c
> > +static float pow2sf_tab[340];
> 
> duplicate
 
will get rid of it
 
[...]
> > +static av_always_inline float lowpass_iir_filter(LPFilterState *s, const float *coeffs, float in)
> > +{
> > +    memmove(s->x, s->x + 1, sizeof(s->x) - sizeof(s->x[0]));
> > +    memmove(s->y, s->y + 1, sizeof(s->y) - sizeof(s->y[0]));
> 
> no we will not do memmoves in av_always_inline functions

loop then 
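For the record, a plain loop over the delay lines does the same state shift; with a small compile-time trip count the compiler can fully unroll it inside the always-inline filter. A sketch, reusing the patch's IIR_ORDER and a minimal stand-in for LPFilterState:

```c
#define IIR_ORDER 4

/* Minimal stand-in for the patch's filter state: delay lines for the
 * input (x) and output (y) samples. */
typedef struct LPFilterState {
    float x[IIR_ORDER + 1];
    float y[IIR_ORDER + 1];
} LPFilterState;

/* Shift both delay lines by one sample with a loop instead of memmove(). */
static inline void shift_state(LPFilterState *s)
{
    int i;
    for (i = 0; i < IIR_ORDER; i++) {
        s->x[i] = s->x[i + 1];
        s->y[i] = s->y[i + 1];
    }
}
```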
 
> > +    s->x[IIR_ORDER] = in * coeffs[1];
> > +    //FIXME: made only for 4th order filter
> > +    s->y[IIR_ORDER] = (s->x[0] + s->x[4])*1 + (s->x[1] + s->x[3])*4 + s->x[2]*6
> > +                    + coeffs[2]*s->y[0] + coeffs[3]*s->y[1] + coeffs[4]*s->y[2] + coeffs[5]*s->y[3];
> > +    return s->y[IIR_ORDER];
> > +}
> 
> the lowpass code belongs in a separate file and patch, and should probably
> be applied outside the codec, unless of course it uses some feedback from the
> psy model ...
> besides lowpass, things should be run through a highpass filter to remove
> the DC component; it's inaudible and expensive to represent after the MDCT transform

OK, I will try to prepare such a patch.
Any pointers on how to remove the DC? Zeroing channel->coeffs[ch][0] = 0.0f seems enough to me.
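One caveat with zeroing coeffs[0]: it only removes the DC of each windowed block, not of the continuous signal. A common time-domain alternative is a first-order DC blocker, y[n] = x[n] - x[n-1] + R*y[n-1], with R close to 1 setting the cutoff. A sketch (struct and names are illustrative, not from the patch):

```c
/* First-order DC-blocking highpass.  R controls the cutoff: closer to 1
 * means a lower cutoff and slower DC decay. */
typedef struct DCBlocker { float x1, y1; } DCBlocker;

static float dc_block(DCBlocker *s, float in, float R)
{
    float out = in - s->x1 + R * s->y1;
    s->x1 = in;   /* previous input  */
    s->y1 = out;  /* previous output */
    return out;
}
```

Fed a constant (pure DC) input, the output decays geometrically toward zero while AC content passes through nearly unchanged.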
 
> [...]
> 
> > +void ff_aac_psy_preprocess(AACPsyContext *ctx, int16_t *audio, int16_t *dest, int tag, int type)
> > +{
> > +    int chans = type == ID_CPE ? 2 : 1;
> > +    const int chstride = ctx->avctx->channels;
> > +    int i, ch;
> > +    float t[2];
> > +
> 
> > +    if(chans == 1 || (ctx->flags & PSY_MODEL_NO_PREPROC) == PSY_MODEL_NO_PREPROC){
> > +        for(ch = 0; ch < chans; ch++){
> > +            for(i = 0; i < 1024; i++){
> > +                dest[i * chstride + ch] = audio[i * chstride + ch];
> > +            }
> > +        }
> 
> memcpy()
> 
> 
> > +    }else{
> > +        for(i = 0; i < 1024; i++){
> > +            if(ctx->flags & PSY_MODEL_NO_ST_ATT){
> > +                for(ch = 0; ch < 2; ch++)
> > +                    t[ch] = audio[i * chstride + ch];
> 
> > +            }else{
> > +                t[0] = audio[i * chstride + 0] * (0.5 + ctx->stereo_att) + audio[i * chstride + 1] * (0.5 - ctx->stereo_att);
> > +                t[1] = audio[i * chstride + 0] * (0.5 - ctx->stereo_att) + audio[i * chstride + 1] * (0.5 + ctx->stereo_att);
> > +            }
> 
> this stereo attenuation looks like it should be in a filter before the codec

Logically it is, as it's applied to the raw audio before it's used by the encoder,
but I will gladly move that code to libavfilter when possible.
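Pulled out of the interleaved-sample loop, the attenuation is a simple symmetric cross-mix: att = 0.5 leaves the channels untouched, and att = 0 collapses both to their mid (mono) signal. A standalone sketch on planar buffers (the function name is illustrative):

```c
/* Stereo attenuation as in the patch's psy preprocessing: mix a fraction
 * of the opposite channel into each channel.  att in [0, 0.5]. */
static void stereo_attenuate(const float *l_in, const float *r_in,
                             float *l_out, float *r_out, int n, float att)
{
    int i;
    for (i = 0; i < n; i++) {
        l_out[i] = l_in[i] * (0.5f + att) + r_in[i] * (0.5f - att);
        r_out[i] = l_in[i] * (0.5f - att) + r_in[i] * (0.5f + att);
    }
}
```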
 
[...]
> > +    AAC_NB_PSY_MODELS          ///< total number of psychoacoustic models
> > +};
> 
> make it clear that this is not part of the API, due to binary compatibility
> when a new model is added
 
ok
 
> > +
> > +enum AACPsyModelMode{
> > +    PSY_MODE_CBR,              ///< follow bitrate as closely as possible
> > +    PSY_MODE_ABR,              ///< try to achieve bitrate but actual bitrate may differ significantly
> > +    PSY_MODE_QUALITY,          ///< try to achieve set quality instead of bitrate
> > +};
> 
> please use the bitrate_tolerance parameter for closeness of filesize matching
> and the buffer_size / max and min bitrates for strict CBR
> It might also be interesting to consider using ratecontrol.c

I will try; the actual code for it is in the model anyway.
 
> > +
> > +#define PSY_MODEL_MODE_MASK  0x0000000F ///< bit fields for storing mode (CBR, ABR, VBR)
> > +#define PSY_MODEL_NO_PULSE   0x00000010 ///< disable pulse searching
> > +#define PSY_MODEL_NO_SWITCH  0x00000020 ///< disable window switching
> > +#define PSY_MODEL_NO_ST_ATT  0x00000040 ///< disable stereo attenuation
> > +#define PSY_MODEL_NO_LOWPASS 0x00000080 ///< disable low-pass filtering
> > +
> > +#define PSY_MODEL_NO_PREPROC (PSY_MODEL_NO_ST_ATT | PSY_MODEL_NO_LOWPASS)
> > +
> > +#define PSY_MODEL_MODE(a)  ((a) & PSY_MODEL_MODE_MASK)
> > +
> > +/**
> > + * context used by psychoacoustic model
> > + */
> > +typedef struct AACPsyContext {
> > +    AVCodecContext *avctx;
> 
> > +    DSPContext dsp;
> 
> what does the psy model use dsp for?
> I am not very happy about this being duplicated between the encoder and the psy model

Nowhere, I will drop it.
 
> > +
> > +    int flags;
> > +    int window_type[2];
> > +    int window_shape[2];
> > +    const uint8_t *bands1024;
> > +    int num_bands1024;
> > +    const uint8_t *bands128;
> > +    int num_bands128;
> > +
> > +    const struct AACPsyModel *model;
> > +    void* model_priv_data;
> > +
> > +    float stereo_att;
> > +    int   cutoff;
> > +    void* lp_state;
> > +}AACPsyContext;
> 
> document the fields please; also true for all other structs where fields
> do not have an immediately obvious meaning.

No problem.

I'll send actual patches later.
 
> [...]
> -- 
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
> 
> Concerning the gods, I have no means of knowing whether they exist or not
> or of what sort they may be, because of the obscurity of the subject, and
> the brevity of human life -- Protagoras



