[FFmpeg-devel] [PATCH] WMA Voice decoder

Sun Jan 31 00:48:31 CET 2010

On Sat, Jan 30, 2010 at 05:06:45PM -0500, Ronald S. Bultje wrote:
> +static const struct frame_type_desc {
> +    short   n_blocks,     ///< amount of blocks per frame
> +                          ///< (each block contains 160/#n_blocks samples)
> +            log_n_blocks, ///< log2(n_blocks)
> +            acb_type,     ///< Adaptive codebook type in frame/block:
> +                          ///< - 0: no adaptive codebook (only hardcoded fixed)
> +                          ///< - 1: adaptive codebook with per-frame pitch,
> +                          ///< - 2: adaptive codebook with per-block pitch
> +            fcb_type,     ///< Fixed codebook type in frame/block:
> +                          ///< -   0: hardcoded codebook, per-frame gain,
> +                          ///<        is used as comfort noise during silence,
> +                          ///< -   1: hardcoded codebook, per-block gain,
> +                          ///< -   2: pitch-adaptive window (AW) pulse signal,
> +                          ///< - 3-5: innovation (fixed) codebook pulses with
> +                          ///<        different combinations of single/double
> +                          ///<        pulses (see #dbl_pulses)
> +            dbl_pulses,   ///< how many pulse vectors have pulse pairs
> +                          ///< (rather than just one single pulse)
> +                          ///< only if #fcb_type >= 3 && <= 5
> +            frame_size;   ///< the amount of bits that make up the block
> +                          ///< data (per frame)

"short"? Also specifying the type only once is ugly.
Also if you use frame_size with a shift they all can be uint8_t
(though probably it's better to just "waste" those 34 bytes and
leave frame_size 16 bits, but the others don't need to be.

> +    uint8_t sframe_cache[SFRAME_CACHE_SIZE + FF_INPUT_BUFFER_PADDING_SIZE];
> +                                  ///< cache for superframe data split over
> +                                  ///< multiple packets
> +    PutBitContext pb;             ///< points into #sframe_cache

"points" is a strange wording.

> +    int cache_sframe_size;        ///< set to >0 if we have data from an

I'd put that right next to sframe_cache.
Also having cache_sframe_size and SFRAME_CACHE_SIZE is a bit confusing,
the macro maybe should have a "MAX", it should be consistently either
sframe_cache or cache_sframe.

> +    short aw_n_pulses[2];         ///< number of AW-pulses in each block
> +    short aw_first_pulse_off[2];  ///< index of first sample to which to
> +                                  ///< apply AW-pulses, or -0xff if unset

More shorts...

> +    float gain_pred_err[6];       ///< cache for future gain prediction
> +    float excitation_history[MAX_SIGNAL_HISTORY];
> +                                  ///< cache of the signal of previous
> +                                  ///< superframes, used as a history for
> +                                  ///< future signal generation

"future ..." sounds a bit strange to me, I would understand that to mean
an unimplemented feature, but that's probably not meant.

> +static av_cold int decode_vbmtree(GetBitContext *gb, uint8_t *vbm_tree)
> +{
> +    unsigned int cntr[8], n, res;
> +
> +    memset(vbm_tree, -1, sizeof(uint8_t) * 25);

Personally I consider 0xff nicer with memset
Also sizeof(uint8_t) sure is pointless.
And maybe change
uint8_t *vbm_tree to uint8_t vbm_tree[25] to document the expected size?

> +    memset(cntr,      0, sizeof(cntr));
> +    for (n = 0; n < 17; n++) {
> +        res = get_bits(gb, 3);
> +        if (cntr[res] >= 3 + (res == 7))
> +            return -1;

What is the reason this can't just be cntr[res] > 3?
(assuming performance matters, documenting can't hurt
either way).

> +        vbm_tree[res * 3 + cntr[res]++] = n;
> +    }

Making cntr unsigned seems like quite some overkill (I think we generally
use int unless we need to use unsigned, right?).

> +    /**
> +     * Extradata layout:
> +     * - byte  0-18: WMAPro-in-WMAVoice extradata (see wmaprodec.c),
> +     * - byte 19-22: flags field (annoyingly in LE; see below for known
> +     *               values),
> +     * - byte 23-46: variable bitmode tree (really just 25 * 3 bits,
> +     *               rest is 0).
> +     */
> +    if (ctx->extradata_size != 0x2E) {
> +        av_log(ctx, AV_LOG_ERROR,
> +               "Invalid extradata size 0x%x != 0x2E\n", ctx->extradata_size);

Description in decimal + check in hex = bad

> +    if (flags & 0x1000) {

A name for that wouldn't hurt.

> +    for (n = 0; n < s->lsps; n++)
> +        s->prev_lsps[n] = M_PI * (n + 1.0) / (s->lsps + 1.0);

Btw. if this or anything else is state that is kept between frames,
you should have a flush function that resets it to avoid artefacts
after seeking.

> +    s->min_pitch_val    = ((int) ((ctx->sample_rate << 8) * 0.0025) + 50) >> 8;
> +    s->max_pitch_val    = ((int) ((ctx->sample_rate << 8) * 0.0185) + 50) >> 8;

Those floats in there look very suspicious to me.
Wouldn't / 400 and * 37 / 2000 be more reliable (assuming the *32 can not cause an overflow)?

> +        av_log(ctx, AV_LOG_ERROR, "Unsupported samplerate %d (causes history=%d, max=%d)\n",

"causes" is a strange word to use.

> +    s->block_pitch_range        = s->block_conv_table[2] +
> +                                  s->block_conv_table[3] + 1 +
> +                                  2 * (s->block_conv_table[1] -
> +                                       (2 * s->min_pitch_val));

Not very readable. Maybe also because of the pointless innermost ()

> +static int get_vbm_bits(GetBitContext *gb)
> +{
> +    int n, res;
> +
> +    for (n = 0; ; n++) {
> +        res = get_bits(gb, 2);
> +        if (res < 3 || n == 6 /** don't increase n to 7 */)
> +            break;
> +    }

So
for (n = -1; n < 7 && res < 3; n++)
or maybe better
for (n = 0; n < 8 && res < 3; n++) {
...
}
return 3 * (n - 1) + res;
or whatever
Also don't use doxy-style comments for things that doxygen can not
make any sense of.

> + * @param sizes range (well, max. value) of each quantized value in @values

well...

> +static void dequant_lsps(double *lsps, int num,
> +                         const uint16_t *values, const uint16_t *sizes,
> +                         int n_stages, const uint8_t *table,
> +                         const double *mul_q, const double *base_q)
> +{
> +    int n, m;
> +
> +    memset(lsps, 0, num * sizeof(double));

sizeof(*lsps), we should not make it needlessly hard to switch to float.

> +    for (n = 0; n < n_stages; table += sizes[n++] * num)

Only the n++ belongs inside the for(), everything else IMO is too hard to read.

> +        for (m = 0; m < num; m++)
> +            lsps[m] += base_q[n] + mul_q[n] * table[m + values[n] * num];

I consider it better to extract constant stuff like values[n]*num outside the loop.
Particularly since due to aliasing rules the compiler might not be able to
to pull e.g. the base_q[n] out without that (hm, I admit I don't know for sure
if the const might be enough for that).

> +    for (n = 0; n < 10; n++) {
> +        a1[n]      = ipol_tab[interpol][0][n] * (old[n] - i_lsps[n]) +
> +                     i_lsps[n];
> +        a1[10 + n] = ipol_tab[interpol][1][n] * (old[n] - i_lsps[n]) +
> +                     i_lsps[n];

The subtraction is duplicated.
Also if speed matters it might make sense to check if gcc handles a flattened
ipol_tab better (i.e. ipol_tab[interpol][10 + n] instead of ipol_tab[interpol][1][n]).

> +    dequant_lsps(&lsps[5],  5, &v[2], &vec_sizes[2], 2,
> +                 ff_wmavoice_dq_lsp16i2, &mul_lsf[2], &base_lsf[2]);

I think lsps + 5, v + 2 etc. is the more common style in FFmpeg.

> +#define NO_OFFSET -0xff

Why -255? And using negative hex numbers is really close to obfuscation IMO.

> +/**
> + * Parse the offset of the first pitch-adaptive window pulses, and
> + * the distribution of pulses between the two blocks in this frame.
> + * @param ctx WMA Voice decoding context
> + * @param gb bit I/O context
> + * @param pitch pitch for each block in this frame
> + */
> +static void aw_parse_coords(AVCodecContext *ctx, GetBitContext *gb,
> +                            const short *pitch)
> +{
> +    static const uint8_t start_offset[94] = {
> +          0,   2,   4,   6,   8,  10,  12,  14,  16,  18,  20,  22,
> +         24,  26,  29,  28,  30,  31,  32,  33,  34,  35,  36,  37,
> +         38,  39,  40,  41,  42,  43,  44,  46,  48,  50,  52,  54,
> +         56,  58,  60,  62,  64,  66,  68,  70,  72,  74,  76,  78,
> +         80,  82,  84,  86,  88,  90,  92,  94,  96,  98, 100, 102,
> +        104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126,
> +        128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150,
> +        152, 154, 156, 158, 160, 162, 164, 166, 168, 170

Uh, what's the magic behind this table?
And is that one non-monotonous value really correct?

> +    short off_table[11], first_idx[2];

More shorts. Get rid of them all.

> +    s->aw_idx_is_ext = 0;
> +    if ((bits = get_bits(gb, 6)) >= 54) {

IMO never ever munge assignments and ifs, it is pointless obfuscation.

> +    s->aw_pitch_range = 16 + 8 * (FFMIN(pitch[0], pitch[1]) > 32);

Both
s->aw_pitch_range = 16;
if (FFMIN(pitch[0], pitch[1]) > 32)
    s->aw_pitch_range += 8;
and
s->aw_pitch_range = FFMIN(pitch[0], pitch[1]) > 32 ? 24 : 16;
are a lot more readable.

> +    n = 0;
> +    for (offset = start_offset[bits] - 11;
> +         offset < MAX_FRAMESIZE && n < 11;
> +         offset += pitch[offset >= MAX_FRAMESIZE / 2])
> +        off_table[n++] = offset >= 0 ? offset : NO_OFFSET;
> +    while (n < 11)
> +        off_table[n++] = NO_OFFSET;

Argl *drops from chair on floor*
How about making off_table int16_t, NO_OFFSET -1
and doing
memset(off_table, 0xff, sizeof(off_table);
for (n = 0; n < 11 && offset < MAX_FRAMESIZE; n++)
{
    if (offset >= 0)
        off_table[n] = offset;
    offset += pitch[offset >= MAX_FRAMESIZE / 2];
}
Assuming pitch can be negative, else further simplification
would of course be possible.
Well, actually even if not you could probably do some stuff,
but for a loop that's only done 11 times it's not worth it I guess.

> +    if (first_idx[0] > 0)
> +        while (s->aw_first_pulse_off[0] - pitch[0] + s->aw_pitch_range > 0)
> +            s->aw_first_pulse_off[0] -= pitch[0];
> +    if (first_idx[1] > 0)
> +        while (s->aw_first_pulse_off[1] - pitch[1] + s->aw_pitch_range > 0)
> +            s->aw_first_pulse_off[1] -= pitch[1];

Code duplication?

> +    if (s->aw_n_pulses[0]) {
> +        if (block_idx == 0) {
> +            range = 32;
> +        } else { ///< block_idx == 1

no pointless doxy syntax please.

> +#define BIT_IS_SET(idx) use_mask[idx >> 5] & (1 << (idx & 31))
> +#define UNSET_BIT(idx)  use_mask[idx >> 5] &= ~(1 << (idx & 31))
> +
> +    if (pulse_off != NO_OFFSET)
> +        for (n = 0; n < 11; n++) {
> +            m = pulse_off + n * fcb->pitch_lag;
> +            for (idx = m; idx < m + s->aw_pitch_range; idx++)
> +                if (idx >= 0 && idx < MAX_FRAMESIZE / 2) UNSET_BIT(idx);

Not sure about that.

> +            for (idx = 0; idx < MAX_FRAMESIZE / 2; idx++)
> +                if (BIT_IS_SET(idx)) break;

But that looks about like the most inefficient way to scan 80 bits for
a non-zero one ever conceived.

> +        if (s->aw_pitch_range == 24) { ///< 3 pulses, 1:sign + 3:index each
> +        } else { ///< 4 pulses, 1:sign + 2:index each

I guess you can find all of those on your own.

> +            while (fcb->x[fcb->n] < 0)
> +                fcb->x[fcb->n] += fcb->pitch_lag;

What are the values here? This might be a particularly slow way to calculate
if (fcb->x[fcb->n] < 0)
    fcb->x[fcb->n] = (fcb->x[fcb->n] % fcb->pitch_lag) + fcb->pitch_lag;

> +    } else {
> +        int num2 = (val & 0x1FF) >> 1, delta, idx;
> +
> +        if (num2 < 79)       delta = 1;
> +        else if (num2 < 156) delta = 2;
> +        else if (num2 < 231) delta = 3;
> +        else                 delta = 4;
> +        idx    = (delta * delta + num2) % 80;

What?

> +        delta += delta - 1;

Ah. Still, a multiplication to calculate the square of
values 1 to 4 is maybe a bit overkill.

> +        fcb->x[fcb->n]       = idx - delta;
> +        fcb->y[fcb->n++]     = v;
> +        fcb->x[fcb->n]       = idx;
> +        fcb->y[fcb->n++]     = (val & 1) ? -v : v;

The increments on a separate line IMHO would be more readable.

> +    for (n = 0; n < size; n++)
> +        excitation[n] = ff_wmavoice_std_codebook[r_idx + n] * gain;

No dsputil function for that I guess?

> +    memset(pulses, 0, sizeof(float) * size);

sizeof(*pulses)

> +            float pulse = get_bits1(gb) ? 1.0 : -1.0;

I've seen that a lot. Maybe worth optimizing?

> +    memmove(&s->gain_pred_err[8 >> frame_desc->log_n_blocks],
> +            s->gain_pred_err,
> +            sizeof(float) * (6 - (8 >> frame_desc->log_n_blocks)));

sizeof(type) is rarely useful, maybe check them all.

> +    for (n = 0; n < 8 >> frame_desc->log_n_blocks; n++)
> +        s->gain_pred_err[n] = pred_err;

That 8 >> ... is used a lot, give it a variable with a proper name.

> +            int pitch_sh8 = (s->last_pitch_val << 8) +
> +                ((s->pitch_diff_sh16 * (block_idx * size + n)) >> 8);

Maybe split that in a few lines.

> +            ff_acelp_interpolatef(&excitation[n], &excitation[n - pitch],
> +                                  ff_wmavoice_ipol1_coeffs, 17,
> +                                  (idx >= 0) | (abs(idx) << 1), 9, 1);

Maybe FFABS? Also that kind of thing is quite common, there may be a function for it.

> +            for (n = 0; n < size; n++)
> +                excitation[n] = excitation[n - block_pitch];

Note: I think I have seen this a few times already, this applies to all cases.
This is a quite inefficient way to do it, have a look at
memcpy_backptr (and possibly optimize that, too - I think we have a few functions
available now that help like AV_COPY).
Remaining part not reviewed, I'm too tired now.