[FFmpeg-devel] Nellymoser encoder
Bartlomiej Wolowiec
bartek.wolowiec
Sun Aug 31 13:07:15 CEST 2008
Saturday 30 August 2008 18:10:41 Michael Niedermayer wrote:
> On Sat, Aug 30, 2008 at 03:42:37PM +0200, Bartlomiej Wolowiec wrote:
> > Friday 29 August 2008 22:36:10 Michael Niedermayer wrote:
> > > > > > > > +
> > > > > > > > +void apply_mdct(NellyMoserEncodeContext *s, float *in, float *coefs)
> > > > > > > > +{
> > > > > > > > +    DECLARE_ALIGNED_16(float, in_buff[NELLY_SAMPLES]);
> > > > > > > > +
> > > > > > > > +    memcpy(&in_buff[0], &in[0], NELLY_SAMPLES * sizeof(float));
> > > > > > > > +    s->dsp.vector_fmul(in_buff, ff_sine_128, NELLY_BUF_LEN);
> > > > > > > > +    s->dsp.vector_fmul_reverse(in_buff + NELLY_BUF_LEN, in_buff + NELLY_BUF_LEN,
> > > > > > > > +                               ff_sine_128, NELLY_BUF_LEN);
> > > > > > > > +    ff_mdct_calc(&s->mdct_ctx, coefs, in_buff);
> > > > > > > > +}
> > > > > > >
> > > > > > > The data is copied once in encode_frame and twice here.
> > > > > > > There is no need to copy the data 3 times.
> > > > > > > vector_fmul can be used with a single memcpy to get the data
> > > > > > > into any destination, and vector_fmul_reverse doesn't even need
> > > > > > > 1 memcpy, so overall a single memcpy is enough.
> > > > > >
> > > > > > Hope that you meant something similar to my solution.
> > > > >
> > > > > no, you still do 2 memcpy() but now the code is really messy as
> > > > > well.
> > > > >
> > > > > what you should do is, for each block of samples you get from the user:
> > > > > 1. apply one half of the window onto it with vector_fmul_reverse,
> > > > >    with the destination being some internal buffer
> > > > > 2. memcpy into the 2nd destination and apply the other half of the
> > > > >    window onto it with vector_fmul
> > > > > 3. run the mdct as appropriate on the internal buffers.
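If I follow this scheme correctly, something like the sketch below is meant
(my assumption: the two overlapping MDCT inputs live in s->buf[2] selected by
s->bufsel; names are illustrative, not the final patch code):

    /* Sketch of the single-memcpy windowing suggested above.  Each
     * incoming half-frame of NELLY_BUF_LEN floats feeds two overlapping
     * MDCTs: the falling window slope of the current one and the rising
     * slope of the next one. */
    static void window_and_mdct(NellyMoserEncodeContext *s, const float *block)
    {
        float *cur  = s->buf[s->bufsel];      /* MDCT input being finished */
        float *next = s->buf[s->bufsel ^ 1];  /* MDCT input being started  */

        /* 1. falling half-window, written straight to its destination,
         *    no copy: cur[NELLY_BUF_LEN + i] = block[i] * win[len-1-i]   */
        s->dsp.vector_fmul_reverse(cur + NELLY_BUF_LEN, block,
                                   ff_sine_128, NELLY_BUF_LEN);

        /* 2. the single memcpy, then the rising half-window in place:
         *    next[i] *= win[i]                                           */
        memcpy(next, block, NELLY_BUF_LEN * sizeof(float));
        s->dsp.vector_fmul(next, ff_sine_128, NELLY_BUF_LEN);

        /* 3. cur now holds a fully windowed frame; transform it          */
        ff_mdct_calc(&s->mdct_ctx, s->mdct_out, cur);

        s->bufsel ^= 1;
    }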
> > > >
> > > > Hmm, I considered it, but I don't understand exactly what I should
> > > > change... In the code I copy data two times:
> > > > a) in encode_frame - I convert int16_t to float and copy data to
> > > > s->buf - I need to do it somewhere because vector_fmul requires
> > > > float *. Additionally, part of the data is needed for the next call
> > > > of encode_frame
> > > > b) in apply_mdct - here I think that some additional part of the
> > > > buffer is needed.
> > > > If I understood correctly I have to get rid of a), but how do I get
> > > > access to the old data when the next call of encode_frame is
> > > > performed, and how do I call vector_fmul on int16_t?
> > >
> > > have you tried setting AVCodec.sample_fmts to SAMPLE_FMT_FLT?
> > > I think ffmpeg should support this already. If it does not work then we
> > > can keep int16 for now, which would imply more copying.
> >
> > Hmm... I tried to use SAMPLE_FMT_FLT, but something doesn't work. I made
> > only these changes:
> >
> > float *samples = data;
> > ...
> > for (i = 0; i < avctx->frame_size; i++) {
> >     s->buf[s->bufsel][i] = samples[i] * (1 << 15);
> > }
> > ...
> > .sample_fmts = (enum SampleFormat[]){SAMPLE_FMT_FLT,SAMPLE_FMT_NONE},
>
> hmm
Any idea? Or should I leave it as it is?
> > [...]
> >
> > > > +
> > > > +#define find_best(val, table, LUT, LUT_add, LUT_size) \
> > > > +    best_idx = \
> > > > +        LUT[av_clip((((int)val) >> 8) + LUT_add, 0, LUT_size - 1)]; \
> > > > +    if (abs(val - table[best_idx]) > abs(val - table[best_idx + 1])) \
> > > > +        best_idx++;
> > >
> > > (int)some_float is slow, lrintf() should be faster
> >
> > yes, but here rounding down is needed. lrintf(x-0.5) ?
>
> ok but merge that 0.5 into LUT_add
val is divided by 256, so it doesn't make a difference. I changed only (int)
to lrintf().
> > > also if val is a float instead of an int then fabs() may actually be
> > > better than abs()
> >
> > ok
> >
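Combining both suggestions (lrintf() for the cast and fabs() for abs()), the
macro would become something like this untested sketch:

    #define find_best(val, table, LUT, LUT_add, LUT_size)                       \
        best_idx = LUT[av_clip((lrintf(val) >> 8) + LUT_add, 0, LUT_size - 1)]; \
        if (fabs(val - table[best_idx]) > fabs(val - table[best_idx + 1]))      \
            best_idx++;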
> > > > +
> > > > +/**
> > > > + * Encodes NELLY_SAMPLES samples. It assumes that samples contains
> > > > + * 3 * NELLY_BUF_LEN values.
> > > > + * @param s encoder context
> > > > + * @param output output buffer
> > > > + * @param output_size size of output buffer
> > > > + */
> > > > +static void encode_block(NellyMoserEncodeContext *s,
> > > > +                         unsigned char *output, int output_size)
> > > > +{
> > > > +    PutBitContext pb;
> > > > +    int i, band, block, best_idx, power_idx = 0;
> > > > +    float power_val, power_candidate, coeff, coeff_sum;
> > > > +    int band_start, band_end;
> > > > +    float pows[NELLY_FILL_LEN];
> > > > +    int bits[NELLY_BUF_LEN];
> > > > +
> > > > +    const float C = 1.0;
> > > > +    const float D = 2.0;
> > > > +
> > > > +    apply_mdct(s);
> > > > +
> > > > +    init_put_bits(&pb, output, output_size * 8);
> > > > +
> > > >
> > > > +    band_start = 0;
> > > > +    band_end = ff_nelly_band_sizes_table[0];
> > > > +    for (band = 0; band < NELLY_BANDS; band++) {
> > > > +        coeff_sum = 0;
> > > > +        for (i = band_start; i < band_end; i++) {
> > > > +            //coeff_sum += s->mdct_out[i] * s->mdct_out[i]
> > > > +            //           + s->mdct_out[i + NELLY_BUF_LEN] * s->mdct_out[i + NELLY_BUF_LEN];
> > > > +            coeff_sum += pow(fabs(s->mdct_out[i]), D)
> > > > +                       + pow(fabs(s->mdct_out[i + NELLY_BUF_LEN]), D);
> > > > +        }
> > > > +        power_candidate =
> > > > +            //log(FFMAX(1.0, coeff_sum / (ff_nelly_band_sizes_table[band] << 7))) * 1024.0 / M_LN2;
> > > > +            C * log(FFMAX(1.0, coeff_sum / (ff_nelly_band_sizes_table[band] << 7))) * 1024.0 / log(D);
> > > > +
> > > > +        if (band) {
> > > > +            power_candidate -= power_idx;
> > > > +            find_best(power_candidate, ff_nelly_delta_table, sf_delta_lut, 37, 78);
> > > > +            put_bits(&pb, 5, best_idx);
> > > > +            power_idx += ff_nelly_delta_table[best_idx];
> > > > +        } else {
> > > > +            //base exponent
> > > > +            find_best(power_candidate, ff_nelly_init_table, sf_lut, -20, 96);
> > > > +            put_bits(&pb, 6, best_idx);
> > > > +            power_idx = ff_nelly_init_table[best_idx];
> > > > +        }
> > >
> > > the choice of power_idx/best_idx values could still be found with
> > > viterbi. It's somewhat similar to (and simpler than) our
> > > viterbi/trellis ADPCM encoder.
> >
> > Ok. I read some information about viterbi.
> > ( http://en.wikipedia.org/wiki/Viterbi_algorithm ;) ).
> > I see that the following should be assumed:
> > states - values - there will be about 2^16 of them?
> > observations - the following values of power_candidate
> > probabilities - e.g. |searched value - assumed value|^C
> > transition_probability - anything, e.g. 1
>
> Wiki is using the term "viterbi algorithm" in a very narrow sense;
> I've not seen anyone outside wiki do that.
> Wikipedians seem to love describing the same algorithm 50 times
> with different variables and attributed to different authors. I am not
> disputing that it has been reinvented >50 times in various fields, but
> it's still the same algorithm.
> I suspect several points of what is under
> "Algorithms that use dynamic programming"
such a name tells more :)
> are identical to viterbi just
> with different variable names.
>
> Considering wiki's narrow view, "viterbi" might not be the best choice as
> a reference, as it's written with a specific use case in mind that is
> different from ours. I am not sure if there is an abstract description
> outside of specific use cases on wiki.
>
> > and write it more cleverly. It seems to me that it will be quite a slow
> > algorithm.
>
> yes, if it's not written cleverly it would be.
>
> let's first agree on a little random terminology that is closer to our
> use case:
> there are the power_candidate values, which are the ideal values we want
> to store but likely won't be able to store exactly.
>
> there is a graph with bands*2^16 nodes, a column for each band and a row
> for each value.
> And we want to find the path with least "weight" through this graph from
> left to right. The weight of each node is simply the distance (squared)
> from the power_candidate. The steps we are allowed to take correspond to
> the 32 difference or 64 start values that could be stored.
>
> Without optimizations, we would thus have to keep track of up to 2^16
> paths and their weights and try each of these paths with each of the
> possible 32 differences for each band, that's
>
> 23*2^16*32 ~ 50 million operations per frame
>
> If we now limit ourselves to just 2000 nodes surrounding each
> power_candidate, then there would be just 2000*32 transitions to consider,
> but as only <=6 of the values from the difference table would fall within
> the next 2000-node area, it would be just 2000*6, thus making the overall
> operation count
>
> 23*2000*6 = 276000 operations per frame, which is IMHO an acceptable
> speed for a high quality mode.
> And it is unlikely that the globally optimal path will have nodes far
> away from the 2000-node surrounding; besides, it's easy to adjust this
> for speed vs. quality.
>
I tried to do such a version.
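Roughly, the band-limited search described above looks like this (an
illustrative sketch, not the exact code from the patch; the window size,
node layout and helper names are assumptions):

    #include <float.h>
    #include <math.h>
    #include <stdint.h>

    #define BANDS 23    /* NELLY_BANDS                        */
    #define WIN   2000  /* nodes kept around each ideal value */

    /* 64 possible 6-bit start values and 32 possible 5-bit deltas,
     * mirroring ff_nelly_init_table / ff_nelly_delta_table */
    extern const int16_t ff_nelly_init_table[64];
    extern const int16_t ff_nelly_delta_table[32];

    typedef struct {
        float cost; /* accumulated squared error of the best path here */
        int   prev; /* window slot of the predecessor in the last band */
        int   idx;  /* table index used to reach this node             */
    } Node;

    /* cand[] holds the ideal (unquantized) power values per band,
     * out_idx[] receives the chosen init/delta table indices. */
    static void trellis_search(const float *cand, int *out_idx)
    {
        static Node t[BANDS][WIN];
        int base[BANDS], band, i, j, v, nj, b;
        float d, c;

        for (band = 0; band < BANDS; band++) {
            base[band] = lrintf(cand[band]) - WIN / 2;
            for (j = 0; j < WIN; j++)
                t[band][j].cost = FLT_MAX;
        }

        /* band 0: every 6-bit start value that falls inside the window */
        for (i = 0; i < 64; i++) {
            v = ff_nelly_init_table[i];
            j = v - base[0];
            if (j >= 0 && j < WIN) {
                d = cand[0] - v;
                t[0][j].cost = d * d;
                t[0][j].idx  = i;
                t[0][j].prev = 0;
            }
        }

        /* later bands: of the 32 deltas only the few that land inside
         * the next window are ever expanded (~6 per node on average)  */
        for (band = 1; band < BANDS; band++)
            for (j = 0; j < WIN; j++) {
                if (t[band - 1][j].cost == FLT_MAX)
                    continue;
                v = base[band - 1] + j;
                for (i = 0; i < 32; i++) {
                    nj = v + ff_nelly_delta_table[i] - base[band];
                    if (nj < 0 || nj >= WIN)
                        continue;
                    d = cand[band] - (v + ff_nelly_delta_table[i]);
                    c = t[band - 1][j].cost + d * d;
                    if (c < t[band][nj].cost) {
                        t[band][nj].cost = c;
                        t[band][nj].prev = j;
                        t[band][nj].idx  = i;
                    }
                }
            }

        /* pick the cheapest end node, then walk the path backwards */
        b = 0;
        for (j = 1; j < WIN; j++)
            if (t[BANDS - 1][j].cost < t[BANDS - 1][b].cost)
                b = j;
        for (band = BANDS - 1; band >= 0; band--) {
            out_idx[band] = t[band][b].idx;
            b = t[band][b].prev;
        }
    }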
> An alternative would be, instead of keeping the N paths that are closest
> to the current power_candidate, to keep the N so-far overall best paths;
> that's what adpcm.c does, and it likely has better quality per speed,
> but it is harder to implement.
>
> > Maybe, relying only on this idea, it would be better to make transitions
> > possible to e.g. the 3 best states? (but personally I hardly see a
> > connection with viterbi here...).
>
> optimized viterbi :)
In the enclosed patch there is an attempt at dynamic allocation of
exponents. It isn't ideal, but I want to be sure that I'm not going in the
wrong direction. It's quite slow, and I don't know if it can be accelerated
more than 2-3 times. Because it's so slow and the quality is often not
really better, I suggest keeping both versions - how should the user be
allowed to choose?
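One possibility for the user choice (only a suggestion, mirroring what the
ADPCM encoders already do with their trellis option) would be to key off the
existing avctx->trellis field; encode_block_trellis is a hypothetical name:

    /* in the frame encoding path; encode_block_trellis is hypothetical */
    if (avctx->trellis)
        encode_block_trellis(s, output, output_size); /* slow, trellis search */
    else
        encode_block(s, output, output_size);         /* fast, greedy search  */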
--
Bartlomiej Wolowiec
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nellymoser5.patch
Type: text/x-diff
Size: 9903 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080831/d2719b5a/attachment.patch>