[FFmpeg-devel] [RFC] AAC Encoder, now more optimal
Michael Niedermayer
michaelni
Sat Sep 6 03:13:54 CEST 2008
On Fri, Sep 05, 2008 at 04:13:58PM +0300, Kostya wrote:
> After some time (I'd like to have more free time to spend on it though),
> I want to expose my new AAC encoder.
>
> It is slower than realtime since it uses search for optimal quantizers
> for given quality.
>
> Known weak points:
> * no bitrate management (it uses fixed quality for now)
> * no M/S detection
> * quantization is not optimal yet
>
> I haven't implemented those because it will slow encoder even more, so
> debug session will include an hour rest (and in case of M/S detection
> I don't know how to implement it) - 16sec sample coding takes 42 secs
> on my PPC G4-1.42GHz and 39 secs on Core2 1.8MHz.
>
> Any comments, suggestions, speedup tricks are extremely welcomed
> (especially the latter).
ok, so only optimization related comments
[...]
> /**
> * Encode band info for single window group bands.
> */
> static void encode_window_bands_info(AACEncContext *s, SingleChannelElement *sce,
> int win, int group_len)
> {
> BandCodingPath path[64];
> int band_bits[64][12];
> int w, swb, cb, start, start2, size;
> int i, j;
> const int max_sfb = sce->ics.max_sfb;
> const int run_bits = sce->ics.num_windows == 1 ? 5 : 3;
> const int run_esc = (1 << run_bits) - 1;
> int bits, sbits, idx, count;
> int stack[64], stack_len;
>
> start = win*128;
> for(swb = 0; swb < max_sfb; swb++){
> int maxval = 0;
> start2 = start;
> size = sce->ics.swb_sizes[swb];
> if(sce->zeroes[win*16 + swb]){
> maxval = 0;
> }else{
> for(w = 0; w < group_len; w++){
> for(i = start2; i < start2 + size; i++){
> maxval = FFMAX(maxval, FFABS(sce->icoefs[i]));
> }
> start2 += 128;
> }
> }
> sbits = calculate_band_sign_bits(s, sce, group_len, start, size);
> for(cb = 0; cb < 12; cb++){
> if(aac_cb_info[cb].maxval < maxval){
> band_bits[swb][cb] = INT_MAX;
> }else{
> band_bits[swb][cb] = calculate_band_bits(s, sce, group_len, start, size, cb);
> if(IS_CODEBOOK_UNSIGNED(cb-1)){
> band_bits[swb][cb] += sbits;
> }
> }
> }
> start += sce->ics.swb_sizes[swb];
> }
> path[0].bits = 0;
> for(i = 1; i <= max_sfb; i++)
> path[i].bits = INT_MAX;
> for(i = 0; i < max_sfb; i++){
> for(cb = 0; cb < 12; cb++){
> int sum = 0;
> for(j = 1; j <= max_sfb - i; j++){
> if(band_bits[i+j-1][cb] == INT_MAX)
> break;
> sum += band_bits[i+j-1][cb];
> bits = sum + path[i].bits + run_value_bits[sce->ics.num_windows == 8][j];
> if(bits < path[i+j].bits){
> path[i+j].bits = bits;
> path[i+j].codebook = cb;
> path[i+j].prev_idx = i;
> }
> }
> }
> }
below should be faster, but _very_ important, check that it still finds the
global optimum, it should. That said, always check all optimizations you do
to ensure they do not change anything unexpected. debuging such things later
can be very time consuming.
for(i = 0; i < max_sfb; i++){
for(cb = 0; cb < 12; cb++){
int r= last[cb].run;
if(run_value_bits[r] == run_value_bits[r+1]){
last[cb].bits += band_bits[i][cb];
last[cb].run++;
bits = band_bits[i][cb] + path[i-1].bits + run_value_bits[1];
if(bits < last[cb].bits){
last[cb].bits= bits;
last[cb].run= run;
}
}else{
int sum = 0;
last[cb].bits = INT_MAX;
for(run = 1; run<=i+1; run++){
sum += band_bits[i-run+1][cb];
bits = sum + path[i-run].bits + run_value_bits[run];
if(bits < last[cb].bits){
last[cb].bits= bits;
last[cb].run= run;
}
}
}
if(last[cb].bits < path[i].bits){
path[i].bits= last[cb].bits;
path[i].codebook= cb;
path[i].run= last[cb].run;
}
}
}
also this should be working with rate distortion values not just bits.
[...]
> typedef struct TrellisPath {
> float cost;
> int prev;
> int min_val;
> int max_val;
> } TrellisPath;
>
> static void search_for_quantizers(AACEncContext *s, ChannelElement *cpe, int channels)
> {
> int q, ch, w, w2, g, start = 0;
> int i;
> int qcoeffs[128];
> int idx;
> TrellisPath paths[256*121];
> int bandaddr[121];
> const float lambda = 5e-7f;
> int minq;
> float mincost;
>
> for(ch = 0; ch < channels; ch++){
isnt that channels loop prettier outside the function?
the rest of this function looks pretty messy and IMHO needs some cleanup
before it can be optimized, optimiztaions will add more mess and that would
not be good for a unction that should be about 1/4 the size it is.
one obvious speed issue is running cbrt() and pow() in the very innermost
loop though.
Also as it is based on approximate quantization errors and bits we cannot
try heuristics, as we have no way to test what effect the heuristics have
thats because we lack a optimal but slow reference.
a very simple optimization though that could be tried is to simple pick the
so far best node and only consider it and all nodes that are less than 19
bits worse, because the max scalefactor diff VLC has 19 bits the others
cant be on the optimal path.
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
I have never wished to cater to the crowd; for what I know they do not
approve, and what they approve I do not know. -- Epicurus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080906/dfedb953/attachment.pgp>
More information about the ffmpeg-devel
mailing list