[FFmpeg-devel] [PATCH] DNxHD/VC-3 encoder

Mon Oct 1 16:03:11 CEST 2007

On Mon, Sep 24, 2007 at 01:00:12PM +0200, Baptiste Coudurier wrote:
> Hi Michael,
> 
> Michael Niedermayer wrote:
> > Hi
> > 
> > On Fri, Sep 07, 2007 at 08:05:28PM +0200, Michael Niedermayer wrote:
> >> Hi
> >>
> >> On Fri, Sep 07, 2007 at 09:17:49AM -0600, Loren Merritt wrote:
> >>> On Fri, 7 Sep 2007, Michael Niedermayer wrote:
> >>>
> >>>> when people start using qsort you know theres something wrong in their
> >>>> encoder (or at least not optimal)
> >>>>
> >>>> so, lets first check if i understand the constraints
> >>>> * its possible to change qscale per macroblock and it doesnt cost anything
> >>>> * theres a fixed number of bits available for each frame (pre mpeg1 design)
> >>>>
> >>>> in that case the optimal solution (though not fastest) is to do a
> >>>> binary search over lambda (using the number of bits and the ones available
> >>>> to find the best lambda which results in a frame which is not too large)
> >>>>
> >>>> then for each such lambda, find the best qscale for each MB, again by
> >>>> binary search minimizing SSD + bits*lambda (that way we know the best
> >>>> qscales and how many bits the frame would need)
> >>> Your algorithm produces the same decisions as the one with qsort. I don't 
> >>> know which is faster.
> >>>
> >>> Consider the list of possible block encodings sorted by 
> >>> delta_ssd/delta_bits. Optimizing for qps at any given lambda will produce 
> >>> some prefix of that list. i.e. for any lambda, there is some 
> >>> position in the list such that all preceding entries have ssd/bits 
> >>> better than lambda, and all following entries have ssd/bits worse 
> >>> than lambda. So instead of evaluating bits(lambda) as a black box and 
> >>> searching for a specified values of bits, construct the function 
> >>> lambda(bits) and evaluate it at the specified value of bits.
> >> yes, but as far as i can see thats not what the current code does, it just
> >> tries qscale and qscale+1 if it would try all qscale then yes they would
> >> be identical
> > 
> > just to clarify my original review ...
> > i dont mind if the code uses qsort, what i was complaining about was that
> > it isnt finding the optimal (in the RD sense) solution
> > 
> > if its changed to find the optimal one or its demonstrated that the
> > PSNR/bitrate difference is negligible then iam happy with that part
> > 
> 
> I tried lagrange. I keep the qsort rc because it is useable (> real time
> on quad core cpu with variance, -5 fps for ssd), rdo one is encoding at
> 4 fps :/
> 
> Im wondering if I did screwed up the implementation because I have an
> IMHO negligeable gain with rdo.
> 
> I reused existing quant functions.
> 
> Threaded functions must return int else you get a warning.
> 
> About encode_dc, it will be complicated to reuse existing ones because:
> mpeg12 use some static tables and generation code.
> mjpeg use uint16_t for dc_codes.
> 
> About mb_var_thread, existing function use current_picture and set
> variables I don't need, mb_mean, and use me field too ...
> 
> Attached are some psnr results with one HD uncompressed sample I have.
> 
> Any other optimization because what Loren suggested earlier ?

[...]
> +static int dnxhd_init_qmat(DNXHDEncContext *ctx, int lbias, int cbias)
> +{
> +    int qscale, i;
> +    int (*qmat_luma  )[64] = av_mallocz(ctx->avctx->qmax * 64 * sizeof(*qmat_luma));
> +    int (*qmat_chroma)[64] = av_mallocz(ctx->avctx->qmax * 64 * sizeof(*qmat_chroma));
> +    uint16_t (*qmat16_luma  )[2][64] = av_mallocz(ctx->avctx->qmax * 64 * 2 * sizeof(*qmat16_luma));
> +    uint16_t (*qmat16_chroma)[2][64] = av_mallocz(ctx->avctx->qmax * 64 * 2 * sizeof(*qmat16_chroma));
> +
> +    for (qscale = 1; qscale < ctx->avctx->qmax; qscale++) {
> +        for (i = 1; i < 64; i++) {
> +            int j = ff_zigzag_direct[i];
> +            qmat_luma  [qscale][j] = (int)((UINT64_C(4) << QMAT_SHIFT) / (qscale * ctx->cid_table->  luma_weigth[i]));
> +            qmat_chroma[qscale][j] = (int)((UINT64_C(4) << QMAT_SHIFT) / (qscale * ctx->cid_table->chroma_weigth[i]));
> +
> +            qmat16_luma  [qscale][0][j] = (4 << QMAT_SHIFT_MMX) / (qscale * ctx->cid_table->  luma_weigth[i]);
> +            qmat16_chroma[qscale][0][j] = (4 << QMAT_SHIFT_MMX) / (qscale * ctx->cid_table->chroma_weigth[i]);
> +
> +            if (qmat16_luma  [qscale][0][j]==0 || qmat16_luma  [qscale][0][j]==128*256) qmat16_luma  [qscale][0][j]=128*256-1;
> +            if (qmat16_chroma[qscale][0][j]==0 || qmat16_chroma[qscale][0][j]==128*256) qmat16_chroma[qscale][0][j]=128*256-1;
> +
> +            qmat16_luma  [qscale][1][j] = ROUNDED_DIV(lbias<<(16-QUANT_BIAS_SHIFT), qmat16_luma  [qscale][0][j]);
> +            qmat16_chroma[qscale][1][j] = ROUNDED_DIV(cbias<<(16-QUANT_BIAS_SHIFT), qmat16_chroma[qscale][0][j]);
> +        }
> +    }
> +
> +    ctx->q_scale_matrixl   = qmat_luma;
> +    ctx->q_scale_matrixc   = qmat_chroma;
> +    ctx->q_scale_matrixl16 = qmat16_luma;
> +    ctx->q_scale_matrixc16 = qmat16_chroma;
> +    return 0;
> +}

duplicate of convert_matrix()

[...]
> +    dsputil_init(&ctx->m.dsp, avctx);
> +    ff_init_scantable(ctx->m.dsp.idct_permutation, &ctx->m.intra_scantable, ff_zigzag_direct);
> +
> +    ctx->m.avctx = avctx;
> +    ctx->m.dct_quantize = dct_quantize_c;

does this disable the optimized quantize funcions?
what about quantize_trelis? (no its no reason to delay the patch, making
trellis work could very well be in a later patch, iam just asking)

[...]

> +            if (ctx->slice_size[mb_y]&31)
> +                ctx->slice_size[mb_y] += 32-(ctx->slice_size[mb_y]&31);

+31)&~31;

[...]
> +static int dnxhd_mb_var_thread(AVCodecContext *avctx, void *arg)
> +{
> +    DNXHDEncContext *ctx = arg;
> +    int mb_y, mb_x;
> +    for (mb_y = ctx->m.start_mb_y; mb_y < ctx->m.end_mb_y; mb_y++) {
> +        for (mb_x = 0; mb_x < ctx->m.mb_width; mb_x++) {
> +            unsigned mb  = mb_y * ctx->m.mb_width + mb_x;
> +            uint8_t *pix = ctx->thread[0]->src_y + ((mb_y<<4) * ctx->m.linesize) + (mb_x<<4);
> +            int sum      = ctx->m.dsp.pix_sum(pix, ctx->m.linesize);
> +            int varc     = (ctx->m.dsp.pix_norm1(pix, ctx->m.linesize) - (((unsigned)(sum*sum))>>8)+500+128)>>8;

could you explain what that +500 does?

[...]
> +            if (bits&31) //padding
> +                bits += 32-(bits&31);

+31)&~31;

[...]
> +    if (ctx->interlaced && ctx->cur_field) {
> +        ctx->src_y = ctx->picture->data[0] + ctx->picture->linesize[0];
> +        ctx->src_u = ctx->picture->data[1] + ctx->picture->linesize[1];
> +        ctx->src_v = ctx->picture->data[2] + ctx->picture->linesize[2];
> +    } else {
> +        ctx->src_y = ctx->picture->data[0];
> +        ctx->src_u = ctx->picture->data[1];
> +        ctx->src_v = ctx->picture->data[2];
> +    }

id write

for(i=0; i<3; i++){
    ctx->src[i]= ctx->picture->data[i];
    if(ctx->interlaced && ctx->cur_field)
        ctx->src[i] +=ctx->picture->linesize[i];
}

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Democracy is the form of government in which you can choose your dictator
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20071001/f0a721a0/attachment.pgp>