[FFmpeg-devel] [PATCH] DNxHD/VC-3 encoder
Michael Niedermayer
michaelni
Thu Oct 4 01:26:15 CEST 2007
Hi
On Wed, Oct 03, 2007 at 10:45:04PM +0200, Baptiste Coudurier wrote:
> Hi Michael, thanks for the review,
>
> Michael Niedermayer wrote:
> > On Mon, Sep 24, 2007 at 01:00:12PM +0200, Baptiste Coudurier wrote:
> >
> >>Hi Michael,
> >>
> >>Michael Niedermayer wrote:
> >>
> >>>Hi
> >>>
> >>>On Fri, Sep 07, 2007 at 08:05:28PM +0200, Michael Niedermayer wrote:
> >>>
> >>>>Hi
> >>>>
> >>>>On Fri, Sep 07, 2007 at 09:17:49AM -0600, Loren Merritt wrote:
> >>>>
> >>>>>On Fri, 7 Sep 2007, Michael Niedermayer wrote:
> >>>>>
> >>>>>
> >>>>>>when people start using qsort you know theres something wrong in their
> >>>>>>encoder (or at least not optimal)
> >>>>>>
> >>>>>>so, lets first check if i understand the constraints
> >>>>>>* its possible to change qscale per macroblock and it doesnt cost anything
> >>>>>>* theres a fixed number of bits available for each frame (pre mpeg1 design)
> >>>>>>
> >>>>>>in that case the optimal solution (though not fastest) is to do a
> >>>>>>binary search over lambda (using the number of bits and the ones available
> >>>>>>to find the best lambda which results in a frame which is not too large)
> >>>>>>
> >>>>>>then for each such lambda, find the best qscale for each MB, again by
> >>>>>>binary search minimizing SSD + bits*lambda (that way we know the best
> >>>>>>qscales and how many bits the frame would need)
> >>>>>
> >>>>>Your algorithm produces the same decisions as the one with qsort. I don't
> >>>>>know which is faster.
> >>>>>
> >>>>>Consider the list of possible block encodings sorted by
> >>>>>delta_ssd/delta_bits. Optimizing for qps at any given lambda will produce
> >>>>>some prefix of that list. i.e. for any lambda, there is some
> >>>>>position in the list such that all preceding entries have ssd/bits
> >>>>>better than lambda, and all following entries have ssd/bits worse
> >>>>>than lambda. So instead of evaluating bits(lambda) as a black box and
> >>>>>searching for a specified values of bits, construct the function
> >>>>>lambda(bits) and evaluate it at the specified value of bits.
> >>>>
> >>>>yes, but as far as i can see thats not what the current code does, it just
> >>>>tries qscale and qscale+1 if it would try all qscale then yes they would
> >>>>be identical
> >>>
> >>>just to clarify my original review ...
> >>>i dont mind if the code uses qsort, what i was complaining about was that
> >>>it isnt finding the optimal (in the RD sense) solution
> >>>
> >>>if its changed to find the optimal one or its demonstrated that the
> >>>PSNR/bitrate difference is negligible then iam happy with that part
> >>>
> >>
> >>I tried lagrange. I keep the qsort rc because it is useable (> real time
> >>on quad core cpu with variance, -5 fps for ssd), rdo one is encoding at
> >>4 fps :/
> >>
> >>Im wondering if I did screwed up the implementation because I have an
> >>IMHO negligeable gain with rdo.
> >>
> >>I reused existing quant functions.
> >>
> >>Threaded functions must return int else you get a warning.
> >>
> >>About encode_dc, it will be complicated to reuse existing ones because:
> >>mpeg12 use some static tables and generation code.
> >>mjpeg use uint16_t for dc_codes.
> >>
> >>About mb_var_thread, existing function use current_picture and set
> >>variables I don't need, mb_mean, and use me field too ...
> >>
> >>Attached are some psnr results with one HD uncompressed sample I have.
> >>
> >>Any other optimization because what Loren suggested earlier ?
> >
> >
> > [...]
> >
> >>+static int dnxhd_init_qmat(DNXHDEncContext *ctx, int lbias, int cbias)
> >>+{
> >>+ int qscale, i;
> >>+ int (*qmat_luma )[64] = av_mallocz(ctx->avctx->qmax * 64 * sizeof(*qmat_luma));
> >>+ int (*qmat_chroma)[64] = av_mallocz(ctx->avctx->qmax * 64 * sizeof(*qmat_chroma));
> >>+ uint16_t (*qmat16_luma )[2][64] = av_mallocz(ctx->avctx->qmax * 64 * 2 * sizeof(*qmat16_luma));
> >>+ uint16_t (*qmat16_chroma)[2][64] = av_mallocz(ctx->avctx->qmax * 64 * 2 * sizeof(*qmat16_chroma));
> >>+
> >>+ for (qscale = 1; qscale < ctx->avctx->qmax; qscale++) {
> >>+ for (i = 1; i < 64; i++) {
> >>+ int j = ff_zigzag_direct[i];
> >>+ qmat_luma [qscale][j] = (int)((UINT64_C(4) << QMAT_SHIFT) / (qscale * ctx->cid_table-> luma_weigth[i]));
> >>+ qmat_chroma[qscale][j] = (int)((UINT64_C(4) << QMAT_SHIFT) / (qscale * ctx->cid_table->chroma_weigth[i]));
> >>+
> >>+ qmat16_luma [qscale][0][j] = (4 << QMAT_SHIFT_MMX) / (qscale * ctx->cid_table-> luma_weigth[i]);
> >>+ qmat16_chroma[qscale][0][j] = (4 << QMAT_SHIFT_MMX) / (qscale * ctx->cid_table->chroma_weigth[i]);
> >>+
> >>+ if (qmat16_luma [qscale][0][j]==0 || qmat16_luma [qscale][0][j]==128*256) qmat16_luma [qscale][0][j]=128*256-1;
> >>+ if (qmat16_chroma[qscale][0][j]==0 || qmat16_chroma[qscale][0][j]==128*256) qmat16_chroma[qscale][0][j]=128*256-1;
> >>+
> >>+ qmat16_luma [qscale][1][j] = ROUNDED_DIV(lbias<<(16-QUANT_BIAS_SHIFT), qmat16_luma [qscale][0][j]);
> >>+ qmat16_chroma[qscale][1][j] = ROUNDED_DIV(cbias<<(16-QUANT_BIAS_SHIFT), qmat16_chroma[qscale][0][j]);
> >>+ }
> >>+ }
> >>+
> >>+ ctx->q_scale_matrixl = qmat_luma;
> >>+ ctx->q_scale_matrixc = qmat_chroma;
> >>+ ctx->q_scale_matrixl16 = qmat16_luma;
> >>+ ctx->q_scale_matrixc16 = qmat16_chroma;
> >>+ return 0;
> >>+}
> >
> >
> > duplicate of convert_matrix()
>
> Yes, there are some tiny differences, and I need more code to readjust
> tables, and its seems bigger than duplicating convert_matrix. What
> should I do ?
does it work with all DCTs? the faan dct, the slow and fast jpeg dcts?
ive not checked but i had the feeling it wont work with all
if it works i of course prefer the smaller and simpler solution
>
> > [...]
> >
> >>+ dsputil_init(&ctx->m.dsp, avctx);
> >>+ ff_init_scantable(ctx->m.dsp.idct_permutation, &ctx->m.intra_scantable, ff_zigzag_direct);
> >>+
> >>+ ctx->m.avctx = avctx;
> >>+ ctx->m.dct_quantize = dct_quantize_c;
> >
> >
> > does this disable the optimized quantize funcions?
>
> Not really, since MPV_common_init_mmx will reassign optimized function
> later. I changed it nonetheless.
>
> > what about quantize_trelis? (no its no reason to delay the patch, making
> > trellis work could very well be in a later patch, iam just asking)
>
> Well I'd like the encoder to be commited so I have real history on the
> work. I find easier to work when code is in svn (diffs).
>
> I had a quick look at trellis but I don't understand it completely yet
> (ac_esc_length not set by mpeg2 encoder for example, and levels taken
> into account seems different than dnxhd levels, once I get it I could as
> well do it for mjpeg)
trellis for mjpeg would be cool :)
maybe we could beat jpeg2000 that way ;)
[...]
>
> How does it look like ?
[...]
> Index: libavcodec/dnxhdenc.c
> ===================================================================
> --- libavcodec/dnxhdenc.c (revision 0)
> +++ libavcodec/dnxhdenc.c (revision 0)
[...]
> +//#define DEBUG
> +#define RC_VARIANCE 1 // use variance or ssd for fast rc
> +
> +#include "avcodec.h"
> +#include "dsputil.h"
> +#include "mpegvideo.h"
> +#include "dnxhddata.h"
> +
> +int dct_quantize_c(MpegEncContext *s, DCTELEM *block, int n, int qscale, int *overflow);
totally unrelated to your patch but this needs a ff_ prefix
[...]
> +typedef struct DNXHDEncContext {
> + MpegEncContext m; ///< Used for quantization dsp functions
> +
> + AVCodecContext *avctx;
> + AVFrame *picture;
> + int cid;
> + const CIDEntry *cid_table;
> + uint8_t *msips;
whats a msips ? this could benefit from a comment i think
[...]
> + ctx->m.avctx = avctx;
> + ctx->m.mb_intra = 1;
> + ctx->m.h263_aic = 1;
> +
> + MPV_common_init_mmx(&ctx->m);
> + if (!ctx->m.dct_quantize)
> + ctx->m.dct_quantize = dct_quantize_c;
iam an idiot, i was staring at this for maybe 2 minutes, somehow i knew
it was wrong i dont understand why i didnt realize it immedeatly ...
_mmx()
you dont even know if the CPU is a x86 ...
[...]
> + if (i&2) {
> + n = 1 + (i&1);
> + ctx->m.q_intra_matrix16 = ctx->qmatrix_c16;
> + ctx->m.q_intra_matrix = ctx->qmatrix_c;
> + } else {
> + n = 0;
> + ctx->m.q_intra_matrix16 = ctx->qmatrix_l16;
> + ctx->m.q_intra_matrix = ctx->qmatrix_l;
> + }
> +
[...]
> + if (i&2) {
> + n = 1 + (i&1);
> + ctx->m.q_intra_matrix16 = ctx->qmatrix_c16;
> + ctx->m.q_intra_matrix = ctx->qmatrix_c;
> + } else {
> + n = 0;
> + ctx->m.q_intra_matrix16 = ctx->qmatrix_l16;
> + ctx->m.q_intra_matrix = ctx->qmatrix_l;
> + }
duplicated
[...]
> +static int dnxhd_find_qscale(DNXHDEncContext *ctx)
> +{
> + int bits = 0;
> + int up_step = 1;
> + int down_step = 1;
> + int last_higher = 0;
> + int last_lower = INT_MAX;
> + int qscale;
> + int x, y;
> +
> + qscale = ctx->qscale;
> + for (;;) {
> + bits = 0;
> + ctx->qscale = qscale;
> + // XXX avoid recalculating bits
> + ctx->avctx->execute(ctx->avctx, dnxhd_calc_bits_thread, (void**)&ctx->thread[0], NULL, ctx->avctx->thread_count);
> + for (y = 0; y < ctx->m.mb_height; y++) {
> + for (x = 0; x < ctx->m.mb_width; x++)
> + bits += ctx->mb_rc[qscale][y*ctx->m.mb_width+x].bits;
> + bits = (bits+31)&~31; // padding
> + if (bits > ctx->frame_bits)
> + break;
> + }
> + //dprintf(ctx->avctx, "%d, qscale %d, bits %d, frame %d\n", ctx->avctx->frame_number, qscale, bits, ctx->frame_bits);
> + if (bits < ctx->frame_bits) {
> + if (qscale == 1)
> + break;
> + if (last_higher == qscale - 1) {
> + qscale = last_higher;
> + break;
> + }
> + if (qscale < last_lower)
> + last_lower = qscale;
> + qscale -= down_step++;
this looks like it can make qscale negative if bits stays small enough
> + up_step = 1;
> + } else {
> + if (last_lower == qscale + 1)
> + break;
> + if (qscale > last_higher)
> + last_higher = qscale;
> + qscale += up_step++;
> + down_step = 1;
> + if (qscale >= ctx->avctx->qmax)
> + return -1;
> + }
this is not optimal (speed wise) IMHO
if you have a lower and upper qscale then simple bisection should be
faster
[...]
> +static int dnxhd_encode_picture(AVCodecContext *avctx, unsigned char *buf, int buf_size, void *data)
> +{
> + DNXHDEncContext *ctx = avctx->priv_data;
> + AVFrame *input_frame = data;
> + int first_field = 1;
> + int offset, i, ret;
> +
> + if (buf_size < ctx->cid_table->frame_size) {
> + av_log(avctx, AV_LOG_ERROR, "output buffer is too small to compress picture\n");
> + return -1;
> + }
> +
> + av_picture_copy((AVPicture *)ctx->picture, data, PIX_FMT_YUV422P, avctx->width, avctx->height);
why do you copy the input image?
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Good people do not need laws to tell them to act responsibly, while bad
people will find a way around the laws. -- Plato
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20071004/9e2bcedc/attachment.pgp>
More information about the ffmpeg-devel
mailing list