[FFmpeg-devel] [PATCH] RoQ video encoder (take 4)
Michael Niedermayer
michaelni
Thu Jun 14 03:39:18 CEST 2007
Hi
On Wed, Jun 13, 2007 at 12:09:41PM +0200, Vitor wrote:
> Hi
>
> Michael Niedermayer wrote:
> >Hi
> >
>
> I've followed your suggestion in the message
> http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2007-June/030406.html
> and it did decrease the complexity of the patch.
>
> I think that this encoder is still a bit verbose compared to the average
> ffmpeg encoder (it goes through all the blocks to do motion detection,
> then go through all of it again to find best codebook, etc), but fixing
> this will not give a noticeable speed gain nor make the code simpler
> (just smaller) so I don't think it is worth the work.
making it smaller is also a big advantage, less source to read, less memory
less code cache space needed ...
[...]
> >>+static void enlarge_roq_mb4(uint8_t base[3][16], uint8_t u[3][64])
> >>+{
> >>+ int x,y,cp;
> >>+
> >>+ for(y=0;y<8;y++)
> >>+ for(x=0;x<8;x++)
> >>+ for(cp=0;cp<3;cp++)
> >>+ u[cp][y*8+x] = base[cp][(y/2)*4 + (x/2)];
> >>
> >
> >the for(cp loop should be the outermost IMHO
> >also
> >
> >enlarge_roq_mb4(uint8_t base[3][4][4], uint8_t u[3][8][8])
> >
> >u[cp][y][x] = base[cp][y>>1][x>>1];
> >seems more readable
> >
> >
>
> I don't know if I understand your suggestion, but if I do, I really
> don't like passing a [3][16] vector to a function that expects a
> [3][4][4]. I agree it is more readable, but every time I look at it I
> think I spotted an error...
you understood my suggestion correctly, anyway this is not important, we
can leave this code as is if you prefer
>
> >>+
> >>+/**
> >>+ * Template code to find the codebook with the lowest distortion from an
> >>image
> >>+ */
> >>+#define GET_LOWEST_CB_ERR(FUNCT, DIM, ERR_COMMAND) \
> >>+static int FUNCT(uint8_t cluster[3][DIM], uint8_t cb[][3][DIM], int
> >>numCB, int *outIndex) \
> >>+{ \
> >>+ int diff; \
> >>+ int pick=0, lDiff; \
> >>+ int i=0; \
> >>+\
> >>+ lDiff = INT_MAX; \
> >>+\
> >>+ /* Diff against the others */ \
> >>+ for (i=0; i<numCB; i++) { \
> >>+ diff = ERR_COMMAND \
> >>+ if (diff < lDiff) { \
> >>+ lDiff = diff; \
> >>+ pick = i; \
> >>+ } \
> >>+ } \
> >>+\
> >>+ *outIndex = pick; \
> >>+ return lDiff; \
> >>+}
> >>+
> >>+
> >>+GET_LOWEST_CB_ERR(index_mb2, 4, squared_diff_mb2(cluster, cb[i]);)
> >>+GET_LOWEST_CB_ERR(index_mb4, 16, squared_diff_mb4(cluster, cb[i]);)
> >>+GET_LOWEST_CB_ERR(index_mb8, 64, squared_diff_mb8(cluster, cb[i]);)
> >>
> >
> >same question here, why is this not a normal funcion?
> >
> >
>
> I can't see a good way to avoid passing the parameter cb[][3*DIM]
> without making cb a one-dimensional vector (which I think is less readable).
i dont really agree, a 1D array looks perefectly fine to me
this macro OTOH though does not look readable at all ...
[...]
> +#define CHROMA_BIAS 2
why 2?
does it look better? or is it just because it gives chroma in 4:4:4 the same
weight as luma?
[...]
> +static void unpack_roq_qcell(uint8_t cb2[][4*3], roq_qcell *qcell, uint8_t u[4*4*3])
> +{
> + uint8_t base[4][4*3];
> + int n,cp,i;
> + static int offsets[4] = {0, 2, 8, 10};
> +
> + for(i=0;i<4;i++)
> + for(cp=0; cp<3; cp++)
> + for(n=0; n<4; n++)
> + base[i][cp+3*n]=cb2[qcell->idx[i]][cp+3*n];
for(i=0;i<4;i++)
memcpy(base[i], cb2[qcell->idx[i]], 3*4);
[...]
> +// FIXME Could use DSPContext.sse, but it is not so speed critical (used
> +// just for motion estimation).
> +static int block_sse(uint8_t **buf1, uint8_t **buf2, int x1, int y1, int x2, int y2,
> + int stride, int size)
> +{
> + int i, j, k, bias;
> + int sse=0;
> +
> + for (k=0; k<3; k++) {
> + bias = (k ? CHROMA_BIAS : 1);
> + for (i=0; i<size; i++)
indention looks wrong
[...]
> +static int eval_motion_dist(RoqContext *enc, int x, int y, int mx, int my,
> + int size)
> +{
> + if (mx < -7 || mx > 7)
> + return INT_MAX;
> +
> + if (my < -7 || my > 7)
> + return INT_MAX;
> +
> + if ((x + mx - size < 0) || (x + mx > enc->width - size))
> + return INT_MAX;
> +
> + if ((y + my - size < 0) || (y + my > enc->height - size))
> + return INT_MAX;
is (y + my - size < 0) correct? shouldnt it be y + my < 0
> +
> +
> + return block_sse(enc->frame_to_enc->data, enc->last_frame->data, x, y,
> + x+mx, y+my, enc->y_stride, size);
> +}
mx+=x;
my+=y;
if ((unsigned)mx > enc->width - size || (unsigned)my > enc->height - size)
return INT_MAX;
return block_sse(enc->frame_to_enc->data, enc->last_frame->data, x, y,
mx, my, enc->y_stride, size);
also block_sse could be split like:
block_sse(dst[0], src[0], x, y, mx, my, enc->y_stride, size)
+ CHROMA_BIAS*block_sse(dst[1], src[1], x, y, mx, my, enc->c_stride, size)
+ CHROMA_BIAS*block_sse(dst[2], src[2], x, y, mx, my, enc->c_stride, size);
[...]
> +/**
> + * Initializes cel evaluators and sets their source coordinates
> + */
> +static int create_cel_evals(RoqContext *enc, roq_tempdata_t *tempData)
> +{
> + int n,x,y;
> +
> + tempData->cel_evals = av_malloc(enc->width*enc->height/64 * sizeof(cel_evaluation_t));
> + if (!tempData->cel_evals)
> + return -1;
> +
> + memset(tempData->codebooks.usedCB2, 0, 256*sizeof(int));
> + memset(tempData->codebooks.usedCB4, 0, 256*sizeof(int));
> +
> + tempData->mainChunkSize = 0;
> + tempData->used_option[0]=
> + tempData->used_option[1]=
> + tempData->used_option[2]=
> + tempData->used_option[3]= 0;
the whole tempData is memset(0) before this function so the above zeroing is
not needed
[...]
> + tempData->cel_evals[n].sourceX = x;
> + tempData->cel_evals[n++].sourceY = y;
> +
> + tempData->cel_evals[n].sourceX = x+8;
> + tempData->cel_evals[n++].sourceY = y;
> +
> + tempData->cel_evals[n].sourceX = x;
> + tempData->cel_evals[n++].sourceY = y+8;
> +
> + tempData->cel_evals[n].sourceX = x+8;
> + tempData->cel_evals[n++].sourceY = y+8;
for(i=0; i<4; i++){
tempData->cel_evals[n ].sourceX = x * (i&1)*8;
tempData->cel_evals[n++].sourceY = y * (i&2)*4;
}
[...]
> +/**
> + * Get macroblocks from parts of the image
> + */
> +static void get_frame_mb(AVFrame *frame, int x, int y, uint8_t mb[], int dim)
> +{
> + int i, j, cp;
> + int stride = frame->linesize[0];
> +
> + for (i=0; i<dim; i++)
> + for (j=0; j<dim; j++)
> + for (cp=0; cp<3; cp++)
> + mb[i*3*dim+j*3+cp] = frame->data[cp][(y+i)*stride+x+j];
> +}
why is the planar data converted to packed?
[...]
> +#define EVAL_MOTION(MOTION) \
> + do { \
> + diff = eval_motion_dist(enc, j, i, (MOTION).d[0], \
> + (MOTION).d[1], blocksize); \
> + \
> + if (diff < lowestdiff) { \
> + lowestdiff = diff; \
> + bestpick = MOTION; \
> + } \
> + } while(0)
this can be an inline function
> +
> +static void motion_search(RoqContext *enc, int blocksize)
> +{
> + static motion_vect offsets[9] = {
static const
> + {{ 0, 0}},
> + {{ 0,-1}},
> + {{ 0, 1}},
> + {{-1, 0}},
> + {{ 1, 0}},
> + {{-1, 1}},
> + {{ 1,-1}},
> + {{-1,-1}},
> + {{ 1, 1}},
> + };
i think the { 0, 0} makes no sense as it has already been tested
[...]
> +/**
> + * Gets distortion for all options available to a subcel
> + */
> +static void gather_data_for_subcel(subcel_evaluation_t *subcel, int x,
> + int y, RoqContext *enc, roq_tempdata_t *tempData)
> +{
> + uint8_t mb4[4*4*3];
> + uint8_t mb2[2*2*3];
> + int cluster_index;
> + int i, best_dist;
> +
> + static int ssc_offsets[4][2] = {
static const
> + {0,0},
> + {2,0},
> + {0,2},
> + {2,2},
> + };
> +
> + static int bitsUsed[4] = {2, 10, 10, 34};
static const
[...]
> + best_dist = INT_MAX;
> + for (i=0; i<4; i++)
> + if (subcel->eval_dist[i]+enc->lambda*bitsUsed[i] < best_dist) {
> + subcel->best_coding = i;
> + subcel->best_bit_use = bitsUsed[i];
> + best_dist = subcel->eval_dist[i]+enc->lambda*bitsUsed[i];
> + }
> +
> +}
is the empty line between the 2 } intended?
[...]
> +
> +/**
> + * Gathers SE data for all feasible possibilities
> + */
> +static int gather_cel_data(RoqContext *enc, roq_tempdata_t *tempData)
> +{
> + int max;
> + cel_evaluation_t *cel_evals;
> +
> + max = enc->width*enc->height/64;
> + cel_evals = tempData->cel_evals;
> +
> + while (max--) {
> + gather_data_for_cel(cel_evals, enc, tempData);
> + cel_evals++;
> + }
isnt
for(i=0; i<enc->width*enc->height/64; i++)
gather_data_for_cel(cel_evals+i, enc, tempData);
more readable?
[...]
> + idx = 0;
> + for (i=0; i<256; i++) {
> + if (tempData->codebooks.usedCB2[i] || tempData->numCB4 == 256) {
> + tempData->i2f2[i] = idx;
> + tempData->f2i2[idx] = i;
> + idx++;
> + }
> + }
> + tempData->numCB2 = idx;
> +
> +
> + if(tempData->numCB4 == 256)
> + tempData->numCB2 = 256;
this if() does seem unneeded
btw, what is the purpose of the numCB4 == 256 special case above this if()?
i also think that it would be better to drop more than just the unused cbs
but rather calculate the precisse rate distortion (bits*lambda+sse)
difference which would result from droping a specific cb and drop it if
the rate distortion decreases
though this is just an idea for a future patch ...
> +
> +}
> +
> +static void write_codebooks(RoqContext *enc, roq_tempdata_t *tempData)
> +{
> + int i, j;
> +
> + /* Create codebook chunk */
> + if (tempData->numCB2) {
> + bytestream_put_le16(&enc->out_buf, RoQ_QUAD_CODEBOOK);
> +
> + bytestream_put_le32(&enc->out_buf, tempData->numCB2*6 +
> + tempData->numCB4*4);
> +
> + bytestream_put_byte(&enc->out_buf, tempData->numCB4);
> + bytestream_put_byte(&enc->out_buf, tempData->numCB2);
> +
> + for (i=0; i<tempData->numCB2; i++) {
> + for (j=0; j<4; j++)
> + bytestream_put_byte(&enc->out_buf, enc->cb2x2[tempData->f2i2[i]].y[j]);
> +
> + bytestream_put_byte(&enc->out_buf, enc->cb2x2[tempData->f2i2[i]].u);
> + bytestream_put_byte(&enc->out_buf, enc->cb2x2[tempData->f2i2[i]].v);
> + }
> +
> + for (i=0; i<tempData->numCB4; i++)
> + for (j=0; j<4; j++)
> + bytestream_put_byte(&enc->out_buf, tempData->i2f2[enc->cb4x4[tempData->f2i4[i]].idx[j]]);
> +
> + }
> +}
a uint8_t **outp= &enc->out_buf;
would make the code slightly more readable IMHO
> +
> +/* NOTE: Typecodes must be spooled AFTER arguments!! */
> +#define SPOOL_ARGUMENT(arg) \
> +do {\
> + argumentSpool[argumentSpoolLength++] = (uint8_t)(arg);\
> +} while(0)
why hide this behind a macro?
because the variable names are too long maybe? :)
> +
> +#define SPOOL_MOTION(dx, dy) \
> +do {\
> + uint8_t arg, ax, ay;\
> + ax = 8 - (uint8_t)(dx);\
> + ay = 8 - (uint8_t)(dy);\
> + arg = (uint8_t)(((ax&15)<<4) | (ay&15));\
> + SPOOL_ARGUMENT(arg);\
> +} while(0)
there is absolutely no reason why this is a macro and not a function
> +
> +
> +
> +#define SPOOL_TYPECODE(type) \
> +do {\
> + int a;\
> + typeSpool |= ((type) & 3) << (14 - typeSpoolLength);\
> + typeSpoolLength += 2;\
> + if (typeSpoolLength == 16) {\
> + bytestream_put_le16(&enc->out_buf, typeSpool); \
> + for (a=0; a<argumentSpoolLength; a++)\
> + bytestream_put_byte(&enc->out_buf, argumentSpool[a]); \
> + typeSpoolLength = 0;\
> + typeSpool = 0;\
> + argumentSpoolLength = 0;\
> + }\
> +} while(0)
instead of this complicated fifo spool thingy, why not
use
bytestream_put_byte(&out, arg); instead of
SPOOL_ARGUMENT(arg)
and
write_typecode(){
spool |= (type & 3) << (14 - spool_length);
spool_length += 2;
if (spool_length == 16){
AV_WL16(typep, spool);
spool= 0;
typep= out;
out+=2;
spool_length=0;
}
}
[...]
> + static int subcelOffsets[4][2] = {
> + {0,0},
> + {4,0},
> + {0,4},
> + {4,4},
> + };
> +
> + static int subsubcelOffsets[4][2] = {
> + {0,0},
> + {2,0},
> + {0,2},
> + {2,2},
> + };
subcelOffsets[x][0] = 4*(x&1)
subcelOffsets[x][1] = 2*(x&2)
so these tables arent really usefull IMHO, they just make the code harder
to read
> +
> + if (tempData->used_option[RoQ_ID_CCC]%2)
> + tempData->mainChunkSize+=8; //FIXME
> +
> + /* Write the video chunk header */
> + bytestream_put_le16(&enc->out_buf, RoQ_QUAD_VQ);
> + bytestream_put_le32(&enc->out_buf, tempData->mainChunkSize/8);
> + bytestream_put_byte(&enc->out_buf, 0x0);
> + bytestream_put_byte(&enc->out_buf, 0x0);
> +
> + start_buf=enc->out_buf;
> + for (i=0; i<numBlocks; i++) {
> + eval = tempData->cel_evals + i;
> +
> + x = eval->sourceX;
> + y = eval->sourceY;
> +
> + switch (eval->best_coding) {
> + case RoQ_ID_MOT:
> + SPOOL_TYPECODE(RoQ_ID_MOT);
> + break;
> +
> + case RoQ_ID_FCC:
> + SPOOL_MOTION(eval->motionX, eval->motionY);
> + SPOOL_TYPECODE(RoQ_ID_FCC);
> + ff_apply_motion_8x8(enc, x, y, eval->motionX, eval->motionY);
> + break;
> +
> + case RoQ_ID_SLD:
> + SPOOL_ARGUMENT(tempData->i2f4[eval->cbEntry]);
> + SPOOL_TYPECODE(RoQ_ID_SLD);
> +
> + qcell = enc->cb4x4 + eval->cbEntry;
> + ff_apply_vector_4x4(enc, x , y , enc->cb2x2 + qcell->idx[0]);
> + ff_apply_vector_4x4(enc, x+4, y , enc->cb2x2 + qcell->idx[1]);
> + ff_apply_vector_4x4(enc, x , y+4, enc->cb2x2 + qcell->idx[2]);
> + ff_apply_vector_4x4(enc, x+4, y+4, enc->cb2x2 + qcell->idx[3]);
> + break;
> +
> + case RoQ_ID_CCC:
> + SPOOL_TYPECODE(RoQ_ID_CCC);
> + for (j=0; j<4; j++) {
> + subX = subcelOffsets[j][0] + x;
> + subY = subcelOffsets[j][1] + y;
> +
> + switch(eval->subCels[j].best_coding) {
> + case RoQ_ID_MOT:
> + SPOOL_TYPECODE(RoQ_ID_MOT);
> + break;
> +
> + case RoQ_ID_FCC:
> + SPOOL_MOTION(eval->subCels[j].motionX,
> + eval->subCels[j].motionY);
> +
> + SPOOL_TYPECODE(RoQ_ID_FCC);
> + ff_apply_motion_4x4(enc, subX, subY,
> + eval->subCels[j].motionX,
> + eval->subCels[j].motionY);
> +
> + break;
> +
> + case RoQ_ID_SLD:
> + SPOOL_ARGUMENT(tempData->i2f4[eval->subCels[j].cbEntry]);
> + SPOOL_TYPECODE(RoQ_ID_SLD);
> +
> + qcell = enc->cb4x4 + eval->subCels[j].cbEntry;
> +
> + ff_apply_vector_2x2(enc, subX , subY ,
> + enc->cb2x2 + qcell->idx[0]);
> + ff_apply_vector_2x2(enc, subX+2, subY ,
> + enc->cb2x2 + qcell->idx[1]);
> + ff_apply_vector_2x2(enc, subX , subY+2,
> + enc->cb2x2 + qcell->idx[2]);
> + ff_apply_vector_2x2(enc, subX+2, subY+2,
> + enc->cb2x2 + qcell->idx[3]);
> + break;
> +
> + case RoQ_ID_CCC:
> + for (k=0; k<4; k++) {
> + SPOOL_ARGUMENT(tempData->i2f2[eval->subCels[j].subCels[k]]);
> + ff_apply_vector_2x2(enc,
> + subX+subsubcelOffsets[k][0],
> + subY+subsubcelOffsets[k][1],
> + enc->cb2x2 + eval->subCels[j].subCels[k]);
> + }
> + SPOOL_TYPECODE(RoQ_ID_CCC);
> + break;
> + }
SPOOL_TYPECODE(eval->subCels[j].best_coding);
[...]
> +/**
> + * Create a single YUV cell from a 2x2 section of the image
> + */
> +static inline void frame_block_to_cell(uint8_t *block, uint8_t **data,
> + int top, int left, int *stride)
> +{
> + int i, u=0, v=0;
> + static int offsets[4][2] = {{0,0},
> + {0,1},
> + {1,0},
> + {1,1}};
this can be replaced by i&1 / i&2
[...]
> + if (size == 4)
> + closest_cb = av_malloc(6*c_size*inputCount*sizeof(int));
> + else
> + closest_cb = tempdata->closest_cb2;
indention
> +
> + ff_init_elbg(points, 6*c_size, inputCount, codebook, 256, 1, closest_cb, &enc->randctx);
> + ff_do_elbg(points, 6*c_size, inputCount, codebook, 256, 1, closest_cb, &enc->randctx);
the size of the codebook should be decided based on rate distortion
(bits per codebook is known and distortion is too during elbg)
this is for a seperate patch though if you want to try it
[...]
> + idx[j] = enc->cb4x4[i].idx[j];
unused
[...]
> +typedef struct {
> + int d[2];
> +} motion_vect;
why not drop this struct and use int [2] directly?
[...]
> + double lambda;
floating point code is very very bad as it means that the output is not binary
identical between architectures or compilers so regression tests wouldnt be
possible
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
If you really think that XML is the answer, then you definitly missunderstood
the question -- Attila Kinali
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070614/34788378/attachment.pgp>
More information about the ffmpeg-devel
mailing list