[FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES

Guo, Yejun yejun.guo at intel.com
Fri Apr 9 07:12:49 EEST 2021



> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Lynne
> Sent: April 9, 2021 0:57
> To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
> AV_FRAME_DATA_BOUNDING_BOXES
> 

First of all, thanks for the quick replies. I understand that all the
discussion and comments are meant to make this patch better, thank you.

> >> >
> >> >> >> > +
> >> >> >> > +typedef struct AVBoundingBoxHeader {
> >> >> >> > +    /**
> >> >> >> > +     * Information about how the bounding box is generated.
> >> >> >> > +     * for example, the DNN model name.
> >> >> >> > +     */
> >> >> >> > +    char source[128];
> >> >> >> > +
> >> >> >> > +    /**
> >> >> >> > +     * The size of frame when it is detected.
> >> >> >> > +     */
> >> >> >> > +    int frame_width;
> >> >> >> > +    int frame_height;
> >> >> >> >
> >> >> >>
> >> >> >> Why? This side data is attached to AVFrames only, where we
> >> >> >> already have width and height.
> >> >> >>
> >> >> >
> >> >> > The detection result will be used by other filters, for example,
> >> >> > dnn_classify (see https://github.com/guoyejun/ffmpeg/tree/classify).
> >> >> >
> >> >> > The filter dnn_detect detects all the objects (cat, dog, person ...) in a
> >> >> > frame, while dnn_classify classifies one detected object (for example, person)
> >> >> > for its attribute (for example, emotion, etc.)
> >> >> >
> >> >> > The filter dnn_classify have to check if the frame size is changed after
> >> >> > it is detected, to handle the below filter chain:
> >> >> > dnn_detect -> scale -> dnn_classify
> >> >> >
> >> >>
> >> >> This doesn't look good. Why is dnn_classify needing to know
> >> >> the original frame size at all?
> >> >>
> >> >
> >> > For example, the original size of the frame is 100*100, and dnn_detect
> >> > detects a face at place (10, 10) -> (30, 40), such data will be saved in
> >> > AVBoundingBox.top/left/right/bottom.
> >> >
> >> > Then, the frame is scaled into 50*50.
> >> >
> >> > Then, dnn_classify is used to analyze the emotion of the face, it needs to
> >> > know the frame size (100*100) when it is detected, otherwise, it does not
> >> > work with just (10,10), (30,40) and 50*50.
> >> >
> >>
> >> Why can't the scale filter also rescale this side data as well?
> >>
> >
> > I'm afraid that we could not make sure all such filters (including filters in the
> > future) to do the rescale. And in the previous discussion, I got to know that
> > 'many other existing side-data types are invalidated by scaling'.
> >
> > So, we need frame_width and frame_height here.
> >
> 
> No, you don't. You just need to make sure filters which change resolution
> or do cropping also change the side data parameters.
> It's called maintainership. As-is, this won't even work with cropping,
> only with basic aspect ratio preserving scaling.
> For the lack of a better term, this is a hack.

As discussed in a previous email, when the frame size has changed, dnn_classify
(and other filters that use the detection result, for example drawbox) can
simply output a warning message telling the user what happened and skip the
classification; otherwise it would produce a wrong/weird result that confuses
the user.
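
For clarity, a minimal sketch of the check I have in mind for dnn_classify and
similar consumers. It assumes this patch series is applied (i.e.
AV_FRAME_DATA_BOUNDING_BOXES and the AVBoundingBoxHeader fields quoted above
exist); the helper name and the header path are just for illustration:

#include "libavutil/frame.h"
#include "libavutil/log.h"
#include "libavutil/boundingbox.h" /* header added earlier in this series */

/* Illustrative helper: return the bounding box header only if the frame still
 * has the size it had when detection ran; otherwise warn and return NULL so
 * the caller can skip classification. */
static AVBoundingBoxHeader *get_valid_bbox_header(void *log_ctx, const AVFrame *frame)
{
    AVFrameSideData *sd = av_frame_get_side_data(frame, AV_FRAME_DATA_BOUNDING_BOXES);
    AVBoundingBoxHeader *header;

    if (!sd)
        return NULL;

    header = (AVBoundingBoxHeader *)sd->data;
    if (header->frame_width != frame->width || header->frame_height != frame->height) {
        av_log(log_ctx, AV_LOG_WARNING,
               "frame size changed from %dx%d to %dx%d after detection, "
               "skipping classification.\n",
               header->frame_width, header->frame_height,
               frame->width, frame->height);
        return NULL;
    }
    return header;
}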

> 
> I would accept just specifying that if the frame dimensions are
> altered in any way, the side-data is no longer valid and it's up
> to users to figure that out by out of bound coordinates.
> This is what we currently do with video_enc_params.

frame_width/frame_height is not perfect (for cases such as scale down + crop +
scale up back to the same size), but it provides more information than checking
for out-of-bound coordinates; many things can still go wrong while the
coordinates stay within the frame.

If we think the user does not need this extra information in the warning
message, I'm okay with removing the fields.

I'll remove them if another comment supports the removal and there is no
objection.

> 
> 
> >> >> >> > diff --git a/libavutil/frame.h b/libavutil/frame.h
> >> >> >> > index a5ed91b20a..41e22de02a 100644
> >> >> >> > --- a/libavutil/frame.h
> >> >> >> > +++ b/libavutil/frame.h
> >> >> >> > @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
> >> >> >> >  * Must be present for every frame which should have film grain applied.
> >> >> >> >  */
> >> >> >> >  AV_FRAME_DATA_FILM_GRAIN_PARAMS,
> >> >> >> > +
> >> >> >> > +    /**
> >> >> >> > +     * Bounding boxes for object detection and classification, the data is a AVBoundingBoxHeader
> >> >> >> > +     * followed with an array of AVBoudingBox, and AVBoundingBoxHeader.nb_bboxes is the number
> >> >> >> > +     * of array element.
> >> >> >> > +     */
> >> >> >> > +    AV_FRAME_DATA_BOUNDING_BOXES,
> >> >> >> >  };
> >> >> >> >
> >> >> >>
> >> >> >> Finally, why call it a Bounding Box? It's not descriptive at all.
> >> >> >> How about "Object Classification"? It makes much more sense, it's
> >> >> >> exactly what this is. So AVObjectClassification, AVObjectClassification,
> >> >> >> AV_FRAME_DATA_OBJECT_CLASSIFICATION and so on.
> >> >> >>
> >> >> >
> >> >> > In object detection papers, bounding box is usually used.
> >> >> > We'd better use the same term, imho, thanks.
> >> >> >
> >> >>
> >> >> Not in this case, API users won't have any idea what this is or what
> >> >> it's for. This is user-facing code after all.
> >> >> Papers in fields can get away with overloading language, but we're
> >> >> trying to make a concise API. Object classification makes sense
> >> >> because this is exactly what this is.
> >> >>
> >> >
> >> > The term bounding box is dominating the domain, for example, even
> >> > HEVC spec uses this term, see page 317 of
> >> > https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201911-I!!PDF-E&type=items
> >> >
> >> > also copy some here for your convenient.
> >> > ar_bounding_box_top[ ar_object_idx[ i ] ] u(16)
> >> > ar_bounding_box_left[ ar_object_idx[ i ] ] u(16)
> >> > ar_bounding_box_width[ ar_object_idx[ i ] ] u(16)
> >> > ar_bounding_box_height[ ar_object_idx[ i ] ] u(16)
> >> >
> >> > I would prefer to use bounding box.
> >> >
> >>
> >> It's for an entirely different thing, and like I said, it's just an overloaded
> >> language because they can get away. We're trying to be generic.
> >> This side data is for detecting _and_ classifying objects. In fact, most of
> >> the structure is dedicated towards classifying. If you'd like users to actually
> >> use this, give it a good name and don't leave them guessing what this
> >> structure is by throwing vague jargon some other paper or spec has
> >> because it's close enough.
> >>
> >
> > all the people in the domain accepts bounding box, they can understand this
> > struct name easily and clearly, they might be bothered if we use another name.
> >
> > btw, AVObjectClassification confuses people who just want object detection.
> >
> 
> As I said, the name "bounding box" makes no sense once it gets overloaded
> with object classification. 

dnn_detect creates an array of bounding boxes for all detected objects, and
dnn_classify assigns attributes to a set of bounding boxes (those with the same
object id). 'Bounding box' therefore serves both detection and classification
properly.
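
To make the layout concrete, here is roughly what the side data looks like.
The field names follow what is quoted in this thread rather than being a
verbatim copy of the patch; the type of nb_bboxes and the per-box attribute
fields are assumptions:

#include <stdint.h>

/* Sketch of the payload of AV_FRAME_DATA_BOUNDING_BOXES as described above:
 * one AVBoundingBoxHeader immediately followed by nb_bboxes AVBoundingBox
 * entries. */
typedef struct AVBoundingBoxHeader {
    char source[128];    /* how the boxes were generated, e.g. the DNN model name */
    int frame_width;     /* frame size at detection time */
    int frame_height;
    uint32_t nb_bboxes;  /* number of AVBoundingBox elements that follow */
} AVBoundingBoxHeader;

typedef struct AVBoundingBox {
    int top, left, right, bottom;  /* box position in the detected frame */
    /* ... detection/classification attributes (labels, confidences, ...)
     * follow here in the real patch ... */
} AVBoundingBox;

/* dnn_detect fills the header and the box coordinates; dnn_classify adds
 * attributes to the boxes that match the requested object id. */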


> Object classification is still the main use of the filters,
> because the original proposal was to have all of this info be ffmpeg-private,
> which would forbid simple object detection.

The original proposal was to add this as side data, which is ffmpeg-public.
We then spent a lot of time discussing and trying ffmpeg-private as a
temporary method, and since it is not good to stay temporary, we have now
switched back to ffmpeg-public.

During the whole period we never had any intention to 'forbid simple object
detection', so I don't quite understand your point here.


> So I still maintain this should be called "Object classification". I'd accept
> "Object detection" as well, but definitely not "bounding box".

imho, 'Object detection' and 'Object classification' are worse; each describes
only one aspect of the struct. Users might use only the dnn_detect filter, or
they might use dnn_detect + dnn_classify together.


> 
> Since the decision was made to make the side data public, we have to make
> very sure it contains no hacks or is impossible to extend, since we don't want
> to have an
> "AV_SIDE_DATA_OBJECT_CLASSIFICATION_VERSION_2_SORRY_WE_SCREWED_
> UP"
> faster than you can say "Recursive cascade correlation artificial neural
> networks".

Sorry, I don't quite understand your point here.

'Bounding box' is designed as a general-purpose container for the
detection/classification info. It does not matter which DNN model is used, nor
whether a traditional (non-DNN) algorithm is used instead.
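
For example, any consumer (a drawbox-style overlay, a classifier, or something
non-DNN) can walk the boxes the same way, regardless of which filter or
algorithm produced them. A rough sketch, reusing the structures sketched
earlier and assuming the array immediately follows the header (it also needs
libavutil/frame.h and libavutil/log.h):

static void log_bounding_boxes(void *log_ctx, const AVFrameSideData *sd)
{
    const AVBoundingBoxHeader *header = (const AVBoundingBoxHeader *)sd->data;
    const AVBoundingBox *bboxes = (const AVBoundingBox *)(header + 1);

    for (unsigned i = 0; i < header->nb_bboxes; i++)
        av_log(log_ctx, AV_LOG_INFO, "box %u from '%s': (%d,%d)-(%d,%d)\n",
               i, header->source,
               bboxes[i].left, bboxes[i].top, bboxes[i].right, bboxes[i].bottom);
}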

I'm open to using a better name, but bounding box is still the best one for me
so far. Everyone in the domain knows the exact meaning of bounding box without
extra explanation; the term has evolved to carry exactly this meaning in this
domain.


