[FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES

Lynne dev at lynne.ee
Fri Apr 9 18:17:43 EEST 2021


Apr 9, 2021, 17:10 by yejun.guo at intel.com:

>
>
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
>> Guo, Yejun
>> Sent: 2021年4月9日 20:57
>> To: FFmpeg development discussions and patches
>> <ffmpeg-devel at ffmpeg.org>
>> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
>> AV_FRAME_DATA_BOUNDING_BOXES
>>
>>
>>
>> > -----Original Message-----
>> > From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
>> > Lynne
>> > Sent: 2021年4月9日 18:03
>> > To: FFmpeg development discussions and patches
>> > <ffmpeg-devel at ffmpeg.org>
>> > Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
>> > AV_FRAME_DATA_BOUNDING_BOXES
>> >
>> > Apr 9, 2021, 06:12 by yejun.guo at intel.com:
>> >
>> > >
>> > >
>> > >> -----Original Message-----
>> > >> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf
>> Of
>> > Lynne
>> > >> Sent: 2021年4月9日 0:57
>> > >> To: FFmpeg development discussions and patches
>> > <ffmpeg-devel at ffmpeg.org>
>> > >> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
>> > >> AV_FRAME_DATA_BOUNDING_BOXES
>> > >>
>> > >
>> > > First of all, thanks for the quick replies, I see, all the
>> > discussions/comments are to
>> > > make this patch better, thank you.
>> > >
>> > >> >> >
>> > >> >> >> >> > +
>> > >> >> >> >> > +typedef struct AVBoundingBoxHeader {
>> > >> >> >> >> > +    /**
>> > >> >> >> >> > +     * Information about how the bounding box is
>> > generated.
>> > >> >> >> >> > +     * for example, the DNN model name.
>> > >> >> >> >> > +     */
>> > >> >> >> >> > +    char source[128];
>> > >> >> >> >> > +
>> > >> >> >> >> > +    /**
>> > >> >> >> >> > +     * The size of frame when it is detected.
>> > >> >> >> >> > +     */
>> > >> >> >> >> > +    int frame_width;
>> > >> >> >> >> > +    int frame_height;
>> > >> >> >> >> >
>> > >> >> >> >>
>> > >> >> >> >> Why? This side data is attached to AVFrames only, where we
>> > >> >> >> >> already have width and height.
>> > >> >> >> >>
>> > >> >> >> >
>> > >> >> >> > The detection result will be used by other filters, for example,
>> > >> >> >> > dnn_classify (see
>> > https://github.com/guoyejun/ffmpeg/tree/classify).
>> > >> >> >> >
>> > >> >> >> > The filter dnn_detect detects all the objects (cat, dog,
>> person ...)
>> > in a
>> > >> >> >> > frame, while dnn_classify classifies one detected object (for
>> > example,
>> > >> >> person)
>> > >> >> >> > for its attribute (for example, emotion, etc.)
>> > >> >> >> >
>> > >> >> >> > The filter dnn_classify have to check if the frame size is
>> changed
>> > after
>> > >> >> >> > it is detected, to handle the below filter chain:
>> > >> >> >> > dnn_detect -> scale -> dnn_classify
>> > >> >> >> >
>> > >> >> >>
>> > >> >> >> This doesn't look good. Why is dnn_classify needing to know
>> > >> >> >> the original frame size at all?
>> > >> >> >>
>> > >> >> >
>> > >> >> > For example, the original size of the frame is 100*100, and
>> > dnn_detect
>> > >> >> > detects a face at place (10, 10) -> (30, 40), such data will be saved
>> in
>> > >> >> > AVBoundingBox.top/left/right/bottom.
>> > >> >> >
>> > >> >> > Then, the frame is scaled into 50*50.
>> > >> >> >
>> > >> >> > Then, dnn_classify is used to analyze the emotion of the face, it
>> > needs to
>> > >> >> > know the frame size (100*100) when it is detected, otherwise, it
>> > does not
>> > >> >> > work with just (10,10), (30,40) and 50*50.
>> > >> >> >
>> > >> >>
>> > >> >> Why can't the scale filter also rescale this side data as well?
>> > >> >>
>> > >> >
>> > >> > I'm afraid that we could not make sure all such filters (including
>> filters
>> > in the
>> > >> > future) to do the rescale. And in the previous discussion, I got to
>> > know that
>> > >> > 'many other existing side-data types are invalidated by scaling'.
>> > >> >
>> > >> > So, we need frame_width and frame_height here.
>> > >> >
>> > >>
>> > >> No, you don't. You just need to make sure filters which change
>> > resolution
>> > >> or do cropping also change the side data parameters.
>> > >> It's called maintainership. As-is, this won't even work with cropping,
>> > >> only with basic aspect ratio preserving scaling.
>> > >> For the lack of a better term, this is a hack.
>> > >>
>> > >
>> > > As discussed in previous email, for the frame size change case,
>> > dnn_classify
>> > > (and other filters which use the detection result, for example drawbox)
>> > can
>> > > just output a warning message to tell user what happens, and don't do
>> > the
>> > > classification, otherwise, it will give a wrong/weird result which makes
>> > the
>> > > user confused.
>> > >
>> > >>
>> > >> I would accept just specifying that if the frame dimensions are
>> > >> altered in any way, the side-data is no longer valid and it's up
>> > >> to users to figure that out by out of bound coordinates.
>> > >> This is what we currently do with video_enc_params.
>> > >>
>> > >
>> > > frame_width/frame_height is not perfect (for the cases such as: scale
>> > down
>> > > + crop + scale up to the same size), but it provides more info than the
>> > checking
>> > > of 'out of bound coordinates'. There are many other possible issues
>> > when the
>> > > coordinates are within the frame.
>> > >
>> > > If we think we'd better not let user get more info from the warning
>> > message,
>> > > I'm ok to remove them.
>> > >
>> > > I'll remove them if there's another comment supporting the removal,
>> and
>> > > there's no objection.
>> > >
>> >
>> > We definitely shouldn't include variables in public API structs
>> > that only serve to print a warning if they don't match.
>>
>> Not just 'print a warning', it also impacts the behavior of dnn_classify.
>>
>> > Especially one that's fragile and flawed like this one.
>> > Users should know that scaling or altering a frame could break
>> > this side data, and filters could detect obvious out of bounds
>> > results and report them.
>>
>> I'll remove them since it is user's responsibility.
>>
>> >
>> > In the meantime, the main scaling and cropping filters could
>> > receive separate patches to preserve metadata at a later data.
>> > This is how the avframe cropping field work - they're just metadata,
>> > and cropping/scaling filters update those.
>> >
>> >
>> > >> >> >> >> > diff --git a/libavutil/frame.h b/libavutil/frame.h
>> > >> >> >> >> > index a5ed91b20a..41e22de02a 100644
>> > >> >> >> >> > --- a/libavutil/frame.h
>> > >> >> >> >> > +++ b/libavutil/frame.h
>> > >> >> >> >> > @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
>> > >> >> >> >> >  * Must be present for every frame which should have
>> film
>> > grain
>> > >> >> applied.
>> > >> >> >> >> >  */
>> > >> >> >> >> >  AV_FRAME_DATA_FILM_GRAIN_PARAMS,
>> > >> >> >> >> > +
>> > >> >> >> >> > +    /**
>> > >> >> >> >> > +     * Bounding boxes for object detection and
>> > classification, the
>> > >> >> data is
>> > >> >> >> a
>> > >> >> >> >> AVBoundingBoxHeader
>> > >> >> >> >> > +     * followed with an array of AVBoudingBox, and
>> > >> >> >> >> AVBoundingBoxHeader.nb_bboxes is the number
>> > >> >> >> >> > +     * of array element.
>> > >> >> >> >> > +     */
>> > >> >> >> >> > +    AV_FRAME_DATA_BOUNDING_BOXES,
>> > >> >> >> >> >  };
>> > >> >> >> >> >
>> > >> >> >> >>
>> > >> >> >> >> Finally, why call it a Bounding Box? It's not descriptive at all.
>> > >> >> >> >> How about "Object Classification"? It makes much more
>> sense,
>> > it's
>> > >> >> >> >> exactly what this is. So AVObjectClassification,
>> > AVObjectClassification,
>> > >> >> >> >> AV_FRAME_DATA_OBJECT_CLASSIFICATION and so on.
>> > >> >> >> >>
>> > >> >> >> >
>> > >> >> >> > In object detection papers, bounding box is usually used.
>> > >> >> >> > We'd better use the same term, imho, thanks.
>> > >> >> >> >
>> > >> >> >>
>> > >> >> >> Not in this case, API users won't have any idea what this is or
>> > what
>> > >> >> >> it's for. This is user-facing code after all.
>> > >> >> >> Papers in fields can get away with overloading language, but
>> > we're
>> > >> >> >> trying to make a concise API. Object classification makes sense
>> > >> >> >> because this is exactly what this is.
>> > >> >> >>
>> > >> >> >
>> > >> >> > The term bounding box is dominating the domain, for example,
>> > even
>> > >> >> > HEVC spec uses this term, see page 317 of
>> > >> >> >
>> > >> >>
>> > >>
>> >
>> https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201911-I
>> > !!P
>> > >> >> DF-E&type=items
>> > >> >> >
>> > >> >> > also copy some here for your convenient.
>> > >> >> > ar_bounding_box_top[ ar_object_idx[ i ] ] u(16)
>> > >> >> > ar_bounding_box_left[ ar_object_idx[ i ] ] u(16)
>> > >> >> > ar_bounding_box_width[ ar_object_idx[ i ] ] u(16)
>> > >> >> > ar_bounding_box_height[ ar_object_idx[ i ] ] u(16)
>> > >> >> >
>> > >> >> > I would prefer to use bounding box.
>> > >> >> >
>> > >> >>
>> > >> >> It's for an entirely different thing, and like I said, it's just an
>> > overloaded
>> > >> >> language because they can get away. We're trying to be generic.
>> > >> >> This side data is for detecting _and_ classifying objects. In fact,
>> most
>> > of
>> > >> >> the structure is dedicated towards classifying. If you'd like users to
>> > actually
>> > >> >> use this, give it a good name and don't leave them guessing what
>> > this
>> > >> >> structure is by throwing vague jargon some other paper or spec
>> has
>> > >> >> because it's close enough.
>> > >> >>
>> > >> >
>> > >> > all the people in the domain accepts bounding box, they can
>> > understand this
>> > >> > struct name easily and clearly, they might be bothered if we use
>> > another
>> > >> name.
>> > >> >
>> > >> > btw, AVObjectClassification confuses people who just want object
>> > detection.
>> > >> >
>> > >>
>> > >> As I said, the name "bounding box" makes no sense once it gets
>> > overloaded
>> > >> with object classification.
>> > >>
>> > >
>> > > dnn_detect creates an array of 'bounding box' for all detected objects,
>> > and
>> > > dnn_classify assigns attributes for a set of bounding boxes (with same
>> > object
>> > > id). 'bounding box' serves both detection and classification properly.
>> > >
>> > >
>> > >> Object classification is still the main use of the filters,
>> > >> because the original proposal was to have all of this info be
>> > ffmpeg-private,
>> > >> which would forbid simple object detection.
>> > >>
>> > >
>> > > The original proposal is to add it as side data which is ffmpeg-public,
>> and
>> > then,
>> > > we spent much time discussing/trying with ffmpeg-private as an
>> > temporary
>> > > method, and since it is not good to be temporary, we now switch back
>> to
>> > > ffmpeg-public.
>> > >
>> > > During the whole period, we don't have any intention to
>> > > 'forbid simple object detection', not quite understand your point here.
>> > >
>> > >
>> > >> So I still maintain this should be called "Object classification". I'd
>> accept
>> > >> "Object detection" as well, but definitely not "bounding box".
>> > >>
>> > >
>> > > imho, ' Object detection' and ' Object classification' are worse, they just
>> > > describe one aspect of the struct. The users might just use filter
>> > dnn_detect,
>> > > they might use filters dnn_detect + dnn_classify.
>> > >
>> >
>> > The whole reason why we have this side data is to both detect
>> > _and_ classify. Keyword being _both_. Hence object detection
>> > and object classification are much better names.
>> > I am opposed to merging this without a name change.
>>
>> I understand it is not a good name with codec background, but it is a good
>> (possible best) name with object detection background.
>>
>> To make things move forward, I'll change the name to 'object detection'
>> for both dnn_detect and dnn_classify, they would be:
>> AV_FRAME_DATA_OJBECT_DETECTION
>> AVObjectDetection
>> AVObjectDetectionHeader
>>
>
> I tried to change the code with this new name, but feel like it is not
> straight-forward during the coding.
>
> Pedro, who initialized DNN module, just commented on the naming, maybe
> I can just wait for more several days to get more comments on the naming,
> thanks.
>

If you want to move forward with this patch, please send a new one
with the field and name changes.


More information about the ffmpeg-devel mailing list