[FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES

Lynne dev at lynne.ee
Fri Apr 9 13:03:15 EEST 2021


Apr 9, 2021, 06:12 by yejun.guo at intel.com:

>
>
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Lynne
>> Sent: April 9, 2021 0:57
>> To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
>> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
>> AV_FRAME_DATA_BOUNDING_BOXES
>>
>
> First of all, thanks for the quick replies. I see that all the discussions/comments
> are meant to make this patch better, thank you.
>
>> >> >
>> >> >> >> > +
>> >> >> >> > +typedef struct AVBoundingBoxHeader {
>> >> >> >> > +    /**
>> >> >> >> > +     * Information about how the bounding box is generated.
>> >> >> >> > +     * for example, the DNN model name.
>> >> >> >> > +     */
>> >> >> >> > +    char source[128];
>> >> >> >> > +
>> >> >> >> > +    /**
>> >> >> >> > +     * The size of frame when it is detected.
>> >> >> >> > +     */
>> >> >> >> > +    int frame_width;
>> >> >> >> > +    int frame_height;
>> >> >> >> >
>> >> >> >>
>> >> >> >> Why? This side data is attached to AVFrames only, where we
>> >> >> >> already have width and height.
>> >> >> >>
>> >> >> >
>> >> >> > The detection result will be used by other filters, for example,
>> >> >> > dnn_classify (see https://github.com/guoyejun/ffmpeg/tree/classify).
>> >> >> >
>> >> >> > The filter dnn_detect detects all the objects (cat, dog, person, ...) in a
>> >> >> > frame, while dnn_classify classifies one detected object (for example,
>> >> >> > person) for its attribute (for example, emotion, etc.)
>> >> >> >
>> >> >> > The filter dnn_classify has to check whether the frame size has changed
>> >> >> > since detection, to handle the below filter chain:
>> >> >> > dnn_detect -> scale -> dnn_classify
>> >> >> >
>> >> >>
>> >> >> This doesn't look good. Why does dnn_classify need to know
>> >> >> the original frame size at all?
>> >> >>
>> >> >
>> >> > For example, the original size of the frame is 100*100, and dnn_detect
>> >> > detects a face at position (10, 10) -> (30, 40); such data will be saved in
>> >> > AVBoundingBox.top/left/right/bottom.
>> >> >
>> >> > Then, the frame is scaled down to 50*50.
>> >> >
>> >> > Then, dnn_classify is used to analyze the emotion of the face; it needs to
>> >> > know the frame size (100*100) at detection time, otherwise it cannot
>> >> > interpret (10,10), (30,40) against the 50*50 frame.
>> >> >
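To make the arithmetic concrete, here is a minimal sketch of the mapping being described, using a simplified stand-in for the box coordinates (the real AVBoundingBox from this patch carries more fields, and the helper name is made up):

/* Simplified stand-in for the coordinate part of the proposed AVBoundingBox. */
typedef struct BBox { int top, left, right, bottom; } BBox;

/* Map a box detected on a det_w x det_h frame onto the current cur_w x cur_h
 * frame. With the example above (detected on 100x100, frame now 50x50),
 * (10,10)-(30,40) maps to (5,5)-(15,20). */
static void map_bbox(BBox *b, int det_w, int det_h, int cur_w, int cur_h)
{
    b->left   = b->left   * cur_w / det_w;
    b->right  = b->right  * cur_w / det_w;
    b->top    = b->top    * cur_h / det_h;
    b->bottom = b->bottom * cur_h / det_h;
}

Without det_w/det_h (the proposed frame_width/frame_height), dnn_classify cannot compute this mapping from the 50*50 frame alone, which is the argument for carrying them in the side data.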
>> >>
>> >> Why can't the scale filter rescale this side data as well?
>> >>
>> >
>> > I'm afraid we cannot make sure all such filters (including future filters)
>> > do the rescale. And in the previous discussion, I got to know that
>> > 'many other existing side-data types are invalidated by scaling'.
>> >
>> > So, we need frame_width and frame_height here.
>> >
>>
>> No, you don't. You just need to make sure filters which change resolution
>> or do cropping also change the side data parameters.
>> It's called maintainership. As-is, this won't even work with cropping,
>> only with basic aspect ratio preserving scaling.
>> For lack of a better term, this is a hack.
>>
>
> As discussed in the previous email, for the frame-size-change case, dnn_classify
> (and other filters which use the detection result, for example drawbox) can
> just output a warning message to tell the user what happened and skip the
> classification; otherwise, it would give a wrong/weird result which confuses
> the user.
>
>>
>> I would accept just specifying that if the frame dimensions are
>> altered in any way, the side-data is no longer valid and it's up
>> to users to figure that out from out-of-bounds coordinates.
>> This is what we currently do with video_enc_params.
>>
>
> frame_width/frame_height are not perfect (for cases such as: scale down
> + crop + scale up to the same size), but they provide more info than checking
> for 'out of bound coordinates'. There are many other possible issues even when
> the coordinates are within the frame.
>
> If we think it's better not to let the user get more info from the warning
> message, I'm ok to remove them.
>
> I'll remove them if there's another comment supporting the removal, and
> there's no objection.
>

We definitely shouldn't include variables in public API structs
that only serve to print a warning if they don't match.
Especially ones as fragile and flawed as these.
Users should know that scaling or altering a frame could break
this side data, and filters could detect obviously out-of-bounds
results and report them.
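For illustration only, a consumer filter's sanity check could be as simple as the following sketch (the function name is made up; only the frame dimensions and the box coordinates from the patch are assumed):

/* Return 1 if a box is obviously invalid for a frame of w x h pixels. */
static int bbox_out_of_bounds(int top, int left, int right, int bottom,
                              int w, int h)
{
    return left < 0 || top < 0 || right > w || bottom > h ||
           left >= right || top >= bottom;
}

A filter could then log a warning and skip such boxes instead of silently producing nonsense.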

In the meantime, the main scaling and cropping filters could
receive separate patches at a later date to preserve this metadata.
This is how the AVFrame cropping fields work - they're just metadata,
and cropping/scaling filters update them.
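As a rough sketch of what such a follow-up patch might do for cropping, using simplified stand-ins for the proposed structs (the real AVBoundingBoxHeader/AVBoundingBox have more fields, and no such update exists in any crop filter today):

#include <stdint.h>

/* Simplified stand-ins for the structs proposed in this patch. */
typedef struct BBoxHeader { uint32_t nb_bboxes; } BBoxHeader;
typedef struct BBox { int top, left, right, bottom; } BBox;

/* Shift every box when crop_x/crop_y pixels are removed from the top-left
 * corner, e.g. as a hypothetical update step in a crop filter. */
static void crop_adjust_bboxes(BBoxHeader *hdr, BBox *boxes,
                               int crop_x, int crop_y)
{
    for (uint32_t i = 0; i < hdr->nb_bboxes; i++) {
        boxes[i].left   -= crop_x;
        boxes[i].right  -= crop_x;
        boxes[i].top    -= crop_y;
        boxes[i].bottom -= crop_y;
    }
}

Boxes that end up fully outside the cropped area would additionally need to be dropped or clamped.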


>> >> >> >> > diff --git a/libavutil/frame.h b/libavutil/frame.h
>> >> >> >> > index a5ed91b20a..41e22de02a 100644
>> >> >> >> > --- a/libavutil/frame.h
>> >> >> >> > +++ b/libavutil/frame.h
>> >> >> >> > @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
>> >> >> >> >  * Must be present for every frame which should have film grain
>> >> applied.
>> >> >> >> >  */
>> >> >> >> >  AV_FRAME_DATA_FILM_GRAIN_PARAMS,
>> >> >> >> > +
>> >> >> >> > +    /**
>> >> >> >> > +     * Bounding boxes for object detection and classification;
>> >> >> >> > +     * the data is an AVBoundingBoxHeader followed by an array of
>> >> >> >> > +     * AVBoundingBox, and AVBoundingBoxHeader.nb_bboxes is the
>> >> >> >> > +     * number of array elements.
>> >> >> >> > +     */
>> >> >> >> > +    AV_FRAME_DATA_BOUNDING_BOXES,
>> >> >> >> >  };
>> >> >> >> >
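For reference, reading the proposed side data from an API user's point of view would look roughly like the sketch below, assuming the layout described in the comment above (header immediately followed by the box array); the struct and field names are taken from this patch and may still change:

#include <libavutil/frame.h>

/* Sketch: return the number of bounding boxes attached to a frame, if any. */
static int count_bboxes(const AVFrame *frame)
{
    AVFrameSideData *sd =
        av_frame_get_side_data(frame, AV_FRAME_DATA_BOUNDING_BOXES);
    if (!sd)
        return 0;
    const AVBoundingBoxHeader *hdr = (const AVBoundingBoxHeader *)sd->data;
    /* The AVBoundingBox array is assumed to start right after the header. */
    return hdr->nb_bboxes;
}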
>> >> >> >>
>> >> >> >> Finally, why call it a Bounding Box? It's not descriptive at all.
>> >> >> >> How about "Object Classification"? It makes much more sense, it's
>> >> >> >> exactly what this is. So AVObjectClassification,
>> >> >> >> AV_FRAME_DATA_OBJECT_CLASSIFICATION and so on.
>> >> >> >>
>> >> >> >
>> >> >> > In object detection papers, bounding box is usually used.
>> >> >> > We'd better use the same term, imho, thanks.
>> >> >> >
>> >> >>
>> >> >> Not in this case, API users won't have any idea what this is or what
>> >> >> it's for. This is user-facing code after all.
>> >> >> Papers in fields can get away with overloading language, but we're
>> >> >> trying to make a concise API. Object classification makes sense
>> >> >> because this is exactly what this is.
>> >> >>
>> >> >
>> >> > The term bounding box dominates the domain; for example, even the HEVC
>> >> > spec uses this term, see page 317 of
>> >> > https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201911-I!!PDF-E&type=items
>> >> >
>> >> > I also copy some here for your convenience:
>> >> > ar_bounding_box_top[ ar_object_idx[ i ] ] u(16)
>> >> > ar_bounding_box_left[ ar_object_idx[ i ] ] u(16)
>> >> > ar_bounding_box_width[ ar_object_idx[ i ] ] u(16)
>> >> > ar_bounding_box_height[ ar_object_idx[ i ] ] u(16)
>> >> >
>> >> > I would prefer to use bounding box.
>> >> >
>> >>
>> >> It's for an entirely different thing, and like I said, it's just overloaded
>> >> language that they can get away with. We're trying to be generic.
>> >> This side data is for detecting _and_ classifying objects. In fact, most of
>> >> the structure is dedicated to classification. If you'd like users to actually
>> >> use this, give it a good name and don't leave them guessing what this
>> >> structure is by throwing in vague jargon from some other paper or spec
>> >> just because it's close enough.
>> >>
>> >
>> > Everyone in the domain accepts bounding box; they can understand this
>> > struct name easily and clearly, and they might be bothered if we use
>> > another name.
>> >
>> > btw, AVObjectClassification confuses people who just want object detection.
>> >
>>
>> As I said, the name "bounding box" makes no sense once it gets overloaded
>> with object classification.
>>
>
> dnn_detect creates an array of 'bounding box' entries for all detected objects,
> and dnn_classify assigns attributes to a set of bounding boxes (with the same
> object id). 'bounding box' serves both detection and classification properly.
>
>
>> Object classification is still the main use of the filters,
>> because the original proposal was to have all of this info be ffmpeg-private,
>> which would forbid simple object detection.
>>
>
> The original proposal was to add it as side data which is ffmpeg-public; then
> we spent much time discussing/trying ffmpeg-private as a temporary method,
> and since it is not good to stay temporary, we now switch back to ffmpeg-public.
>
> During the whole period, we never had any intention to
> 'forbid simple object detection'; I don't quite understand your point here.
>
>
>> So I still maintain this should be called "Object classification". I'd accept
>> "Object detection" as well, but definitely not "bounding box".
>>
>
> imho, 'Object detection' and 'Object classification' are worse; they each
> describe only one aspect of the struct. Users might just use the filter
> dnn_detect, or they might use the filters dnn_detect + dnn_classify.
>

The whole reason why we have this side data is to both detect
_and_ classify. Keyword being _both_. Hence object detection
and object classification are much better names.
I am opposed to merging this without a name change.


>> Since the decision was made to make the side data public, we have to make
>> very sure it contains no hacks and isn't impossible to extend, since we don't
>> want to have an
>> "AV_SIDE_DATA_OBJECT_CLASSIFICATION_VERSION_2_SORRY_WE_SCREWED_UP"
>> faster than you can say "Recursive cascade correlation artificial neural
>> networks".
>>
>
> Sorry, I don't quite understand your point here.
>

The point of this paragraph was to highlight the need to have other
people look at the struct rather than just you. It's called peer review.
It wasn't anything about the name at all.

