[FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES

Sat Apr 10 12:52:51 EEST 2021

> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> Lynne
> Sent: 2021å¹´4æœˆ10æ—¥ 3:20
> To: FFmpeg development discussions and patches
> <ffmpeg-devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
> AV_FRAME_DATA_BOUNDING_BOXES
> 
> Apr 9, 2021, 20:08 by bygrandao at gmail.com:
> 
> > Em sex., 9 de abr. de 2021 Ã s 12:15, Lynne <dev at lynne.ee> escreveu:
> >
> >>
> >> Apr 9, 2021, 16:35 by bygrandao at gmail.com:
> >>
> >> > Em sex., 9 de abr. de 2021 Ã s 01:13, Guo, Yejun <yejun.guo at intel.com>
> escreveu:
> >> >
> >> >>
> >> >>
> >> >>
> >> >> > -----Original Message-----
> >> >> > From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On
> Behalf Of Lynne
> >> >> > Sent: 2021å¹´4æœˆ9æ—¥ 0:57
> >> >> > To: FFmpeg development discussions and patches
> <ffmpeg-devel at ffmpeg.org>
> >> >> > Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
> >> >> > AV_FRAME_DATA_BOUNDING_BOXES
> >> >> >
> >> >>
> >> >> First of all, thanks for the quick replies, I see, all the
> discussions/comments are to
> >> >> make this patch better, thank you.
> >> >>
> >> >> > >> >
> >> >> > >> >> >> > +
> >> >> > >> >> >> > +typedef struct AVBoundingBoxHeader {
> >> >> > >> >> >> > +    /**
> >> >> > >> >> >> > +     * Information about how the bounding box is
> generated.
> >> >> > >> >> >> > +     * for example, the DNN model name.
> >> >> > >> >> >> > +     */
> >> >> > >> >> >> > +    char source[128];
> >> >> > >> >> >> > +
> >> >> > >> >> >> > +    /**
> >> >> > >> >> >> > +     * The size of frame when it is detected.
> >> >> > >> >> >> > +     */
> >> >> > >> >> >> > +    int frame_width;
> >> >> > >> >> >> > +    int frame_height;
> >> >> > >> >> >> >
> >> >> > >> >> >>
> >> >> > >> >> >> Why? This side data is attached to AVFrames only, where
> we
> >> >> > >> >> >> already have width and height.
> >> >> > >> >> >>
> >> >> > >> >> >
> >> >> > >> >> > The detection result will be used by other filters, for
> example,
> >> >> > >> >> > dnn_classify (see
> https://github.com/guoyejun/ffmpeg/tree/classify).
> >> >> > >> >> >
> >> >> > >> >> > The filter dnn_detect detects all the objects (cat, dog,
> person ...) in a
> >> >> > >> >> > frame, while dnn_classify classifies one detected object (for
> example,
> >> >> > >> person)
> >> >> > >> >> > for its attribute (for example, emotion, etc.)
> >> >> > >> >> >
> >> >> > >> >> > The filter dnn_classify have to check if the frame size is
> changed after
> >> >> > >> >> > it is detected, to handle the below filter chain:
> >> >> > >> >> > dnn_detect -> scale -> dnn_classify
> >> >> > >> >> >
> >> >> > >> >>
> >> >> > >> >> This doesn't look good. Why is dnn_classify needing to know
> >> >> > >> >> the original frame size at all?
> >> >> > >> >>
> >> >> > >> >
> >> >> > >> > For example, the original size of the frame is 100*100, and
> dnn_detect
> >> >> > >> > detects a face at place (10, 10) -> (30, 40), such data will be
> saved in
> >> >> > >> > AVBoundingBox.top/left/right/bottom.
> >> >> > >> >
> >> >> > >> > Then, the frame is scaled into 50*50.
> >> >> > >> >
> >> >> > >> > Then, dnn_classify is used to analyze the emotion of the face,
> it needs to
> >> >> > >> > know the frame size (100*100) when it is detected, otherwise,
> it does not
> >> >> > >> > work with just (10,10), (30,40) and 50*50.
> >> >> > >> >
> >> >> > >>
> >> >> > >> Why can't the scale filter also rescale this side data as well?
> >> >> > >>
> >> >> > >
> >> >> > > I'm afraid that we could not make sure all such filters (including
> filters in the
> >> >> > > future) to do the rescale. And in the previous discussion, I got to
> know that
> >> >> > > 'many other existing side-data types are invalidated by scaling'.
> >> >> > >
> >> >> > > So, we need frame_width and frame_height here.
> >> >> > >
> >> >> >
> >> >> > No, you don't. You just need to make sure filters which change
> resolution
> >> >> > or do cropping also change the side data parameters.
> >> >> > It's called maintainership. As-is, this won't even work with
> cropping,
> >> >> > only with basic aspect ratio preserving scaling.
> >> >> > For the lack of a better term, this is a hack.
> >> >>
> >> >> As discussed in previous email, for the frame size change case,
> dnn_classify
> >> >> (and other filters which use the detection result, for example
> drawbox) can
> >> >> just output a warning message to tell user what happens, and don't
> do the
> >> >> classification, otherwise, it will give a wrong/weird result which
> makes the
> >> >> user confused.
> >> >>
> >> >> >
> >> >> > I would accept just specifying that if the frame dimensions are
> >> >> > altered in any way, the side-data is no longer valid and it's up
> >> >> > to users to figure that out by out of bound coordinates.
> >> >> > This is what we currently do with video_enc_params.
> >> >>
> >> >> frame_width/frame_height is not perfect (for the cases such as: scale
> down
> >> >> + crop + scale up to the same size), but it provides more info than
> the checking
> >> >> of 'out of bound coordinates'. There are many other possible issues
> when the
> >> >> coordinates are within the frame.
> >> >>
> >> >> If we think we'd better not let user get more info from the warning
> message,
> >> >> I'm ok to remove them.
> >> >>
> >> >> I'll remove them if there's another comment supporting the removal,
> and
> >> >> there's no objection.
> >> >>
> >> >> >
> >> >> >
> >> >> > >> >> >> > diff --git a/libavutil/frame.h b/libavutil/frame.h
> >> >> > >> >> >> > index a5ed91b20a..41e22de02a 100644
> >> >> > >> >> >> > --- a/libavutil/frame.h
> >> >> > >> >> >> > +++ b/libavutil/frame.h
> >> >> > >> >> >> > @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
> >> >> > >> >> >> >  * Must be present for every frame which should have
> film grain
> >> >> > >> applied.
> >> >> > >> >> >> >  */
> >> >> > >> >> >> >  AV_FRAME_DATA_FILM_GRAIN_PARAMS,
> >> >> > >> >> >> > +
> >> >> > >> >> >> > +    /**
> >> >> > >> >> >> > +     * Bounding boxes for object detection and
> classification, the
> >> >> > >> data is
> >> >> > >> >> a
> >> >> > >> >> >> AVBoundingBoxHeader
> >> >> > >> >> >> > +     * followed with an array of AVBoudingBox, and
> >> >> > >> >> >> AVBoundingBoxHeader.nb_bboxes is the number
> >> >> > >> >> >> > +     * of array element.
> >> >> > >> >> >> > +     */
> >> >> > >> >> >> > +    AV_FRAME_DATA_BOUNDING_BOXES,
> >> >> > >> >> >> >  };
> >> >> > >> >> >> >
> >> >> > >> >> >>
> >> >> > >> >> >> Finally, why call it a Bounding Box? It's not descriptive at
> all.
> >> >> > >> >> >> How about "Object Classification"? It makes much more
> sense, it's
> >> >> > >> >> >> exactly what this is. So AVObjectClassification,
> AVObjectClassification,
> >> >> > >> >> >> AV_FRAME_DATA_OBJECT_CLASSIFICATION and so on.
> >> >> > >> >> >>
> >> >> > >> >> >
> >> >> > >> >> > In object detection papers, bounding box is usually used.
> >> >> > >> >> > We'd better use the same term, imho, thanks.
> >> >> > >> >> >
> >> >> > >> >>
> >> >> > >> >> Not in this case, API users won't have any idea what this is or
> what
> >> >> > >> >> it's for. This is user-facing code after all.
> >> >> > >> >> Papers in fields can get away with overloading language, but
> we're
> >> >> > >> >> trying to make a concise API. Object classification makes
> sense
> >> >> > >> >> because this is exactly what this is.
> >> >> > >> >>
> >> >> > >> >
> >> >> > >> > The term bounding box is dominating the domain, for
> example, even
> >> >> > >> > HEVC spec uses this term, see page 317 of
> >> >> > >> >
> >> >> > >>
> >> >> >
> https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201911-I
> !!P
> >> >> > >> DF-E&type=items
> >> >> > >> >
> >> >> > >> > also copy some here for your convenient.
> >> >> > >> > ar_bounding_box_top[ ar_object_idx[ i ] ] u(16)
> >> >> > >> > ar_bounding_box_left[ ar_object_idx[ i ] ] u(16)
> >> >> > >> > ar_bounding_box_width[ ar_object_idx[ i ] ] u(16)
> >> >> > >> > ar_bounding_box_height[ ar_object_idx[ i ] ] u(16)
> >> >> > >> >
> >> >> > >> > I would prefer to use bounding box.
> >> >> > >> >
> >> >> > >>
> >> >> > >> It's for an entirely different thing, and like I said, it's just an
> overloaded
> >> >> > >> language because they can get away. We're trying to be generic.
> >> >> > >> This side data is for detecting _and_ classifying objects. In fact,
> most of
> >> >> > >> the structure is dedicated towards classifying. If you'd like users
> to actually
> >> >> > >> use this, give it a good name and don't leave them guessing
> what this
> >> >> > >> structure is by throwing vague jargon some other paper or spec
> has
> >> >> > >> because it's close enough.
> >> >> > >>
> >> >> > >
> >> >> > > all the people in the domain accepts bounding box, they can
> understand this
> >> >> > > struct name easily and clearly, they might be bothered if we use
> another
> >> >> > name.
> >> >> > >
> >> >> > > btw, AVObjectClassification confuses people who just want object
> detection.
> >> >> > >
> >> >> >
> >> >> > As I said, the name "bounding box" makes no sense once it gets
> overloaded
> >> >> > with object classification.
> >> >>
> >> >> dnn_detect creates an array of 'bounding box' for all detected
> objects, and
> >> >> dnn_classify assigns attributes for a set of bounding boxes (with
> same object
> >> >> id). 'bounding box' serves both detection and classification properly.
> >> >>
> >> >>
> >> >> > Object classification is still the main use of the filters,
> >> >> > because the original proposal was to have all of this info be
> ffmpeg-private,
> >> >> > which would forbid simple object detection.
> >> >>
> >> >> The original proposal is to add it as side data which is ffmpeg-public,
> and then,
> >> >> we spent much time discussing/trying with ffmpeg-private as an
> temporary
> >> >> method, and since it is not good to be temporary, we now switch
> back to
> >> >> ffmpeg-public.
> >> >>
> >> >> During the whole period, we don't have any intention to
> >> >> 'forbid simple object detection', not quite understand your point
> here.
> >> >>
> >> >>
> >> >> > So I still maintain this should be called "Object classification". I'd
> accept
> >> >> > "Object detection" as well, but definitely not "bounding box".
> >> >>
> >> >> imho, ' Object detection' and ' Object classification' are worse, they
> just
> >> >> describe one aspect of the struct. The users might just use filter
> dnn_detect,
> >> >> they might use filters dnn_detect + dnn_classify.
> >> >>
> >> >>
> >> >> >
> >> >> > Since the decision was made to make the side data public, we have
> to make
> >> >> > very sure it contains no hacks or is impossible to extend, since we
> don't want
> >> >> > to have an
> >> >> >
> "AV_SIDE_DATA_OBJECT_CLASSIFICATION_VERSION_2_SORRY_WE_SCREWE
> D_
> >> >> > UP"
> >> >> > faster than you can say "Recursive cascade correlation artificial
> neural
> >> >> > networks".
> >> >>
> >> >> sorry, not quite understand your point here.
> >> >>
> >> >> 'bounding box' is designed for general purpose to contain the info
> for
> >> >> detection/classification. It doesn't matter which DNN model is used,
> it doesn't
> >> >> matter if a traditional algorithm (non-dnn) is used.
> >> >>
> >> >> I'm open to use a better name. And bounding box is the best one for
> me till now.
> >> >> Everyone in the domain knows the exact meaning of bounding box
> without
> >> >> extra explanation. This word has been extended/evolved with such
> meaning in
> >> >> this domain.
> >> >>
> >> > +1
> >> >
> >> > I think it is wise to use the name which is widely used in the field.
> >> >
> >>
> >> Way to go, ignoring half the exchange to voice an opinion after we
> >> came to an agreement.
> >>
> > Indeed that's the opposite, I gave my opinion *precisely* because I'm
> > following the whole discussion.
> > IMHO, ignoring a naming standard in a whole field is not wise.
> >
> 
> The convention is vague, broad, inappropriate, non-specific, incomplete,
> will be mistaken for something it is not, something that's not it not will be
> mistaken for what it is, and finally it doesn't cover the second part of what
> it will be used for - classification.

'bounding box' serves both detection and classification, the detection
detects objects and creates an array of bounding box, the classification
assigns an attribute for each bounding box.

'object detection' or 'object classification' just covers one part.

> We shouldn't follow a pseudo-naming convention if it clashes with other
> parts of our code and isn't descriptive to the degree I described.
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> 
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request at ffmpeg.org with subject "unsubscribe".