[FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES

Guo, Yejun yejun.guo at intel.com
Thu Apr 8 08:35:41 EEST 2021



> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Lynne
> Sent: April 8, 2021 12:14
> To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
> AV_FRAME_DATA_BOUNDING_BOXES
> 
> Apr 8, 2021, 04:48 by yejun.guo at intel.com:
> 
> >> > +
> >> > +#ifndef AVUTIL_BOUNDINGBOX_H
> >> > +#define AVUTIL_BOUNDINGBOX_H
> >> > +
> >> > +#include "rational.h"
> >> > +#include "avassert.h"
> >> > +#include "frame.h"
> >> > +
> >> > +typedef struct AVBoundingBox {
> >> > +    /**
> >> > +     * Distance in pixels from the top edge of the frame to top
> >> > +     * and bottom, and from the left edge of the frame to left and
> >> > +     * right, defining the bounding box.
> >> > +     */
> >> > +    int top;
> >> > +    int left;
> >> > +    int bottom;
> >> > +    int right;
> >> >
> >>
> >> Why not x, y, w, h?
> >> It makes more sense, all of coordinates go like this.
> >>
> >
> > We want to keep it consistent with 'struct AVRegionOfInterest',
> > we also have a plan to add a filter which converts from bounding box
> > to RegionOfInterest which impacts the encoder.
> >
> 
> Not a good idea. Region of interest is only useful to indicate a
> single region, while this API describes multiple regions
> within the frame. The video_enc_params API follows the
> x, y, w, h because it's the easiest to work with, so I'm sure
> it's well worth going to such coordinates.
> 

RegionOfInterest is similar to a bounding box; it is also used for multiple
regions in an array, see AV_FRAME_DATA_REGIONS_OF_INTEREST.

Anyway, I'm OK with changing to x, y, w, h.

BTW, do we need to change RegionOfInterest to x, y, w, h as well? Is such
a change allowed after the current major version bump?

> >> > +
> >> > +typedef struct AVBoundingBoxHeader {
> >> > +    /**
> >> > +     * Information about how the bounding box is generated.
> >> > +     * for example, the DNN model name.
> >> > +     */
> >> > +    char source[128];
> >> > +
> >> > +    /**
> >> > +     * The size of frame when it is detected.
> >> > +     */
> >> > +    int frame_width;
> >> > +    int frame_height;
> >> >
> >>
> >> Why? This side data is attached to AVFrames only, where we
> >> already have width and height.
> >>
> >
> > The detection result will be used by other filters, for example,
> > dnn_classify (see https://github.com/guoyejun/ffmpeg/tree/classify).
> >
> > The filter dnn_detect detects all the objects (cat, dog, person ...) in a
> > frame, while dnn_classify classifies one detected object (for example, person)
> > for its attribute (for example, emotion, etc.)
> >
> > The filter dnn_classify have to check if the frame size is changed after
> > it is detected, to handle the below filter chain:
> > dnn_detect -> scale -> dnn_classify
> >
> 
> This doesn't look good. Why is dnn_classify needing to know
> the original frame size at all?

For example, the original size of the frame is 100x100, and dnn_detect
detects a face at (10, 10) -> (30, 40); this data is saved in
AVBoundingBox.top/left/right/bottom.

Then, the frame is scaled to 50x50.

Then, dnn_classify is used to analyze the emotion of the face. It needs to
know the frame size (100x100) at detection time; otherwise, it cannot
work with just (10, 10), (30, 40) and 50x50.

> >> > diff --git a/libavutil/frame.h b/libavutil/frame.h
> >> > index a5ed91b20a..41e22de02a 100644
> >> > --- a/libavutil/frame.h
> >> > +++ b/libavutil/frame.h
> >> > @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
> >> >  * Must be present for every frame which should have film grain applied.
> >> >  */
> >> >  AV_FRAME_DATA_FILM_GRAIN_PARAMS,
> >> > +
> >> > +    /**
> >> > +     * Bounding boxes for object detection and classification, the data is
> a
> >> AVBoundingBoxHeader
> >> > +     * followed with an array of AVBoudingBox, and
> >> AVBoundingBoxHeader.nb_bboxes is the number
> >> > +     * of array element.
> >> > +     */
> >> > +    AV_FRAME_DATA_BOUNDING_BOXES,
> >> >  };
> >> >
> >>
> >> Finally, why call it a Bounding Box? It's not descriptive at all.
> >> How about "Object Classification"? It makes much more sense, it's
> >> exactly what this is. So AVObjectClassification, AVObjectClassification,
> >> AV_FRAME_DATA_OBJECT_CLASSIFICATION and so on.
> >>
> >
> > In object detection papers, bounding box is usually used.
> > We'd better use the same term, imho, thanks.
> >
> 
> Not in this case, API users won't have any idea what this is or what
> it's for. This is user-facing code after all.
> Papers in fields can get away with overloading language, but we're
> trying to make a concise API. Object classification makes sense
> because this is exactly what this is.

The term bounding box dominates the domain; for example, even the
HEVC spec uses this term, see page 317 of
https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201911-I!!PDF-E&type=items

Also copying some here for your convenience.
ar_bounding_box_top[ ar_object_idx[ i ] ] u(16)
ar_bounding_box_left[ ar_object_idx[ i ] ] u(16)
ar_bounding_box_width[ ar_object_idx[ i ] ] u(16)
ar_bounding_box_height[ ar_object_idx[ i ] ] u(16)

I would prefer to use bounding box.
