[FFmpeg-devel] [PATCH V2 08/10] libavutil: add side data AVDnnBoundingBox for dnn based detect/classify filters

Tue Feb 16 01:47:44 EET 2021

On 11/02/2021 08:15, Guo, Yejun wrote:
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Mark
>> Thompson
>> Sent: 2021年2月11日 6:19
>> To: ffmpeg-devel at ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH V2 08/10] libavutil: add side data
>> AVDnnBoundingBox for dnn based detect/classify filters
>>
>> On 10/02/2021 09:34, Guo, Yejun wrote:
>>> Signed-off-by: Guo, Yejun <yejun.guo at intel.com>
>>> ---
>>>    doc/APIchanges       |  2 ++
>>>    libavutil/Makefile   |  1 +
>>>    libavutil/dnn_bbox.h | 68
>> ++++++++++++++++++++++++++++++++++++++++++++
>>>    libavutil/frame.c    |  1 +
>>>    libavutil/frame.h    |  7 +++++
>>>    libavutil/version.h  |  2 +-
>>>    6 files changed, 80 insertions(+), 1 deletion(-)
>>>    create mode 100644 libavutil/dnn_bbox.h
>>
>> What is the intended consumer of this box information?  (Is there some other
>> filter which will read these are do something with them, or some sort of user
>> program?)
>>
>> If there is no use in ffmpeg outside libavfilter then the header should probably
>> be in libavfilter.
> 
> 
> Thanks for the feedback.
> 
> For most case, other filters will use this box information,
> for example, a classify filter will recognize the car number after the car plate is detected,
> another filter can apply 'beauty' if a face is detected, and updated drawbox filter (in plan)
> can draw the box for visualization, and a new filter such as bbox_to_roi can be added
> to apply roi encoding for the detected result.
> 
> It is possible that some others will use it,
> for example, the new codec is adding AI labels and so libavcodec might need it in the future,
> and a user program might do something special such as:
> 1. use libavcodec to decode
> 2. use filter detect
> 3. write his own code to handle the detect result
> 
> As the first step, how about to put it in the libavfilter (so do not expose it
> at API level and we are free to change it when needed)? And we can move
> it to libavutil once it is required.

Sure.

>> How tied is this to the DNN implementation, and hence the DNN name?  If
>> someone made a standalone filter doing object detection by some other
>> method, would it make sense for them to reuse this structure?
> 
> Yes, this structure is general, I add dnn prefix because of two reasons:
> 1. There's already bounding box in libavfilter/bbox.h, see below, it's simple and we could
> not reuse it, so we need a new name.
> typedef struct FFBoundingBox {
>      int x1, x2, y1, y2;
> } FFBoundingBox;

Right, really this is just the return type for the internal ff_calculate_bounding_box() function - if you want to reuse the name externally then it would be fine to rename the existing stuff to get it out of the way.

> 2. DNN is currently the dominate method for object detection.

Unless your ID values or something else about the output are DNN-specific then I'm not really seeing the attraction of associating them with the DNN name for external use.  If a user wants to detect some objects in an image and then do something with the result then maybe they know they are using DNN for first step, but they won't care about where the result came from after that.

> How about to use 'struct BoudingBox' when it is under libavfilter (added into current file bbox.h),
> and rename to 'AVBoudingBox' when it is needed to move to libavutil?

Hmm.  What are we actually representing here?  It's not just a bounding box because the structure is also attaching a specific meaning to it - that we think the bounding box corresponds to the location of some particular object in the image data.

Given that, maybe we would be better with something like "ObjectLocation"?  (Just an idea, feel free to ignore me and suggest something else.)

>>> diff --git a/libavutil/dnn_bbox.h b/libavutil/dnn_bbox.h new file mode
>>> 100644 index 0000000000..50899c4486
>>> --- /dev/null
>>> +++ b/libavutil/dnn_bbox.h
>>> +
>>> +    /**
>>> +     * Object detection is usually applied to a smaller image that
>>> +     * is scaled down from the original frame.
>>> +     * width and height are attributes of the scaled image, in pixel.
>>> +     */
>>> +    int model_input_width;
>>> +    int model_input_height;
>>
>> Other than to interpret the distances below, what will the user do with this
>> information?  (Alternatively: why not map the distances back onto the
>> original frame size?)
> 
> My idea was: if the original frame is scaled sometime later, the side data (bbox)
> keeps correct without any modification.
> 
> If we use the distance on the original frame size, shall we say the behavior
> is undefined when it is scaled sometime later?

This probably shouldn't be a specific concern - many other existing side-data types are invalidated by scaling (or other non-scaling filtering transformations, for that matter).

Unless there is an independent reason to keep the model size (which there might well be, I don't know) then I think it would be better to use the actual frame size.

>>> +
>>> +    /**
>>> +     * Distance in pixels from the top edge of the scaled image to top
>>> +     * and bottom, and from the left edge of the scaled image to left and
>>> +     * right, defining the bounding box.
>>> +     */
>>> +    int top;
>>> +    int left;
>>> +    int bottom;
>>> +    int right;
>>> +
>>> +    /**
>>> +     * Detect result
>>> +     */
>>> +    int detect_label;
>>
>> How does a user interpret this label?  Is it from some known enum?
> 
> The mappings between label_id (int) and label_name (string) are not part of
> the model file, it is usually another file provided together with the model file.
> The mappings are specific with the model file.
> 
> My idea was to provide the mapping file only when it is needed, for example,
> when draw the box and text with filter drawbox/drawtext to visualize the bounding box.

So every component involved would need to be separately supplied with the same mapping information?

> If we don't care the size in side_data, we can add label name into this struct,
> for example.
> #define BBOX_LABEL_NAME_MAX_LENGTH 32
> int detect_label_id;
> char detect_label_name[BBOX_LABEL_NAME_MAX_LENGTH+1];
> int classify_label_ids[AV_NUM_BBOX_CLASSIFY];
> char classify_label_names[AV_NUM_BBOX_CLASSIFY][ BBOX_LABEL_NAME_MAX_LENGTH+1]

Assuming the overhead isn't excessive and there isn't other useful metadata in the model-information file, supplying the names inline does seem likely to be less confusing for users so they don't have to be careful with the mapping information.

(On top of that - do the IDs do anything at all if you have the names?)

Thanks,

- Mark