[FFmpeg-devel] [PATCH V2 08/10] libavutil: add side data AVDnnBoundingBox for dnn based detect/classify filters

Thu Feb 11 00:19:24 EET 2021

On 10/02/2021 09:34, Guo, Yejun wrote:
> Signed-off-by: Guo, Yejun <yejun.guo at intel.com>
> ---
>   doc/APIchanges       |  2 ++
>   libavutil/Makefile   |  1 +
>   libavutil/dnn_bbox.h | 68 ++++++++++++++++++++++++++++++++++++++++++++
>   libavutil/frame.c    |  1 +
>   libavutil/frame.h    |  7 +++++
>   libavutil/version.h  |  2 +-
>   6 files changed, 80 insertions(+), 1 deletion(-)
>   create mode 100644 libavutil/dnn_bbox.h

What is the intended consumer of this box information?  (Is there some other filter which will read these and do something with them, or some sort of user program?)

If there is no use in ffmpeg outside libavfilter then the header should probably be in libavfilter.

How tied is this to the DNN implementation, and hence the DNN name?  If someone made a standalone filter doing object detection by some other method, would it make sense for them to reuse this structure?

> diff --git a/libavutil/dnn_bbox.h b/libavutil/dnn_bbox.h
> new file mode 100644
> index 0000000000..50899c4486
> --- /dev/null
> +++ b/libavutil/dnn_bbox.h
> @@ -0,0 +1,68 @@
> +/*
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#ifndef AVUTIL_DNN_BBOX_H
> +#define AVUTIL_DNN_BBOX_H
> +
> +#include "rational.h"
> +
> +typedef struct AVDnnBoundingBox {
> +    /**
> +     * Must be set to the size of this data structure (that is,
> +     * sizeof(AVDnnBoundingBox)).
> +     */
> +    uint32_t self_size;
> +
> +    /**
> +     * Object detection is usually applied to a smaller image that
> +     * is scaled down from the original frame.
> +     * width and height are attributes of the scaled image, in pixel.
> +     */
> +    int model_input_width;
> +    int model_input_height;

Other than to interpret the distances below, what will the user do with this information?  (Alternatively: why not map the distances back onto the original frame size?)
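
To illustrate the alternative, here is a rough sketch of mapping a box from the scaled model input back onto the original frame. It assumes a plain linear scale with no letterboxing or cropping, which may not match what the filter actually does; Box, rescale() and map_to_frame() are made-up names for illustration, not anything in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the four box edges in the patch. */
typedef struct Box {
    int top, left, bottom, right;
} Box;

/* Rescale one non-negative coordinate from model-input size to frame
 * size, rounding to nearest.  64-bit intermediate avoids overflow. */
static int rescale(int v, int frame_dim, int model_dim)
{
    assert(model_dim > 0);
    return (int)(((int64_t)v * frame_dim + model_dim / 2) / model_dim);
}

/* Map a box given in model-input pixels onto the original frame,
 * assuming the model input is a straight scale of the whole frame. */
static Box map_to_frame(Box in, int model_w, int model_h,
                        int frame_w, int frame_h)
{
    Box out;
    out.left   = rescale(in.left,   frame_w, model_w);
    out.right  = rescale(in.right,  frame_w, model_w);
    out.top    = rescale(in.top,    frame_h, model_h);
    out.bottom = rescale(in.bottom, frame_h, model_h);
    return out;
}
```

If the filter did this itself before attaching the side data, the model input size would not need to be exported at all.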

> +
> +    /**
> +     * Distance in pixels from the top edge of the scaled image to top
> +     * and bottom, and from the left edge of the scaled image to left and
> +     * right, defining the bounding box.
> +     */
> +    int top;
> +    int left;
> +    int bottom;
> +    int right;
> +
> +    /**
> +     * Detect result
> +     */
> +    int detect_label;

How does a user interpret this label?  Is it from some known enum?

> +    AVRational detect_conf;

"conf"... idence?  A longer name and a descriptive comment might help.

> +
> +    /**
> +     * At most 4 classifications based on the detected bounding box.
> +     * For example, we can get max 4 different attributes with 4 different
> +     * DNN models on one bounding box.
> +     * classify_count is zero if no classification.
> +     */
> +#define AV_NUM_BBOX_CLASSIFY 4
> +    uint32_t classify_count;
> +    int classify_labels[AV_NUM_BBOX_CLASSIFY];
> +    AVRational classify_confs[AV_NUM_BBOX_CLASSIFY];

Same comment on these.

> +} AVDnnBoundingBox;
> +
> +#endif
> diff --git a/libavutil/frame.c b/libavutil/frame.c
> index eab51b6a32..4308507827 100644
> --- a/libavutil/frame.c
> +++ b/libavutil/frame.c
> @@ -852,6 +852,7 @@ const char *av_frame_side_data_name(enum AVFrameSideDataType type)
>       case AV_FRAME_DATA_VIDEO_ENC_PARAMS:            return "Video encoding parameters";
>       case AV_FRAME_DATA_SEI_UNREGISTERED:            return "H.26[45] User Data Unregistered SEI message";
>       case AV_FRAME_DATA_FILM_GRAIN_PARAMS:           return "Film grain parameters";
> +    case AV_FRAME_DATA_DNN_BBOXES:                  return "DNN bounding boxes";
>       }
>       return NULL;
>   }
> diff --git a/libavutil/frame.h b/libavutil/frame.h
> index 1aeafef6de..a4dcfd27c9 100644
> --- a/libavutil/frame.h
> +++ b/libavutil/frame.h
> @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
>        * Must be present for every frame which should have film grain applied.
>        */
>       AV_FRAME_DATA_FILM_GRAIN_PARAMS,
> +
> +    /**
> +     * Bounding box generated by dnn based filters for object detection and classification,
> +     * the data is an array of AVDnnBoudingBox, the number of array element is implied by
> +     * AVFrameSideData.size / AVDnnBoudingBox.self_size.
> +     */
> +    AV_FRAME_DATA_DNN_BBOXES,
>   };
>   
>   enum AVActiveFormatDescription {
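
For reference, the counting rule in that comment (size / self_size) would look something like this on the consumer side. This is only a sketch under my own assumptions: BBox is a trimmed-down stand-in for the proposed structure, bbox_count() and bbox_at() are hypothetical helpers, and the buffer is a raw byte array rather than a real AVFrameSideData.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for AVDnnBoundingBox: only the leading
 * self_size field and the box edges are reproduced here. */
typedef struct BBox {
    uint32_t self_size;
    int top, left, bottom, right;
} BBox;

/* Number of boxes in a side-data payload, or -1 if it is malformed.
 * Simplification: a stride smaller than this reader's BBox is rejected. */
static int bbox_count(const uint8_t *data, size_t size)
{
    uint32_t stride;

    if (size == 0)
        return 0;
    if (size < sizeof(stride))
        return -1;
    memcpy(&stride, data, sizeof(stride)); /* self_size of the first element */
    if (stride < sizeof(BBox) || size % stride)
        return -1;
    return (int)(size / stride);
}

/* Pointer to box i, stepping by the recorded stride rather than by
 * sizeof(BBox), so a producer with a larger structure still works.
 * The caller must have validated the payload with bbox_count(). */
static const BBox *bbox_at(const uint8_t *data, size_t i)
{
    uint32_t stride;

    memcpy(&stride, data, sizeof(stride));
    return (const BBox *)(data + i * stride);
}
```

Note that every reader has to carry this stride logic around, which is part of why it matters who the intended consumers are.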
- Mark