[FFmpeg-devel] [PATCHv2] add signature filter for MPEG7 video signature

Mon Apr 11 14:30:37 CEST 2016

On Monday, 11 April 2016 12:57:17 CEST Michael Niedermayer wrote:
> On Mon, Apr 11, 2016 at 04:25:28AM +0200, Gerion Entrup wrote:
> > On Thursday, 7 April 2016 00:35:25 CEST Michael Niedermayer wrote:
> > > On Wed, Mar 30, 2016 at 11:02:36PM +0200, Gerion Entrup wrote:
> > > > On Wednesday, 30 March 2016 22:57:47 CEST Gerion Entrup wrote:
> > > > > Add improved patch.
> > > > 
> > > > Rebased to master.
> > > > 
> > > 
> > > >  Changelog                      |    1 
> > > >  configure                      |    1 
> > > >  doc/filters.texi               |   70 +++
> > > >  libavfilter/Makefile           |    1 
> > > >  libavfilter/allfilters.c       |    1 
> > > >  libavfilter/signature.h        |  554 ++++++++++++++++++++++++++++++
> > > >  libavfilter/signature_lookup.c |  550 ++++++++++++++++++++++++++++++
> > > >  libavfilter/version.h          |    4 
> > > >  libavfilter/vf_signature.c     |  741 +++++++++++++++++++++++++++++++++++++++++
> > > >  9 files changed, 1921 insertions(+), 2 deletions(-)
> > > > 9192f27ded45c607996b4e266b6746f807c9a7fd  0001-add-signature-filter-for-MPEG7-video-signature.patch
> > > > From 9646ed6f0cf78356cf2914a60705c98d8f21fe8a Mon Sep 17 00:00:00 2001
> > > > From: Gerion Entrup <gerion.entrup at flump.de>
> > > > Date: Sun, 20 Mar 2016 11:10:31 +0100
> > > > Subject: [PATCH] add signature filter for MPEG7 video signature
> > > > 
> > > > This filter does not implement all features of MPEG7. Missing features:
> > > > - compression of signature files
> > > > - work only on (cropped) parts of the video
> > > > ---
> > > >  Changelog                      |   1 +
> > > >  configure                      |   1 +
> > > >  doc/filters.texi               |  70 ++++
> > > >  libavfilter/Makefile           |   1 +
> > > >  libavfilter/allfilters.c       |   1 +
> > > >  libavfilter/signature.h        | 554 ++++++++++++++++++++++++++++++
> > > >  libavfilter/signature_lookup.c | 550 ++++++++++++++++++++++++++++++
> > > >  libavfilter/version.h          |   4 +-
> > > >  libavfilter/vf_signature.c     | 741 +++++++++++++++++++++++++++++++++++++++++
> > > >  9 files changed, 1921 insertions(+), 2 deletions(-)
> > > >  create mode 100644 libavfilter/signature.h
> > > >  create mode 100644 libavfilter/signature_lookup.c
> > > >  create mode 100644 libavfilter/vf_signature.c
> > > > 
> > > > diff --git a/Changelog b/Changelog
> > > > index 7b0187d..8a2b7fd 100644
> > > > --- a/Changelog
> > > > +++ b/Changelog
> > > > @@ -18,6 +18,7 @@ version <next>:
> > > >  - coreimage filter (GPU based image filtering on OSX)
> > > >  - libdcadec removed
> > > >  - bitstream filter for extracting DTS core
> > > > +- MPEG-7 Video Signature filter
> > > >  
> > > >  version 3.0:
> > > >  - Common Encryption (CENC) MP4 encoding and decoding support
> > > > diff --git a/configure b/configure
> > > > index e550547..fe29827 100755
> > > > --- a/configure
> > > > +++ b/configure
> > > > @@ -2979,6 +2979,7 @@ showspectrum_filter_deps="avcodec"
> > > >  showspectrum_filter_select="fft"
> > > >  showspectrumpic_filter_deps="avcodec"
> > > >  showspectrumpic_filter_select="fft"
> > > > +signature_filter_deps="gpl avcodec avformat"
> > > >  smartblur_filter_deps="gpl swscale"
> > > >  sofalizer_filter_deps="netcdf avcodec"
> > > >  sofalizer_filter_select="fft"
> > > > diff --git a/doc/filters.texi b/doc/filters.texi
> > > > index 5d6cf52..a95f5a7 100644
> > > > --- a/doc/filters.texi
> > > > +++ b/doc/filters.texi
> > > > @@ -11559,6 +11559,76 @@ saturation maximum: %@{metadata:lavfi.signalstats.SATMAX@}
> > > >  @end example
> > > >  @end itemize
> > > >  
> > > > +@anchor{signature}
> > > > +@section signature
> > > > +
> > > > +Calculates the MPEG-7 Video Signature. The filter can handle more than one
> > > > +input. In this case the matching between the inputs can be calculated. The
> > > > +filter passes the first input through. The output is written in XML.
> > > > +
> > > > +It accepts the following options:
> > > > +
> > > > +@table @option
> > > > +@item mode
> > > 
> > > > +Enable the calculation of the matching. The option value must be 0 (to disable)
> > > > +or 1 (to enable). Optionally you can set the mode to 2. Then the detection ends
> > > > +as soon as the first matching sequence is reached. This should be slightly faster.
> > > > +By default the detection is disabled.
> > > 
> > > these should probably support named identifiers, not (only) 0/1/2
> > done
> 
> it should use AV_OPT_TYPE_INT and AV_OPT_TYPE_CONST not a string
> 
> 
> > 
> > > 
> > > 
> > > > +
> > > > +@item nb_inputs
> > > > +Set the number of inputs. The option value must be a non-negative integer.
> > > > +Default value is 1.
> > > > +
> > > > +@item filename
> > > > +Set the path to which the output is written. If there is more than one input,
> > > > +the path must be a prototype, i.e. must contain %d or %0nd (where n is a positive
> > > > +integer), that will be replaced with the input number. If no filename is
> > > > +specified, no output will be written. This is the default.
> > > > +
> > > 
> > > > +@item xml
> > > > +Choose the output format. If set to 1 the filter will write XML, if set to 0
> > > > +the filter will write binary output. The default is 0.
> > > 
> > > format=xml/bin/whatever
> > > seems better as its more extensible
> > done
> > 
> > > 
> > > 
> > > > +
> > > > +@item th_d
> > > > +Set threshold to detect one word as similar. The option value must be an integer
> > > > +greater than zero. The default value is 9000.
> > > > +
> > > > +@item th_dc
> > > > +Set threshold to detect all words as similar. The option value must be an integer
> > > > +greater than zero. The default value is 60000.
> > > > +
> > > > +@item th_xh
> > > > +Set threshold to detect frames as similar. The option value must be an integer
> > > > +greater than zero. The default value is 116.
> > > > +
> > > > +@item th_di
> > > > +Set the minimum length of a sequence in frames to recognize it as matching
> > > > +sequence. The option value must be a non negative integer value.
> > > > +The default value is 0.
> > > > +
> > > > +@item th_it
> > > > +Set the minimum relation, that matching frames to all frames must have.
> > > > +The option value must be a double value between 0 and 1. The default value is 0.5.
> > > > +@end table
> > > > +
> > > > +@subsection Examples
> > > > +
> > > > +@itemize
> > > > +@item
> > > > +To calculate the signature of an input video and store it in signature.xml:
> > > > +@example
> > > > +ffmpeg -i input.mkv -vf signature=filename=signature.xml -map 0:v -c rawvideo -f null -
> > > > +@end example
> > > 
> > > the output seems to differ between 32 and 64bit x86
> > > this would make any regression testing rather difficult
> > > why is there a difference ? can this be avoided or would that result in
> > > some disadvantage ?
> > This is due to this line:
> > sum -= ((double) blocksum)/(blocksize * denum);
> > 
> > sum was a double. It seems the difference leads to different results on 32 and 64 bit
> > (in the fifth decimal place). I have reworked the filter part so it does not use double at all.
> > This also leads to somewhat fewer divisions, but the numbers get really big. The relevant
> > parts use int64_t.
> > 
> > If the videos get really big, the numbers could overflow. Can I restrict this somehow?
> > 
> > An upper bound can be found with:
> > 255 * BLOCK_LCM * (width/32+1)^2 * (height/32+1)^2 < 2^63
> > I tested it with 4K (UHD) input. This does not give any problems, but it is near the limit.
> > (As a note: 4K in particular is a certain amount under the limit, because the width 3840 is
> > divisible by 32, so the square in the above formula can be dropped)
> > 
> > The filter should now generate the same signatures on 32 and 64 bit as it did on 64 bit before.
> 
> if you really need more than 64bit ints you can take a look at
> libavutil/integer.h
> it would be better if the operations can be reshuffled to keep using
> intXY_t
That depends. IMHO 4K UHD is enough for now. Given that you can simply rescale a higher
resolution to something below that without changing how the signature works, I would simply add
a check in config_input or so that throws an error if the resolution is too high. Would this be ok?
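
A minimal sketch of such a check, assuming the bound from my previous mail
(255 * BLOCK_LCM * (width/32+1)^2 * (height/32+1)^2 < 2^63). The function name is made up
and BLOCK_LCM is passed in as a parameter only for illustration. It is not taken from the patch:

```c
#include <stdint.h>

/* Returns 1 if 255 * block_lcm * (w/32+1)^2 * (h/32+1)^2 stays below 2^63,
 * 0 otherwise. The bound is tested by dividing the limit step by step, so
 * the test itself cannot overflow.
 * block_lcm stands in for BLOCK_LCM (hypothetical parameter, must be > 0). */
static int signature_resolution_ok(int w, int h, uint64_t block_lcm)
{
    uint64_t limit = (UINT64_C(1) << 63) - 1; /* largest value below 2^63 */
    uint64_t bw, bh;

    if (w <= 0 || h <= 0 || block_lcm == 0)
        return 0;

    bw = (uint64_t)w / 32 + 1;
    bh = (uint64_t)h / 32 + 1;

    /* floor(floor(L / a) / b) == floor(L / (a * b)) for positive integers */
    limit /= 255;
    limit /= block_lcm;
    limit /= bw * bw;
    limit /= bh * bh;
    return limit >= 1;
}
```

In the filter, config_input could return AVERROR(EINVAL) when this returns 0.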

I only generate these high numbers to keep precision. As an explanation: the algorithm uses a lot
of divisions. To avoid doubles (and high numbers) one could use AVRational, but it is faster
if all of the numbers are brought to the same denominator beforehand. Because in the end only
the relation between the numbers matters, I avoid dividing at all, because for c > 0:
a/c < b/c  <=>  a < b
so c is irrelevant.
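
As an illustration with made-up names (this is not the code from the patch): two block means
with different block sizes can be compared by scaling both to a common denominator and
looking only at the numerators, so neither a division nor a double is needed.

```c
#include <stdint.h>

/* Compare sum_a/size_a with sum_b/size_b without dividing.
 * Both fractions are scaled to the common denominator size_a * size_b.
 * That denominator is positive, so comparing the numerators is enough.
 * Returns -1, 0 or 1. Assumes size_a > 0, size_b > 0 and that the
 * products fit into int64_t. */
static int cmp_mean(int64_t sum_a, int64_t size_a, int64_t sum_b, int64_t size_b)
{
    int64_t na = sum_a * size_b; /* numerator of a over the common denominator */
    int64_t nb = sum_b * size_a; /* numerator of b over the common denominator */
    return (na > nb) - (na < nb);
}
```

For example, cmp_mean(10, 4, 7, 3) compares 30 with 28 instead of 2.5 with 2.333...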

> 
> 
> [...]
> > > 
> > > 
> > > [...]
> > > 
> > > > +static MatchingInfo* get_matching_parameters(AVFilterContext *ctx, SignatureContext *sc, FineSignature *first, FineSignature *second)
> > > > +{
> > > > +    FineSignature *f,*s;
> > > > +    size_t i, j, k, l, hmax = 0, score;
> > > > +    int framerate, offset, l1dist;
> > > > +    double m;
> > > > +    MatchingInfo *cands = NULL, *c = NULL;
> > > > +
> > > > +    struct {
> > > > +        uint8_t size;
> > > > +        unsigned int dist;
> > > > +        FineSignature *a;
> > > > +        uint8_t b_pos[COURSE_SIZE];
> > > > +        FineSignature *b[COURSE_SIZE];
> > > > +    } pairs[COURSE_SIZE];
> > > > +
> > > > +    struct {
> > > > +        int dist;
> > > > +        size_t score;
> > > > +        FineSignature *a;
> > > > +        FineSignature *b;
> > > > +    } hspace[MAX_FRAMERATE][2*HOUGH_MAX_OFFSET+1]; /* houghspace */
> > > 
> > > stack space is not unlimited, some platforms have rather little
> > > i dont know if above is too large but it might be
> > This is 60*(2*90+1) = 11K entries. This is taken from the reference code, although the lookup itself
> > is not standardized. Nevertheless I don't see an easy way to reduce this.
> > The houghspace represents a matrix. Theoretically it would be possible to use a list
> > instead of a matrix, but this would lead to clearly more complicated code.
> > 
> 
> > If you find it better I can allocate the space in the heap. Maybe this solves the problem.
> 
> yes, that was what i had in mind, maybe it can be put in some existing
> context to avoid explicit alloc (and av_freep handling)
The houghspace is only used in the lookup. Putting it into an existing context would allocate those 11K entries
for nothing in a fair number of use cases.
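
Allocating it only inside the lookup could look roughly like the following sketch. The type
and function names are made up, the struct is a stand-in for the anonymous one in the patch,
and plain calloc/free are used here only to keep the example self-contained. The filter
would use the av_malloc family and av_freep instead.

```c
#include <stdlib.h>

#define MAX_FRAMERATE    60 /* values as in the calculation quoted above */
#define HOUGH_MAX_OFFSET 90

typedef struct HoughCell { /* hypothetical stand-in for the anonymous struct */
    int dist;
    size_t score;
    void *a; /* FineSignature * in the patch */
    void *b; /* FineSignature * in the patch */
} HoughCell;

/* One row of the matrix; a pointer to rows can be indexed as hspace[i][j]. */
typedef HoughCell HoughRow[2 * HOUGH_MAX_OFFSET + 1];

/* Allocate the zero-initialized matrix on the heap, NULL on failure. */
static HoughRow *hough_alloc(void)
{
    return calloc(MAX_FRAMERATE, sizeof(HoughRow));
}

/* Free the matrix and reset the caller's pointer, same idea as av_freep(). */
static void hough_free(HoughRow **hspace)
{
    free(*hspace);
    *hspace = NULL;
}
```

This way the memory is only requested when a lookup actually runs, and the indexing in the
rest of the function stays the same as with the array on the stack.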

> 
> 
> [...]
> > > [...]
> > > > +static int filter_frame(AVFilterLink *inlink, AVFrame *picref)
> > > > +{
> > > > +    AVFilterContext *ctx = inlink->dst;
> > > > +    SignatureContext *sic = ctx->priv;
> > > > +    StreamContext *sc = &(sic->streamcontexts[FF_INLINK_IDX(inlink)]);
> > > > +    FineSignature* fs;
> > > > +
> > > > +    static const uint8_t pot3[5] = { 3*3*3*3, 3*3*3, 3*3, 3, 1 };
> > > > +    /* indexes of words : 210,217,219,274,334  44,175,233,270,273  57,70,103,237,269  100,285,295,337,354  101,102,111,275,296
> > > > +    s2usw = sorted to unsorted wordvec: 44 is at index 5, 57 at index 10...
> > > > +    */
> > > > +    static const unsigned int wordvec[25] = {44,57,70,100,101,102,103,111,175,210,217,219,233,237,269,270,273,274,275,285,295,296,334,337,354};
> > > > +    static const uint8_t s2usw[25]   = { 5,10,11, 15, 20, 21, 12, 22,  6,  0,  1,  2,  7, 13, 14,  8,  9,  3, 23, 16, 17, 24,  4, 18, 19};
> > > > +
> > > > +    uint8_t wordt2b[5] = { 0, 0, 0, 0, 0 }; /* word ternary to binary */
> > > > +    uint64_t intpic[32][32];
> > > > +    uint64_t rowcount;
> > > > +    uint8_t *p = picref->data[0];
> > > > +    int inti, intj;
> > > > +    int *intjlut;
> > > > +
> > > > +    double conflist[DIFFELEM_SIZE];
> > > > +    int f = 0, g = 0, w = 0;
> > > > +    int dh1 = 1, dh2 = 1, dw1 = 1, dw2 = 1, denum, a, b;
> > > > +    int i,j,k,ternary;
> > > > +    uint64_t blocksum;
> > > > +    int blocksize;
> > > > +    double th; /* threshold */
> > > > +    double sum;
> > > > +
> > > > +    /* initialize fs */
> > > > +    if(sc->curfinesig){
> > > > +        fs = av_mallocz(sizeof(FineSignature));
> > > > +        if (!fs)
> > > > +            return AVERROR(ENOMEM);
> > > > +        sc->curfinesig->next = fs;
> > > > +        fs->prev = sc->curfinesig;
> > > > +        sc->curfinesig = fs;
> > > > +    }else{
> > > > +        fs = sc->curfinesig = sc->finesiglist;
> > > > +        sc->curcoursesig1->first = fs;
> > > > +    }
> > > > +
> > > > +    fs->pts = picref->pts;
> > > > +    fs->index = sc->lastindex++;
> > > > +
> > > > +    memset(intpic, 0, sizeof(uint64_t)*32*32);
> > > > +    intjlut = av_malloc(inlink->w * sizeof(int));
> > > > +    if (!intjlut)
> > > > +        return AVERROR(ENOMEM);
> > > > +    for (i=0; i < inlink->w; i++){
> > > > +        intjlut[i] = (i<<5)/inlink->w;
> > > > +    }
> > > > +
> > > > +    for (i=0; i < inlink->h; i++){
> > > > +        inti = (i<<5)/inlink->h;
> > > > +        for (j=0; j< inlink->w; j++){
> > > > +            intj = intjlut[j];
> > > > +            intpic[inti][intj] += p[j];
> > > > +        }
> > > > +        p += picref->linesize[0];
> > > > +    }
> > > > +    av_free(intjlut);
> > > 
> > > av_freep() is safer as use of freed memory becomes harder and more
> > > noticeable
> > I don't understand the function completely. Could you explain a little? The problem is that I can use
> > it only in exactly that place. Replacing other occurrences of av_free results in a segfault.
> 
> av_free(intjlut); ->av_freep(&intjlut);
> 
> 
> [...]
> > from my previous mails:
> > > BTW is division by 2 optimized out or is it better to use >> 1 ?
> > > - The timebase of the testfiles is 90000. In the binary output unfortunately there
> > > is only room for a 16 bit number, so this doesn't fit. Currently the code simply crops
> > > the remaining bits. Is there a better solution (divide by some number etc.)?
> > Would be nice, if you could comment.
> 
> i cant help if the spec gives you 16 bit and you want to store 17 in it
Of course. What I had in mind was that simply cropping bits leads to an extremely wrong number.
Dividing both the timebase and the timestamps by some number would lead to inexact
values, but maybe to values one can still work with. The reference signature, e.g., uses
another (much smaller) timebase for the same video.
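
What I mean is roughly the following sketch (not from the patch, all names made up): pick the
smallest integer divisor that makes the timebase fit into 16 bits and divide the timestamps by
the same number. Whether the result is still precise enough for the signature is an open question.

```c
#include <stdint.h>

/* Smallest divisor d >= 1 such that den / d fits into 16 bits,
 * i.e. d = ceil(den / 65535), computed without overflow. */
static uint32_t tb_divisor(uint32_t den)
{
    uint32_t d = den / UINT16_MAX + (den % UINT16_MAX != 0);
    return d ? d : 1;
}

/* Scale a timestamp by the same divisor, rounding to nearest.
 * pts is assumed to be non-negative. */
static int64_t scale_pts(int64_t pts, uint32_t d)
{
    return (pts + d / 2) / d;
}
```

For a timebase of 90000 this gives a divisor of 2, so the stored timebase would be 45000 and
every timestamp would be halved. If the timebase is not divisible by the divisor, the stored
timebase itself becomes inexact as well.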

> 
> 
> > 
> > I also added a few TODOs in the code for parts I am not sure about. It would be nice
> > if you could comment there, too.
> > 
> 
> > I attached the new (complete) patch, the diff to the last time and the updated check script.
> 
> looks like the old patch + diff to the new
Yes. I thought you could see the differences to the already reviewed patch much faster that way.

> 
> [...]
> > +static int filter_frame(AVFilterLink *inlink, AVFrame *picref)
> > +{
> > +    AVFilterContext *ctx = inlink->dst;
> > +    SignatureContext *sic = ctx->priv;
> > +    StreamContext *sc = &(sic->streamcontexts[FF_INLINK_IDX(inlink)]);
> > +    FineSignature* fs;
> > +
> > +    static const uint8_t pot3[5] = { 3*3*3*3, 3*3*3, 3*3, 3, 1 };
> > +    /* indexes of words : 210,217,219,274,334  44,175,233,270,273  57,70,103,237,269  100,285,295,337,354  101,102,111,275,296
> > +    s2usw = sorted to unsorted wordvec: 44 is at index 5, 57 at index 10...
> > +    */
> > +    static const unsigned int wordvec[25] = {44,57,70,100,101,102,103,111,175,210,217,219,233,237,269,270,273,274,275,285,295,296,334,337,354};
> > +    static const uint8_t s2usw[25]   = { 5,10,11, 15, 20, 21, 12, 22,  6,  0,  1,  2,  7, 13, 14,  8,  9,  3, 23, 16, 17, 24,  4, 18, 19};
> > +
> > +    uint8_t wordt2b[5] = { 0, 0, 0, 0, 0 }; /* word ternary to binary */
> > +    uint64_t intpic[32][32];
> > +    uint64_t rowcount;
> > +    uint8_t *p = picref->data[0];
> > +    int inti, intj;
> > +    int *intjlut;
> > +
> > +    double conflist[DIFFELEM_SIZE];
> > +    int f = 0, g = 0, w = 0;
> > +    int dh1 = 1, dh2 = 1, dw1 = 1, dw2 = 1, denum, a, b;
> > +    int i,j,k,ternary;
> > +    uint64_t blocksum;
> > +    int blocksize;
> > +    double th; /* threshold */
> > +    double sum;
> > +
> > +    /* initialize fs */
> > +    if(sc->curfinesig){
> > +        fs = av_mallocz(sizeof(FineSignature));
> > +        if (!fs)
> > +            return AVERROR(ENOMEM);
> > +        sc->curfinesig->next = fs;
> > +        fs->prev = sc->curfinesig;
> > +        sc->curfinesig = fs;
> > +    }else{
> > +        fs = sc->curfinesig = sc->finesiglist;
> > +        sc->curcoursesig1->first = fs;
> > +    }
> > +
> > +    fs->pts = picref->pts;
> > +    fs->index = sc->lastindex++;
> > +
> > +    memset(intpic, 0, sizeof(uint64_t)*32*32);
> > +    intjlut = av_malloc(inlink->w * sizeof(int));
> > +    if (!intjlut)
> > +        return AVERROR(ENOMEM);
> > +    for (i=0; i < inlink->w; i++){
> > +        intjlut[i] = (i<<5)/inlink->w;
> > +    }
> > +
> > +    for (i=0; i < inlink->h; i++){
> > +        inti = (i<<5)/inlink->h;
> > +        for (j=0; j< inlink->w; j++){
> > +            intj = intjlut[j];
> > +            intpic[inti][intj] += p[j];
> > +        }
> > +        p += picref->linesize[0];
> > +    }
> > +    av_free(intjlut);
> > +
> > +    /* The following calculates a summed area table (intpic) and brings the numbers
> > +     * in intpic to the same denominator.
> > +     * So you only have to handle the numerator in the following sections.
> > +     */
> > +    dh1 = inlink->h/32;
> > +    if (inlink->h%32)
> > +        dh2 = dh1 + 1;
> > +    dw1 = inlink->w/32;
> > +    if (inlink->w%32)
> > +        dw2 = dw1 + 1;
> 
> > +    denum = dh1 * dh2 * dw1 * dw2;
> 
> this will overflow if w and h are not multiples of 32 and large
> the multiplication is done in 32bit not 64
I don't get it. All of these are 32 bit integers. Given an input of
3842x2160 (nearly 4K), this would lead to a denum of:
120 * 121 * 67 * 68 = 66153120

This is far below the 32 bit maximum.
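
To make the numbers checkable, here is a small sketch that mirrors the dh1/dh2/dw1/dw2 logic
quoted above but multiplies in 64 bit, so the result can be compared against INT32_MAX. The
function name is made up:

```c
#include <stdint.h>

/* Same computation as in the patch, but in int64_t so that large inputs
 * show the true product instead of a wrapped 32 bit value. */
static int64_t signature_denum(int w, int h)
{
    int64_t dh1 = h / 32, dh2 = 1, dw1 = w / 32, dw2 = 1;

    if (h % 32)
        dh2 = dh1 + 1; /* height is not a multiple of 32 */
    if (w % 32)
        dw2 = dw1 + 1; /* width is not a multiple of 32 */
    return dh1 * dh2 * dw1 * dw2;
}
```

signature_denum(3842, 2160) gives 66153120 as calculated above. The product only exceeds the
32 bit range for far larger sizes that are not multiples of 32, e.g. 16385x16385
(512 * 513 * 512 * 513).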

> 
> 
> [...]
> > @@ -138,6 +145,47 @@ static void set_bit(uint8_t* data, size_t pos)
> >      data[pos/8] |= mask;
> >  }
> >  
> > +/* TODO this is meant as some kind of lut, I assume the division is made at
> > + * compile time. Is it better, to write this with preprocessor?
> > + * (Or simply write: return BLOCK_LCM / n;), please comment */
> 
> replacing a BLOCK_LCM / n by a switch makes no sense IMHO
I thought I could use the fact that only a handful of values for n are possible. I could delete
the whole function, of course.

> 
> 
> [...]
> > @@ -242,10 +290,16 @@ static int filter_frame(AVFilterLink *inlink, AVFrame *picref)
> >  
> >      for (i=0; i< ELEMENT_COUNT; i++){
> >          const ElemCat* elemcat = elements[i];
> > -        double* elemsignature = av_malloc(sizeof(double) * elemcat->elem_count);
> > -        double* sortsignature = av_malloc(sizeof(double) * elemcat->elem_count);
> > -        if (!elemsignature || !sortsignature)
> > +        int64_t* elemsignature;
> > +        uint64_t* sortsignature;
> > +
> > +        elemsignature = av_malloc_array(elemcat->elem_count, sizeof(int64_t));
> > +        if (!elemsignature)
> > +            return AVERROR(ENOMEM);
> 
> > +        sortsignature = av_malloc_array(elemcat->elem_count, sizeof(uint64_t));
> 
> sizeof(*sortsignature) to avoid repeating the type
> 
> 
> [...]
> > @@ -510,15 +564,11 @@ static int binary_export(AVFilterContext *ctx, StreamContext *sc, char* filename
> >          put_bits32(&buf, 0xFFFFFFFF & cs->last->pts); /* EndMediaTimeOfSegment */
> >          for (i=0; i < 5; i++){
> >              /* put 243 bits ( = 7 * 32 + 19 = 8 * 28 + 19) into buffer */
> > -            for (j=0; j < 28; j+=4){
> > -                put_bits32(&buf, 0xFFFFFFFF & (cs->data[i][j]   << 24 |
> > -                                               cs->data[i][j+1] << 16 |
> > -                                               cs->data[i][j+2] <<  8 |
> > -                                               cs->data[i][j+3]));
> > +            for (j=0; j < 30; j++){
> 
> > +                //TODO is it faster to use put_bits32 and shift?
> 
> if this is speed relevant, then you could try both and benchmark
> with START/STOP_TIMER
> 
> 
> [...]
> 
>