[FFmpeg-devel] [PATCHv2] add signature filter for MPEG7 video signature

Mon Apr 11 04:25:28 CEST 2016

On Thursday, 7 April 2016 00:35:25 CEST Michael Niedermayer wrote:
> On Wed, Mar 30, 2016 at 11:02:36PM +0200, Gerion Entrup wrote:
> > On Wednesday, 30 March 2016 22:57:47 CEST Gerion Entrup wrote:
> > > Add improved patch.
> > 
> > Rebased to master.
> > 
> 
> >  Changelog                      |    1 
> >  configure                      |    1 
> >  doc/filters.texi               |   70 +++
> >  libavfilter/Makefile           |    1 
> >  libavfilter/allfilters.c       |    1 
> >  libavfilter/signature.h        |  554 ++++++++++++++++++++++++++++++
> >  libavfilter/signature_lookup.c |  550 ++++++++++++++++++++++++++++++
> >  libavfilter/version.h          |    4 
> >  libavfilter/vf_signature.c     |  741 +++++++++++++++++++++++++++++++++++++++++
> >  9 files changed, 1921 insertions(+), 2 deletions(-)
> > 9192f27ded45c607996b4e266b6746f807c9a7fd  0001-add-signature-filter-for-MPEG7-video-signature.patch
> > From 9646ed6f0cf78356cf2914a60705c98d8f21fe8a Mon Sep 17 00:00:00 2001
> > From: Gerion Entrup <gerion.entrup at flump.de>
> > Date: Sun, 20 Mar 2016 11:10:31 +0100
> > Subject: [PATCH] add signature filter for MPEG7 video signature
> > 
> > This filter does not implement all features of MPEG7. Missing features:
> > - compression of signature files
> > - work only on (cropped) parts of the video
> > ---
> >  Changelog                      |   1 +
> >  configure                      |   1 +
> >  doc/filters.texi               |  70 ++++
> >  libavfilter/Makefile           |   1 +
> >  libavfilter/allfilters.c       |   1 +
> >  libavfilter/signature.h        | 554 ++++++++++++++++++++++++++++++
> >  libavfilter/signature_lookup.c | 550 ++++++++++++++++++++++++++++++
> >  libavfilter/version.h          |   4 +-
> >  libavfilter/vf_signature.c     | 741 +++++++++++++++++++++++++++++++++++++++++
> >  9 files changed, 1921 insertions(+), 2 deletions(-)
> >  create mode 100644 libavfilter/signature.h
> >  create mode 100644 libavfilter/signature_lookup.c
> >  create mode 100644 libavfilter/vf_signature.c
> > 
> > diff --git a/Changelog b/Changelog
> > index 7b0187d..8a2b7fd 100644
> > --- a/Changelog
> > +++ b/Changelog
> > @@ -18,6 +18,7 @@ version <next>:
> >  - coreimage filter (GPU based image filtering on OSX)
> >  - libdcadec removed
> >  - bitstream filter for extracting DTS core
> > +- MPEG-7 Video Signature filter
> >  
> >  version 3.0:
> >  - Common Encryption (CENC) MP4 encoding and decoding support
> > diff --git a/configure b/configure
> > index e550547..fe29827 100755
> > --- a/configure
> > +++ b/configure
> > @@ -2979,6 +2979,7 @@ showspectrum_filter_deps="avcodec"
> >  showspectrum_filter_select="fft"
> >  showspectrumpic_filter_deps="avcodec"
> >  showspectrumpic_filter_select="fft"
> > +signature_filter_deps="gpl avcodec avformat"
> >  smartblur_filter_deps="gpl swscale"
> >  sofalizer_filter_deps="netcdf avcodec"
> >  sofalizer_filter_select="fft"
> > diff --git a/doc/filters.texi b/doc/filters.texi
> > index 5d6cf52..a95f5a7 100644
> > --- a/doc/filters.texi
> > +++ b/doc/filters.texi
> > @@ -11559,6 +11559,76 @@ saturation maximum: %@{metadata:lavfi.signalstats.SATMAX@}
> >  @end example
> >  @end itemize
> >  
> > + at anchor{signature}
> > + at section signature
> > +
> > +Calculates the MPEG-7 Video Signature. The filter could handle more than one
> > +input. In this case the matching between the inputs could be calculated. The
> > +filter passthrough the first input. The output is written in XML.
> > +
> > +It accepts the following options:
> > +
> > + at table @option
> > + at item mode
> 
> > +Enable the calculation of the matching. The option value must be 0 (to disable)
> > +or 1 (to enable). Optionally you can set the mode to 2. Then the detection ends
> > +as soon as the first matching sequence is reached. This should be slightly faster.
> > +Per default the detection is disabled.
> 
> these should probably support named identifiers, not (only) 0/1/2
done

> 
> 
> > +
> > + at item nb_inputs
> > +Set the number of inputs. The option value must be a non negative integer.
> > +Default value is 1.
> > +
> > + at item filename
> > +Set the path to which the output is written. If there is more than one input,
> > +the path must be a prototype, i.e. must contain %d or %0nd (where n is a positive
> > +integer), that will be replaced with the input number. If no filename is
> > +specified, no output will be written. This is the default.
> > +
> 
> > + at item xml
> > +Choose the output format. If set to 1 the filter will write XML, if set to 0
> > +the filter will write binary output. The default is 0.
> 
> format=xml/bin/whatever
> seems better as its more extensible
done

> 
> 
> > +
> > + at item th_d
> > +Set threshold to detect one word as similar. The option value must be an integer
> > +greater than zero. The default value is 9000.
> > +
> > + at item th_dc
> > +Set threshold to detect all words as similar. The option value must be an integer
> > +greater than zero. The default value is 60000.
> > +
> > + at item th_xh
> > +Set threshold to detect frames as similar. The option value must be an integer
> > +greater than zero. The default value is 116.
> > +
> > + at item th_di
> > +Set the minimum length of a sequence in frames to recognize it as matching
> > +sequence. The option value must be a non negative integer value.
> > +The default value is 0.
> > +
> > + at item th_it
> > +Set the minimum relation, that matching frames to all frames must have.
> > +The option value must be a double value between 0 and 1. The default value is 0.5.
> > + at end table
> > +
> > + at subsection Examples
> > +
> > + at itemize
> > + at item
> > +To calculate the signature of an input video and store it in signature.xml:
> > + at example
> > +ffmpeg -i input.mkv -vf signature=filename=signature.xml -map 0:v -c rawvideo -f null -
> > + at end example
> 
> the output seems to differ between 32 and 64bit x86
> this would make any regression testing rather difficult
> why is there a difference ? can this be avoided or would that result in
> some disadvantage ?
This is due to this line:
sum -= ((double) blocksum)/(blocksize * denum);

sum was a double. It seems the results differ between 32 and 64 bit in the fifth
decimal place. I have reworked the filter part so that it does not use double at all.
This also saves a few divisions, but the numbers get really big. The relevant
parts use int64_t.
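
To illustrate the idea of avoiding double: instead of dividing each block sum by its
size and subtracting the quotients, both sides can be brought to a common denominator
and compared as integers. This is only a minimal sketch, not the code from the patch;
the function name and its parameters are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: sign of (sum_a/size_a - sum_b/size_b), computed
 * without floating point by cross-multiplying. Both sizes must be positive
 * and the two products must fit into int64_t. */
static int compare_block_means(int64_t sum_a, int64_t size_a,
                               int64_t sum_b, int64_t size_b)
{
    int64_t lhs = sum_a * size_b;   /* sum_a/size_a scaled by size_a*size_b */
    int64_t rhs = sum_b * size_a;   /* sum_b/size_b scaled by size_a*size_b */
    return (lhs > rhs) - (lhs < rhs);
}
```

Because only integer multiplication is involved, the result should not depend on
how a platform evaluates floating point expressions.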

If the video gets really big, the numbers could overflow. Can I restrict this somehow?

An upper bound can be found with:
255 * BLOCK_LCM * (width/32+1)^2 * (height/32+1)^2 < 2^63
I tested it with 4K (UHD) input. This does not cause any problems, but it is near the limit.
(As a note: 4K in particular stays some way below the limit, because the width 3840 is
divisible by 32, so the square on the width term in the above formula can be dropped.)
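
One possible way to restrict this would be to evaluate the bound above once per input
and refuse frame sizes that exceed it. The following is a sketch under assumptions:
the function name is hypothetical, and the value of BLOCK_LCM is not given in this
mail, so it is passed in as a parameter.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if 255 * block_lcm * (w/32+1)^2 * (h/32+1)^2 stays below 2^63,
 * 0 otherwise. block_lcm stands in for BLOCK_LCM from the patch. */
static int signature_bound_ok(int width, int height, uint64_t block_lcm)
{
    const uint64_t limit = INT64_MAX;            /* 2^63 - 1 */
    const uint64_t bw = (uint64_t)width  / 32 + 1;
    const uint64_t bh = (uint64_t)height / 32 + 1;
    const uint64_t factors[] = { block_lcm, bw, bw, bh, bh };
    uint64_t acc = 255;
    size_t i;

    if (width <= 0 || height <= 0 || block_lcm == 0)
        return 0;
    for (i = 0; i < sizeof(factors) / sizeof(factors[0]); i++) {
        /* Refuse before the multiplication itself could overflow. */
        if (acc > limit / factors[i])
            return 0;
        acc *= factors[i];
    }
    return 1;
}
```

Such a check could run when the input is configured, so that oversized inputs are
rejected with an error instead of silently overflowing.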

The filter should now generate, on both 32 and 64 bit, the same signatures as the 64-bit build did before.

> 
> 
> [...]
> 
> > +static unsigned int intersection_word(uint8_t *first, uint8_t *second)
> > +{
> > +    unsigned int val=0,i;
> > +    for(i=0; i < 28; i+=4){
> > +        val += av_popcount( (first[i]   & second[i]  ) << 24 |
> > +                            (first[i+1] & second[i+1]) << 16 |
> > +                            (first[i+2] & second[i+2]) << 8  |
> > +                            (first[i+3] & second[i+3]) );
> 
> see AV_RN32() and similar functions
> (make sure that if you pick one that requires alignment that alignment
> is provided)
If I get it right, the four 8-bit numbers are concatenated to 32 bit (AV_RN32), then split
again into 16 bit (put_bits32) and finally written in 8-bit pieces into the stream (put_bits).

I've changed the code so that it uses put_bits(..., 8, ...) directly (this definitely simplifies
the code; I don't know whether it changes the performance).
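
For reference, the review suggestion for intersection_word() can be sketched in plain C:
read the 31 bytes in 32-bit chunks and count the set bits of the AND. The sketch uses
memcpy and a portable bit count in place of AV_RN32() and av_popcount() so that it
compiles without FFmpeg; the function names are illustrative only.

```c
#include <stdint.h>
#include <string.h>

/* Portable bit count, standing in for av_popcount(). */
static unsigned int popcount32(uint32_t x)
{
    unsigned int n = 0;
    for (; x; x &= x - 1)   /* clears the lowest set bit each round */
        n++;
    return n;
}

/* Counts the bits set in both 31-byte inputs. memcpy gives an
 * alignment-safe, native-endian 32-bit read; the byte order does not
 * matter for a bit count. */
static unsigned int intersection_word_sketch(const uint8_t *first,
                                             const uint8_t *second)
{
    unsigned int val = 0, i;

    for (i = 0; i < 28; i += 4) {
        uint32_t a, b;
        memcpy(&a, first  + i, sizeof(a));
        memcpy(&b, second + i, sizeof(b));
        val += popcount32(a & b);
    }
    for (; i < 31; i++)     /* remaining three bytes */
        val += popcount32((uint32_t)(first[i] & second[i]));
    return val;
}
```

The same structure should work for the union by replacing & with |.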

> 
> 
> > +    }
> > +    val += av_popcount( (first[28] & second[28]) << 16 |
> > +                        (first[29] & second[29]) << 8  |
> > +                        (first[30] & second[30]) );
> > +    return val;
> > +}
> > +
> 
> > +static unsigned int union_word(uint8_t *first, uint8_t *second)
> 
> unchanged pointer arguments could be marked as const
done

> 
> 
> [...]
> 
> > +static MatchingInfo* get_matching_parameters(AVFilterContext *ctx, SignatureContext *sc, FineSignature *first, FineSignature *second)
> > +{
> > +    FineSignature *f,*s;
> > +    size_t i, j, k, l, hmax = 0, score;
> > +    int framerate, offset, l1dist;
> > +    double m;
> > +    MatchingInfo *cands = NULL, *c = NULL;
> > +
> > +    struct {
> > +        uint8_t size;
> > +        unsigned int dist;
> > +        FineSignature *a;
> > +        uint8_t b_pos[COURSE_SIZE];
> > +        FineSignature *b[COURSE_SIZE];
> > +    } pairs[COURSE_SIZE];
> > +
> > +    struct {
> > +        int dist;
> > +        size_t score;
> > +        FineSignature *a;
> > +        FineSignature *b;
> > +    } hspace[MAX_FRAMERATE][2*HOUGH_MAX_OFFSET+1]; /* houghspace */
> 
> stack space is not unlimited, some platforms have rather little
> i dont know if above is too large but it might be
This is 60*(2*90+1) = 10860 entries, i.e. about 11K. It is taken from the reference code,
although the lookup itself is not standardized. Nevertheless, I don't see an easy way to
reduce this. The houghspace represents a matrix. Theoretically a list could be used
instead of a matrix, but this would lead to clearly more complicated code.

If you prefer, I can allocate the space on the heap. Maybe this solves the problem.
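
A heap variant could look roughly like the sketch below. It uses plain malloc()/free()
so that it compiles on its own; inside the filter the FFmpeg allocators would be used
instead. The names HoughCell, HoughRow and hough_alloc are made up for illustration,
the two constants are the values from the calculation above, and the loop initializes
the full row width of 2*HOUGH_MAX_OFFSET+1 cells.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_FRAMERATE    60   /* values taken from the figures in this mail */
#define HOUGH_MAX_OFFSET 90

typedef struct HoughCell {
    int dist;
    size_t score;
    const void *a;   /* FineSignature * in the patch */
    const void *b;
} HoughCell;

typedef HoughCell HoughRow[2 * HOUGH_MAX_OFFSET + 1];

/* Allocates the matrix on the heap and initializes every cell.
 * Returns NULL on allocation failure; release with free(). */
static HoughRow *hough_alloc(void)
{
    HoughRow *hspace = malloc(MAX_FRAMERATE * sizeof(*hspace));
    size_t i, j;

    if (!hspace)
        return NULL;
    for (i = 0; i < MAX_FRAMERATE; i++) {
        for (j = 0; j < 2 * HOUGH_MAX_OFFSET + 1; j++) {
            hspace[i][j].score = 0;
            hspace[i][j].dist  = 99999;
            hspace[i][j].a     = NULL;
            hspace[i][j].b     = NULL;
        }
    }
    return hspace;
}
```

The indexing hspace[framerate-1][offset+HOUGH_MAX_OFFSET] stays the same as with the
stack array, so the rest of the lookup would not need to change.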

> 
> 
> > +
> > +
> > +    /* l1 distances */
> > +    for(i = 0, f = first; i < COURSE_SIZE && f->next; i++, f = f->next){
> > +        pairs[i].size = 0;
> > +        pairs[i].dist = 99999;
> > +        pairs[i].a = f;
> > +        for(j = 0, s = second; j < COURSE_SIZE && s->next; j++, s = s->next){
> > +            /* l1 distance of finesignature */
> > +            l1dist = get_l1dist(ctx, sc, f->framesig, s->framesig);
> > +            if(l1dist < sc->thl1){
> > +                if(l1dist < pairs[i].dist){
> > +                    pairs[i].size = 1;
> > +                    pairs[i].dist = l1dist;
> > +                    pairs[i].b_pos[0] = j;
> > +                    pairs[i].b[0] = s;
> > +                } else if(l1dist == pairs[i].dist){
> > +                    pairs[i].b[pairs[i].size] = s;
> > +                    pairs[i].b_pos[pairs[i].size] = j;
> > +                    pairs[i].size++;
> > +                }
> > +            }
> > +        }
> > +    }
> > +    /* last incomplete coursesignature */
> > +    if(f->next == NULL){
> > +        for(; i < COURSE_SIZE; i++){
> > +            pairs[i].size = 0;
> > +            pairs[i].dist = 99999;
> > +        }
> > +    }
> > +
> > +    /* initialize houghspace */
> > +    for(i = 0; i < MAX_FRAMERATE; i++){
> > +        for(j = 0; j < HOUGH_MAX_OFFSET; j++){
> > +            hspace[i][j].score = 0;
> > +            hspace[i][j].dist = 99999;
> > +        }
> > +    }
> > +
> > +    /* hough transformation */
> > +    for(i = 0; i < COURSE_SIZE; i++){
> > +        for(j = 0; j < pairs[i].size; j++){
> > +            for(k = i+1; k < COURSE_SIZE; k++){
> > +                for(l = 0; l < pairs[k].size; l++){
> > +                    if(pairs[i].b[j] != pairs[k].b[l]){
> > +                        /* linear regression */
> > +                        m = (pairs[k].b_pos[l]-pairs[i].b_pos[j]) / (k-i); /* good value between 0.0 - 2.0 */
> > +                        framerate = (int) m*30 + 0.5; /* round up to 0 - 60 */
> > +                        if(framerate>0 && framerate <= MAX_FRAMERATE){
> > +                            offset = pairs[i].b_pos[j] - ((int) m*i + 0.5); /* only second part has to be rounded up */
> > +                            if(offset > -HOUGH_MAX_OFFSET && offset < HOUGH_MAX_OFFSET){
> > +                                if(pairs[i].dist < pairs[k].dist){
> > +                                    if(pairs[i].dist < hspace[framerate-1][offset+HOUGH_MAX_OFFSET].dist){
> > +                                        hspace[framerate-1][offset+HOUGH_MAX_OFFSET].dist = pairs[i].dist;
> > +                                        hspace[framerate-1][offset+HOUGH_MAX_OFFSET].a = pairs[i].a;
> > +                                        hspace[framerate-1][offset+HOUGH_MAX_OFFSET].b = pairs[i].b[j];
> > +                                    }
> > +                                } else {
> > +                                    if(pairs[k].dist < hspace[framerate-1][offset+HOUGH_MAX_OFFSET].dist){
> > +                                        hspace[framerate-1][offset+HOUGH_MAX_OFFSET].dist = pairs[k].dist;
> > +                                        hspace[framerate-1][offset+HOUGH_MAX_OFFSET].a = pairs[k].a;
> > +                                        hspace[framerate-1][offset+HOUGH_MAX_OFFSET].b = pairs[k].b[l];
> > +                                    }
> > +                                }
> > +
> > +                                score = hspace[framerate-1][offset+HOUGH_MAX_OFFSET].score + 1;
> > +                                if(score > hmax)
> > +                                    hmax = score;
> > +                                hspace[framerate-1][offset+HOUGH_MAX_OFFSET].score = score;
> > +                            }
> > +                        }
> > +                    }
> > +                }
> > +            }
> > +        }
> > +    }
> > +
> > +    if(hmax > 0){
> > +        hmax = (int) (0.7*hmax);
> > +        for(i = 0; i < MAX_FRAMERATE; i++){
> > +            for(j = 0; j < HOUGH_MAX_OFFSET; j++){
> > +                if(hmax < hspace[i][j].score){
> > +                    if(c == NULL){
> > +                        c = av_malloc(sizeof(MatchingInfo));
> > +                        if (!c)
> > +                            av_log(ctx, AV_LOG_FATAL, "Could not allocate memory");
> > +                        cands = c;
> > +                    } else {
> > +                        c->next = av_malloc(sizeof(MatchingInfo));
> > +                        if (!c->next)
> > +                            av_log(ctx, AV_LOG_FATAL, "Could not allocate memory");
> > +                        c = c->next;
> > +                    }
> > +                    c->framerateratio = (i+1.0) / 30;
> > +                    c->score = hspace[i][j].score;
> > +                    c->offset = j-90;
> > +                    c->first = hspace[i][j].a;
> > +                    c->second = hspace[i][j].b;
> > +                    c->next = NULL;
> > +
> > +                    /* not used */
> > +                    c->meandist = 0;
> > +                    c->matchframes = 0;
> > +                    c->whole = 0;
> > +                }
> > +            }
> > +        }
> > +    }
> > +    return cands;
> > +}
> > +
> 
> [...]
> > +static int filter_frame(AVFilterLink *inlink, AVFrame *picref)
> > +{
> > +    AVFilterContext *ctx = inlink->dst;
> > +    SignatureContext *sic = ctx->priv;
> > +    StreamContext *sc = &(sic->streamcontexts[FF_INLINK_IDX(inlink)]);
> > +    FineSignature* fs;
> > +
> > +    static const uint8_t pot3[5] = { 3*3*3*3, 3*3*3, 3*3, 3, 1 };
> > +    /* indexes of words : 210,217,219,274,334  44,175,233,270,273  57,70,103,237,269  100,285,295,337,354  101,102,111,275,296
> > +    s2usw = sorted to unsorted wordvec: 44 is at index 5, 57 at index 10...
> > +    */
> > +    static const unsigned int wordvec[25] = {44,57,70,100,101,102,103,111,175,210,217,219,233,237,269,270,273,274,275,285,295,296,334,337,354};
> > +    static const uint8_t s2usw[25]   = { 5,10,11, 15, 20, 21, 12, 22,  6,  0,  1,  2,  7, 13, 14,  8,  9,  3, 23, 16, 17, 24,  4, 18, 19};
> > +
> > +    uint8_t wordt2b[5] = { 0, 0, 0, 0, 0 }; /* word ternary to binary */
> > +    uint64_t intpic[32][32];
> > +    uint64_t rowcount;
> > +    uint8_t *p = picref->data[0];
> > +    int inti, intj;
> > +    int *intjlut;
> > +
> > +    double conflist[DIFFELEM_SIZE];
> > +    int f = 0, g = 0, w = 0;
> > +    int dh1 = 1, dh2 = 1, dw1 = 1, dw2 = 1, denum, a, b;
> > +    int i,j,k,ternary;
> > +    uint64_t blocksum;
> > +    int blocksize;
> > +    double th; /* threshold */
> > +    double sum;
> > +
> > +    /* initialize fs */
> > +    if(sc->curfinesig){
> > +        fs = av_mallocz(sizeof(FineSignature));
> > +        if (!fs)
> > +            return AVERROR(ENOMEM);
> > +        sc->curfinesig->next = fs;
> > +        fs->prev = sc->curfinesig;
> > +        sc->curfinesig = fs;
> > +    }else{
> > +        fs = sc->curfinesig = sc->finesiglist;
> > +        sc->curcoursesig1->first = fs;
> > +    }
> > +
> > +    fs->pts = picref->pts;
> > +    fs->index = sc->lastindex++;
> > +
> > +    memset(intpic, 0, sizeof(uint64_t)*32*32);
> > +    intjlut = av_malloc(inlink->w * sizeof(int));
> > +    if (!intjlut)
> > +        return AVERROR(ENOMEM);
> > +    for (i=0; i < inlink->w; i++){
> > +        intjlut[i] = (i<<5)/inlink->w;
> > +    }
> > +
> > +    for (i=0; i < inlink->h; i++){
> > +        inti = (i<<5)/inlink->h;
> > +        for (j=0; j< inlink->w; j++){
> > +            intj = intjlut[j];
> > +            intpic[inti][intj] += p[j];
> > +        }
> > +        p += picref->linesize[0];
> > +    }
> > +    av_free(intjlut);
> 
> av_freep() is safer as use of freed memory becomes harder and more
> noticeable
I don't fully understand the function. Could you explain a little? The problem is that I can
only use it in exactly this place. Replacing other occurrences of av_free results in a segfault.
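
As far as I understand it, av_freep() takes the address of the pointer variable, frees
the memory it points to and then sets the variable to NULL. The stand-in below mimics
that calling convention with plain free(); the name freep_sketch is made up and this is
not the FFmpeg implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in with the same calling convention as av_freep(): the argument is
 * the ADDRESS of the pointer variable, not the pointer itself. */
static void freep_sketch(void *arg)
{
    void *val;

    memcpy(&val, arg, sizeof(val));               /* fetch the stored pointer */
    memcpy(arg, &(void *){ NULL }, sizeof(val));  /* set the variable to NULL */
    free(val);
}
```

Usage would be freep_sketch(&intjlut), after which intjlut is NULL and a second call is
harmless. Passing the pointer itself, freep_sketch(intjlut), would treat the buffer
contents as a pointer, which might be one explanation for the segfault when av_free(x)
is replaced with av_freep(x) instead of av_freep(&x).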

> 
> 
> > +
> > +    /* The following calculate a summed area table (intpic) and brings the numbers
> > +     * in intpic to to the same denuminator.
> > +     * So you only have to handle the numinator in the following sections.
> > +     */
> > +    dh1 = inlink->h/32;
> > +    if (inlink->h%32)
> > +        dh2 = dh1 + 1;
> > +    dw1 = inlink->w/32;
> > +    if (inlink->w%32)
> > +        dw2 = dw1 + 1;
> > +    denum = dh1 * dh2 * dw1 * dw2;
> > +
> > +    for (i=0; i<32; i++){
> > +        rowcount = 0;
> > +        a = 1;
> > +        if (dh2 > 1) {
> > +            a = ((inlink->h*(i+1))%32 == 0) ? (inlink->h*(i+1))/32 - 1 : (inlink->h*(i+1))/32;
> > +            a -= ((inlink->h*i)%32 == 0) ? (inlink->h*i)/32 - 1 : (inlink->h*i)/32;
> > +            a = (a == dh1)? dh2 : dh1;
> > +        }
> > +        for (j=0; j<32; j++){
> > +            b = 1;
> > +            if (dw2 > 1) {
> > +                b = ((inlink->w*(j+1))%32 == 0) ? (inlink->w*(j+1))/32 - 1 : (inlink->w*(j+1))/32;
> > +                b -= ((inlink->w*j)%32 == 0) ? (inlink->w*j)/32 - 1 : (inlink->w*j)/32;
> > +                b = (b == dw1)? dw2 : dw1;
> > +            }
> > +            rowcount += intpic[i][j] *= a * b;
> > +            if(i>0){
> > +                intpic[i][j] = intpic[i-1][j] + rowcount;
> > +            } else {
> > +                intpic[i][j] = rowcount;
> > +            }
> > +        }
> > +    }
> > +
> > +    for (i=0; i< ELEMENT_COUNT; i++){
> > +        const ElemCat* elemcat = elements[i];
> 
> > +        double* elemsignature = av_malloc(sizeof(double) * elemcat->elem_count);
> > +        double* sortsignature = av_malloc(sizeof(double) * elemcat->elem_count);
> 
> av_malloc_array()
done

> 
> > +        if (!elemsignature || !sortsignature)
> > +            return AVERROR(ENOMEM);
> 
> memleak
hopefully fixed
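
The general pattern for avoiding this kind of leak is to release whichever of the two
buffers was allocated before returning the error. A minimal sketch with plain
malloc()/free(); the helper name alloc_scratch is hypothetical and the patch itself
may solve it differently.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates two scratch buffers of count doubles each. On failure nothing
 * stays allocated, both outputs are NULL and a negative errno value is
 * returned. */
static int alloc_scratch(size_t count, double **elem, double **sorted)
{
    *elem = *sorted = NULL;
    if (count == 0)
        return -EINVAL;
    if (count > SIZE_MAX / sizeof(double))   /* size would overflow */
        return -ENOMEM;

    *elem   = malloc(count * sizeof(**elem));
    *sorted = malloc(count * sizeof(**sorted));
    if (!*elem || !*sorted) {
        free(*elem);                         /* free(NULL) is allowed */
        free(*sorted);
        *elem = *sorted = NULL;
        return -ENOMEM;
    }
    return 0;
}
```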

From my previous mails:
> BTW is division by 2 optimized out or is it better to use >> 1 ?
> - The timebase of the test files is 90000. Unfortunately, the binary output only has
> room for a 16-bit number, so this doesn't fit. Currently the code simply crops the
> remaining bits. Is there a better solution (divide by some number etc.)?
It would be nice if you could comment.

I also added a few TODOs in the code for parts I am unsure about. It would be nice
if you could comment there, too.

I attached the new (complete) patch, the diff to the last time and the updated check script.

Gerion

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-add-signature-filter-for-MPEG7-video-signature.patch
Type: text/x-patch
Size: 78553 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20160411/ca381696/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: diff-to-last-revision.diff
Type: text/x-patch
Size: 24346 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20160411/ca381696/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: check.py
Type: text/x-python
Size: 10567 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20160411/ca381696/attachment.py>