[FFmpeg-soc] Review of dirac code

Marco Gerards mgerards at xs4all.nl
Thu Aug 16 23:30:37 CEST 2007


Michael Niedermayer <michaelni at gmx.at> writes:

Hi,

> On Thu, Aug 16, 2007 at 07:37:24PM +0200, Marco Gerards wrote:
> [...]
>> > [...]
>> >
>> > dirac_arith.h:
>> > [...]
>> >> typedef struct dirac_arith_state {
>> >>     /* Arithmetic decoding.  */
>> >>     unsigned int low;
>> >>     unsigned int range;
>> >>     unsigned int code;
>> >>     unsigned int bits_left;
>> >>     int carry;
>> >>     unsigned int contexts[ARITH_CONTEXT_COUNT];
>> >> 
>> >>     GetBitContext *gb;
>> >>     PutBitContext *pb;
>> >
>> > GetBitContext gb;
>> > PutBitContext pb;
>> > would avoid a pointer dereference if these 2 are used in the innermost
>> > loops, though maybe their use can be avoided by reading/writing 8bit
>> > at a time similar to how our cabac reader and range coder works
>> 
>> Can I just copy the GetBitContext, PutBitContext at initialization
>> time, or is there a more efficient way to do this?
>
> no, its ok

Do you mean it is ok to copy them?


>> 
>> As for using bytes, perhaps I could do this at a later time.  I think
>> this is quite complex and thus time consuming for me.  Or is this a
>> must have for you atm?
>
> no its not important, it can be done later

ok.

> [...]
>> >>     { 4, 0, 1, 12, 16, 12, 16, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 768  },
>> >>     { 4, 0, 1, 16, 24, 16, 24, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 1024 },
>> >>     { 4, 6, 1, 16, 24, 16, 24, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 1024 },
>> >>     { 4, 6, 0, 16, 24, 16, 24, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 1024 }
>> >> };
>> >
>> > maybe char/short or (u)intXY_t could be used in the structs for these tables
>> > so as to reduce the size of the tables (except for things accessed in the
>> > innermost loops, there int might be faster on some architectures ...)
>> 
>> Done.
>> 
>> Is there some rule of thumb to use for this?  So when to use ints so
>> it will be faster on some architectures vs. using small datatypes to
>> be more cache friendly?
>
> that is a good question
> i tend to use the smallest data type possible for arrays and int for
> everything else unless theres some other data type needed
>
> though thats just what i do, i dont know how often this is the best choice
> when in doubt and the code is important, benchmark (preferably on several
> architectures...)

Thanks, I'll keep that in mind :-)
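For illustration, that rule of thumb in practice might look like this (the struct and its values are made up, not the actual Dirac preset tables):

```c
#include <stdint.h>

/* Illustration of the rule of thumb above: smallest type that fits
   for table entries (cache friendly), plain int for the working
   variables.  The struct and values are made up, not the real
   Dirac tables. */
typedef struct {
    uint8_t  wavelet_depth;   /* tiny values, 8 bits suffice */
    uint16_t width, height;   /* fits comfortably in 16 bits */
} preset_params;

static const preset_params presets[] = {
    { 4,  720,  576 },
    { 4, 1280,  720 },
    { 4, 1920, 1080 },
};

static int preset_area(int idx)
{
    /* widen to int before the multiply */
    int w = presets[idx].width;
    int h = presets[idx].height;
    return w * h;
}
```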

>
> [...]
>> >
>> > [...]
>> >>     /* Framerate.  */
>> >>     if (get_bits(gb, 1)) {
>> >>         int idx = svq3_get_ue_golomb(gb);
>> >>         if (! idx) {
>> >>             s->source.frame_rate.num = svq3_get_ue_golomb(gb);
>> >>             s->source.frame_rate.den = svq3_get_ue_golomb(gb);
>> >>         } else {
>> >>             /* Use a pre-set framerate.  */
>> >>             s->source.frame_rate = preset_frame_rates[idx - 1];
>> >>         }
>> >
>> > idx should be checked against the table size
>> 
>> I still have to do this.  Why is this needed actually?  For broken
>> encoders or so?
>
> well if you dont check the code can segfault, and there may be situations
> where this can be annoying though its not really important
> there are probably plenty of ways to make libavcodec segfault if someone
> wanted to achieve that

ok
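A bounds check along those lines might look like this (the table contents here are made-up stand-ins for the real preset_frame_rates):

```c
#define ARRAY_ELEMS(a) ((int)(sizeof(a) / sizeof((a)[0])))

typedef struct { int num, den; } rational;

/* Made-up stand-in for the real preset table. */
static const rational preset_frame_rates[] = {
    { 24000, 1001 }, { 24, 1 }, { 25, 1 }, { 30000, 1001 }, { 30, 1 },
};

/* Reject out-of-range indices from broken or hostile bitstreams
   instead of reading past the end of the table. */
static int get_preset_frame_rate(rational *out, int idx)
{
    if (idx < 1 || idx > ARRAY_ELEMS(preset_frame_rates))
        return -1;
    *out = preset_frame_rates[idx - 1];
    return 0;
}
```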

> [...]
>> > [...]
>> >> static int inline coeff_quant_factor(int idx) {
>> >>     uint64_t base;
>> >>     idx = FFMAX(idx, 0);
>> >
>> > is <0 allowed? if no this should be checked for somewhere and lead to failure
>> > to decode the frame
>> 
>> I think it is.  At least this check is required by the specification.
>
> if its allowed then iam curious how it can become negative, except by
> a too large golomb_ue which happens to set the sign bit ...
> this sounds a little odd to be intended behavior ...

Yes, it seems impossible.  I'll just remove FFMAX, if you do not mind.

> [...]
>> >> 
>> >>     if (! magnitude)
>> >>         return 0;
>> >
>> > assuming qfactor is not 0 this check could be done before the multiply
>> 
>> It seems that it can become 0, although this seems a bit weird to me.
>
> i dont see how it could become 0 ...

You are right, I was reading too fast and thought "1 << 0 = 0".  I
will make this change.
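The reordered version could look roughly like this (the dequant arithmetic is simplified for illustration and is not the exact Dirac formula):

```c
/* Sketch of the reordering discussed above: return early for zero
   magnitudes so the multiply is skipped entirely (valid because
   qfactor is never 0).  The arithmetic is illustrative only. */
static int coeff_dequant(int magnitude, int qfactor, int qoffset)
{
    if (!magnitude)
        return 0;           /* nothing to scale, skip the multiply */
    return (magnitude * qfactor + qoffset) >> 2;
}
```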

> [...]
>> >> 
>> >>     /* Check if there is a zero to the left and top left of this
>> >>        coefficient.  */
>> >>     if (v > 0 && ((data[x + (y - 1) * s->padded_width])
>> >>                   || ( h > 0 && data[x + (y - 1) * s->padded_width - 1])))
>> >>         return 0;
>> >>     else if (h > 0 && data[x + y * s->padded_width - 1])
>> >>         return 0;
>> >> 
>> >>     return 1;
>> >
>> > if you make the data array slightly bigger and give it a 0 border 
>> > (this might be too hard)
>> > or if you have separate functions for the border and middle(h>0,v>0)
>> > then this would be just a
>> > return !(data[-s->padded_width]|data[-1]);
>> > assuming data has already been corrected to point to x+y*padded_width
>> 
>> In that case I would have to make two versions of the calling function,
>> right?  Or I would have to introduce an if there, but that would be
>> moving the problem around, I think?
>
> there are many possible solutions ...
> 2 functions is one
>
> a
> static inline zero_neighbourhood(..., int border){
> }
>
> which is called as zero_neighbourhood(..., 0) or 
>  zero_neighbourhood(..., 1)
>
> would be another one, here gcc should with some luck inline the function
> and optimize the part out which isnt needed ...

Ok, I'll try that.
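A sketch of that variant (argument names are mine; `data` is assumed to already point at x + y*stride, as Michael suggests):

```c
#include <stdint.h>

/* Sketch of the inline-with-constant-flags idea: when the call sites
   pass compile-time constants for h and v, gcc can inline this and
   drop the dead branches.  `data` already points at x + y*stride. */
static inline int zero_neighbourhood(const int16_t *data, int stride,
                                     int h, int v)
{
    if (h > 0 && v > 0)   /* interior: all three neighbours exist */
        return !(data[-stride] | data[-stride - 1] | data[-1]);
    if (v > 0)            /* left border: only the top neighbour */
        return !data[-stride];
    if (h > 0)            /* top border: only the left neighbour */
        return !data[-1];
    return 1;             /* top-left corner: nothing to check */
}
```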

> [...]
>> > [...]
>> >> /**
>> >>  * Predict the motion vector
>> >>  *
>> >>  * @param x    horizontal position of the MC block
>> >>  * @param y    vertical position of the MC block
>> >>  * @param ref reference frame
>> >>  * @param dir direction horizontal=0, vertical=1
>> >>  */
>> >> static int motion_vector_prediction(DiracContext *s, int x, int y,
>> >>                                     int ref, int dir) {
>> >>     int cnt = 0;
>> >>     int left = 0, top = 0, lefttop = 0;
>> >> 
>> >>     if (x > 0) {
>> >>         /* Test if the block to the left has a motion vector for this
>> >>            reference frame.  */
>> >>         if (!s->blmotion[y * s->blwidth + x - 1].use_global
>> >>             && s->blmotion[y * s->blwidth + x - 1].use_ref[ref]) {
>> >>             left = s->blmotion[y * s->blwidth + x - 1].vect[ref][dir];
>> >>             cnt++;
>> >>         }
>> >> 
>> >>         /* This is the only reference, return it.  */
>> >>         if (y == 0)
>> >>             return left;
>> >>     }
>> >> 
>> >>     if (y > 0) {
>> >>         /* Test if the block above the current one has a motion vector
>> >>            for this reference frame.  */
>> >>         if (!s->blmotion[(y - 1) * s->blwidth + x].use_global
>> >>             && s->blmotion[(y - 1) * s->blwidth + x].use_ref[ref]) {
>> >>             top = s->blmotion[(y - 1) * s->blwidth + x].vect[ref][dir];
>> >>             cnt++;
>> >>             }
>> >> 
>> >>         /* This is the only reference, return it.  */
>> >>         if (x == 0)
>> >>             return top;
>> >>     }
>> >> 
>> >>     if (x > 0 && y > 0) {
>> >>         /* Test if the block above the current one has a motion vector
>> >>            for this reference frame.  */
>> >>         if (!s->blmotion[(y - 1) * s->blwidth + x - 1].use_global
>> >>             && s->blmotion[(y - 1) * s->blwidth + x - 1].use_ref[ref]) {
>> >>             lefttop = s->blmotion[(y - 1) * s->blwidth + x - 1].vect[ref][dir];
>> >>             cnt++;
>> >>         }
>> >>     }
>> >> 
>> >>     /* No references for the prediction.  */
>> >>     if (cnt == 0)
>> >>         return 0;
>> >> 
>> >>     if (cnt == 1)
>> >>         return left + top + lefttop;
>> >> 
>> >>     /* Return the mean of two motion vectors.  */
>> >>     if (cnt == 2)
>> >>         return (left + top + lefttop + 1) >> 1;
>> >> 
>> >>     /* Return the median of three motion vectors.  */
>> >>     return mid_pred(left, top, lefttop);
>> >> }
>> >
>> > all the checks above are done twice, once with dir=0 then dir=1 maybe that
>> > can be avoided though it seems diracs design doesnt make that easy :(
>> 
>> I am afraid that won't improve much and just makes the code unreadable
>> :-/.  Would you mind if I leave the current code like it is?
>
> leave it, but flame the dirac devels for the crappy design they should
> store the dir=0 and dir=1 values nicely in pairs not oddly partitioned

:-)
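For reference, the median-of-three used in the last branch, written out here for illustration (behaviourally equivalent to FFmpeg's mid_pred()):

```c
/* Median of three values, as used for the cnt == 3 case above.
   Written out for clarity; FFmpeg's mid_pred() does the same job. */
static int mid_pred3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }  /* now a <= b */
    if (b > c)
        b = c;                               /* b = min(b, c) */
    return a > b ? a : b;                    /* max(a, min-ish) = median */
}
```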

> [...]
>> > [...]
>> >> /**
>> >>  * IDWT transform (5,3) for a specific subband
>> >>  *
>> >>  * @param data coefficients to transform
>> >>  * @param level level of the current transform
>> >>  * @return 0 when successful, otherwise -1 is returned
>> >>  */
>> >> static int dirac_subband_idwt_53(DiracContext *s,
>> >>                                  int16_t *data, int level) {
>> >
>> > the whole (inverse) wavelet transform related code could be split into its
>> > own file
>> 
>> What is the filename you propose?  dirac_wavelet.[c]h or so?  Or would
>> something more generic be more appropriate so snow and jpeg2000 can
>> share code or at least a file?
>
> its very easy to change the filename later in svn as well as git so it
> really doesnt matter dirac_wavelet.c/h sounds ok

ok

>> 
>> >>     int16_t *synth, *synthline;
>> >>     int x, y;
>> >>     int width = subband_width(s, level);
>> >>     int height = subband_height(s, level);
>> >>     int synth_width = width  << 1;
>> >>     int synth_height = height << 1;
>> >> 
>> >> START_TIMER
>> >> 
>> >>     synth = av_malloc(synth_width * synth_height * sizeof(int16_t));
>> >>     if (!synth) {
>> >>         av_log(s->avctx, AV_LOG_ERROR, "av_malloc() failed\n");
>> >>         return -1;
>> >>     }
>> >
>> > theres no need to do a malloc() during idwt
>> 
>> It is used for IDWT reordering.  Do you mean that this can be done in
>> place?
>
> i meant that the malloc() can be done during init instead of during
> decode_frame()

ok
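Moving the allocation to init could look like this (a generic sketch with a made-up scratch struct, not the actual DiracContext and av_malloc):

```c
#include <stdlib.h>
#include <stdint.h>

/* Generic sketch: allocate the synthesis scratch buffer once, sized
   for the largest subband (the full padded frame), and reuse it for
   every idwt call instead of allocating per subband per frame.
   The struct is made up for illustration, not the real DiracContext. */
typedef struct {
    int16_t *synth;
    int max_width, max_height;
} idwt_scratch;

static int idwt_scratch_init(idwt_scratch *s, int max_width, int max_height)
{
    s->max_width  = max_width;
    s->max_height = max_height;
    s->synth = malloc(max_width * max_height * sizeof(*s->synth));
    return s->synth ? 0 : -1;   /* fail init, not decode_frame() */
}

static void idwt_scratch_end(idwt_scratch *s)
{
    free(s->synth);
    s->synth = NULL;
}
```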

> and yes it can also be done in place too or the reordering can be
> done during the lifting ...

This sounds interesting, I will look into that.

>> 
>> >
>> >> 
>> >>     dirac_subband_idwt_reorder(s, data, synth, level);
>> >> 
>> >>     /* LeGall(5,3)
>> >>        First lifting step)
>> >>        Even, predict, s=5, t_{-1}=-1, t_0=9, t_1=9, t_2=-1:
>> >>          A[2*n]   -= (-A[2*n-1] + A[2*n+1] + 2) >> 2
>> >> 
>> >>        Second lifting step)
>> >>        Odd, update, s=1, t_0=1, t_1=1:
>> >>          A[2*n+1] += (A[2*n] + A[2*n+2] + 1) >> 1
>> >>     */
>> >> 
>> >>     /* Vertical synthesis: Lifting stage 1.  */
>> >>     synthline = synth;
>> >>     for (x = 0; x < synth_width; x++) {
>> >>         synthline[x] -= (synthline[synth_width + x]
>> >>                        + synthline[synth_width + x]
>> >>                        + 2) >> 2;
>> >>     }
>> >>     synthline = synth + (synth_width << 1);
>> >>     for (y = 1; y < height - 1; y++) {
>> >>         for (x = 0; x < synth_width; x++) {
>> >>             synthline[x] -= (synthline[x - synth_width]
>> >>                            + synthline[x + synth_width]
>> >>                            + 2) >> 2;
>> >>         }
>> >>         synthline += (synth_width << 1);
>> >>     }
>> >>     synthline = synth + (synth_height - 2) * synth_width;
>> >>     for (x = 0; x < synth_width; x++) {
>> >>         synthline[x] -= (synthline[x - synth_width]
>> >>                        + synthline[x + synth_width]
>> >>                        + 2) >> 2;
>> >>     }
>> >> 
>> >>     /* Vertical synthesis: Lifting stage 2.  */
>> >>     synthline = synth + synth_width;
>> >>     for (x = 0; x < synth_width; x++)
>> >>         synthline[x] += (synthline[x - synth_width]
>> >>                        + synthline[x + synth_width]
>> >>                        + 1) >> 1;
>> >>     synthline = synth + (synth_width << 1);
>> >>     for (y = 1; y < height - 1; y++) {
>> >>         for (x = 0; x < synth_width; x++) {
>> >>             synthline[x + synth_width] += (synthline[x]
>> >>                                          + synthline[x + synth_width * 2]
>> >>                                          + 1) >> 1;
>> >>         }
>> >>         synthline += (synth_width << 1);
>> >>     }
>> >>     synthline = synth + (synth_height - 1) * synth_width;
>> >>     for (x = 0; x < synth_width; x++)
>> >>         synthline[x] += (synthline[x - synth_width]
>> >>                        + synthline[x - synth_width]
>> >>                        + 1) >> 1;
>> >> 
>> >> 
>> >>     /* Horizontal synthesis.  */
>> >>     synthline = synth;
>> >>     for (y = 0; y < synth_height; y++) {
>> >> 
>> >>         /* Lifting stage 1.  */
>> >>         synthline[0] -= (synthline[1]
>> >>                        + synthline[1]
>> >>                        + 2) >> 2;
>> >>         for (x = 1; x < width - 1; x++) {
>> >>             synthline[2*x] -= (synthline[2*x - 1]
>> >>                              + synthline[2*x + 1]
>> >>                              + 2) >> 2;
>> >>         }
>> >>         synthline[synth_width - 2] -= (synthline[synth_width - 3]
>> >>                                      + synthline[synth_width - 1]
>> >>                                      + 2) >> 2;
>> >> 
>> >>         /* Lifting stage 2.  */
>> >>         for (x = 0; x < width - 1; x++) {
>> >>             synthline[2*x + 1] += (synthline[2*x]
>> >>                                  + synthline[2*x + 2]
>> >>                                  + 1) >> 1;
>> >>         }
>> >>         synthline[synth_width - 1] += (synthline[synth_width - 2]
>> >>                                      + synthline[synth_width - 2]
>> >>                                      + 1) >> 1;
>> >>         synthline += synth_width;
>> >>     }
>> >
>> > as already said the stages can be interleaved so that only a single pass
>> > is done over the data, this should significantly speed the code up
>> > as its likely limited by memory speed ...
>> 
>> Yes...  I now tried this (for 5/3), but I am afraid I did this the
>> wrong way.  I assume I have to be very careful and apply the second
>> stage only for lines that will not be used by the first stage anymore?
>
> yes, the stuff must be done in order

Ok.
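In 1-D the ordering constraint works out like this: once even sample 2n has been through stage 1, odd sample 2n-1 has both of its stage-1 neighbours and can be finished in the same loop iteration. A sketch of the idea (1-D only, width >= 2, not the actual patch code), with a two-pass version to check against:

```c
#include <stdint.h>

/* Two-pass 1-D inverse LeGall(5,3) lifting with mirrored borders,
   as in the patch.  width = number of even samples, len = 2*width. */
static void idwt53_1d_twopass(int16_t *A, int width)
{
    int n, len = 2 * width;
    /* stage 1: even samples */
    A[0] -= (A[1] + A[1] + 2) >> 2;
    for (n = 1; n < width; n++)
        A[2*n] -= (A[2*n - 1] + A[2*n + 1] + 2) >> 2;
    /* stage 2: odd samples */
    for (n = 0; n < width - 1; n++)
        A[2*n + 1] += (A[2*n] + A[2*n + 2] + 1) >> 1;
    A[len - 1] += (A[len - 2] + A[len - 2] + 1) >> 1;
}

/* Same transform in a single pass: after stage 1 updates even
   sample 2n, odd sample 2n-1 has both stage-1 neighbours ready,
   so both stages advance together and the data is touched once. */
static void idwt53_1d_onepass(int16_t *A, int width)
{
    int n, len = 2 * width;
    A[0] -= (A[1] + A[1] + 2) >> 2;
    for (n = 1; n < width; n++) {
        A[2*n]     -= (A[2*n - 1] + A[2*n + 1] + 2) >> 2;  /* stage 1 */
        A[2*n - 1] += (A[2*n - 2] + A[2*n] + 1) >> 1;      /* stage 2 */
    }
    A[len - 1] += (A[len - 2] + A[len - 2] + 1) >> 1;
}
```

The same argument carries over to the vertical pass, where stage 2 on a line may only run once stage 1 has finished the line below it.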

> [...]
>> >
>> > this is VERY inefficient, most of what is done here does not need to be done
>> > per pixel or can be precalculated
>> 
>> Absolutely!  gprof agrees with you ;-)
>> 
>> Actually, this comment applies to all code above, I assume.
>> 
>>  time   seconds   seconds    calls  ms/call  ms/call  name    
>>  40.85      2.41     2.41 88939320     0.00     0.00  upconvert
>>  29.49      4.15     1.74       50    34.80   113.59  dirac_decode_frame
>>  10.34      4.76     0.61   417867     0.00     0.00  motion_comp_block1ref
>>   6.78      5.16     0.40      540     0.74     0.74  dirac_subband_idwt_53
>>   4.24      5.41     0.25  8689089     0.00     0.00  coeff_unpack
>>   4.07      5.65     0.24  8707719     0.00     0.00  dirac_arith_read_uint
>> 
>> Luca and I talked about if I would concentrate on further
>> optimizations or on the encoder and he said he preferred having me
>> working on the encoder for now, because I know Dirac quite well by now
>> (Luca, please tell me if I am wrong in any way).
>> 
>> So there is quite some room for optimizations.  In the specification,
>> the used pseudo code looped over the pixels.  My current code at least
>> doesn't do that.  But I agree, the MC code is where a lot of
>> optimizations can be made:
>> 
>> - The qpel/eightpel interpolation code is very inefficient and perhaps
>>   has to be merged with motion_comp_block1ref and
>>   motion_comp_block2ref.  In that case it can be optimized.  We
>>   especially have to take care of border cases to avoid clipping, more
>>   or less like I did for halfpel interpolation.
>
> simply making upconvert() work not with 1 pixel but with a block of pixels
> should make it a lot more efficient

I moved this into motion_comp_block1ref/motion_comp_block2ref.  This
works quite well and boosts the speed of the decoder by 50% and makes
further optimizations easier.  Hopefully you do not have problems with
this approach.

>> Actually, this is the code which I dislike the most.  But we had to
>> make a choise on how to spend the time.  Is this a showstopper for
>> you?
>
> no, though its something which should be optimized its likely the worst
> part of the dirac code ATM

The latest commits make this code work at a reasonable speed.  There is
still a lot of room for improvement, but it is not as terrible as it
used to be.  I will improve this further after SoC.

>> I hope people will be interested in optimizing my code, vectorizing
>> it, etc. after summer of code.  I certainly am as much as time permits
>> (I have to graduate soon...).  So after SoC, feel free to add me to
>> the MAINTAINERS list for Dirac, if you think I am up to it and if
>> you'd like that.
>
> add yourself now to maintainers, you have svn write access :)

Sure, but first we need something to maintain in svn, so I'll do this
after SoC.

Thanks,
Marco



