[FFmpeg-soc] Review of dirac code
Marco Gerards
mgerards at xs4all.nl
Thu Aug 16 23:30:37 CEST 2007
Michael Niedermayer <michaelni at gmx.at> writes:
Hi,
> On Thu, Aug 16, 2007 at 07:37:24PM +0200, Marco Gerards wrote:
> [...]
>> > [...]
>> >
>> > dirac_arith.h:
>> > [...]
>> >> typedef struct dirac_arith_state {
>> >> /* Arithmetic decoding. */
>> >> unsigned int low;
>> >> unsigned int range;
>> >> unsigned int code;
>> >> unsigned int bits_left;
>> >> int carry;
>> >> unsigned int contexts[ARITH_CONTEXT_COUNT];
>> >>
>> >> GetBitContext *gb;
>> >> PutBitContext *pb;
>> >
>> > GetBitContext gb;
>> > PutBitContext pb;
>> > would avoid a pointer dereference if these 2 are used in the innermost
>> > loops, though maybe their use can be avoided by reading/writing 8bit
>> > at a time similar to how our cabac reader and range coder works
>>
>> Can I just copy the GetBitContext, PutBitContext at initialization
>> time, or is there a more efficient way to do this?
>
> no, its ok
Do you mean it is ok to copy them?
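As a minimal sketch of what copying at init time could look like (the struct here is a hypothetical stand-in for the real GetBitContext, not FFmpeg's actual definition):

```c
#include <assert.h>

/* Stand-in for GetBitContext: just enough state for the sketch. */
typedef struct GetBitContext {
    const unsigned char *buffer;
    int index;              /* current bit position */
} GetBitContext;

typedef struct dirac_arith_state {
    GetBitContext gb;       /* embedded, not a pointer: no extra
                               dereference in the inner decode loop */
} dirac_arith_state;

/* Copy the caller's context once at init; the arith decoder then
 * works on its own copy. */
static void dirac_arith_init(dirac_arith_state *s, const GetBitContext *gb)
{
    s->gb = *gb;            /* plain struct assignment */
}
```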
>>
>> As for using bytes, perhaps I could do this at a later time. I think
>> this is quite complex and thus time consuming for me. Or is this a
>> must have for you atm?
>
> no its not important, it can be done later
ok.
> [...]
>> >> { 4, 0, 1, 12, 16, 12, 16, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 768 },
>> >> { 4, 0, 1, 16, 24, 16, 24, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 1024 },
>> >> { 4, 6, 1, 16, 24, 16, 24, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 1024 },
>> >> { 4, 6, 0, 16, 24, 16, 24, 2, 1, 1, 1, 1, 1, 4, 3, 1, 1, 8, 6, 12, 8, 48, 48, 1024 }
>> >> };
>> >
>> > maybe char/short or (u)intXY_t could be used in the structs for these tables
>> > so as to reduce the size of the tables (except for things accessed in the
>> > innermost loops, there int might be faster on some architectures ...)
>>
>> Done.
>>
>> Is there some rule of thumb to use for this?  So when to use ints so
>> it will be faster on some architectures vs. using small data types to
>> be more cache friendly?
>
> that is a good question
> i tend to use the smallest data type possible for arrays and int for
> everything else unless theres some other data type needed
>
> though thats just what i do, i dont know how often this is the best choice
> when in doubt and the code is important, benchmark (preferably on several
> architectures...)
Thanks, I'll keep that in mind :-)
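A concrete sketch of that rule of thumb (the table values are illustrative, not the real Dirac presets): narrow types for table storage, plain int for the working variables.

```c
#include <assert.h>
#include <stdint.h>

/* Storage: the smallest type that fits keeps the whole table
 * cache-resident. */
static const uint8_t quant_shift[8] = { 0, 1, 1, 2, 2, 3, 3, 4 };

static int scale_coeff(int coeff, int idx)
{
    /* Arithmetic: plain int, the native word size, which is usually
     * at least as fast as narrow types for scalar computation. */
    int shift = quant_shift[idx & 7];
    return coeff << shift;
}
```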
>
> [...]
>> >
>> > [...]
>> >> /* Framerate. */
>> >> if (get_bits(gb, 1)) {
>> >> int idx = svq3_get_ue_golomb(gb);
>> >> if (! idx) {
>> >> s->source.frame_rate.num = svq3_get_ue_golomb(gb);
>> >> s->source.frame_rate.den = svq3_get_ue_golomb(gb);
>> >> } else {
>> >> /* Use a pre-set framerate. */
>> >> s->source.frame_rate = preset_frame_rates[idx - 1];
>> >> }
>> >
>> > idx should be checked against the table size
>>
>> I still have to do this. Why is this needed actually? For broken
>> encoders or so?
>
> well if you dont check the code can segfault, and there may be situations
> where this can be annoying though its not really important
> there are probably plenty of ways to make libavcodec segfault if someone
> wanted to achieve that
ok
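A hedged sketch of that check (the table values and names are illustrative; in libavcodec the element count would typically come from a macro like the one below):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int num, den; } Rational;

/* Illustrative presets, not the real Dirac table. */
static const Rational preset_frame_rates[] = {
    { 24000, 1001 }, { 24, 1 }, { 25, 1 }, { 30000, 1001 }, { 30, 1 },
};

#define ARRAY_ELEMS(a) (sizeof(a) / sizeof((a)[0]))

/* idx comes straight from the bitstream: 0 means "explicit num/den
 * follow", 1..N selects a preset.  Anything larger is a broken stream
 * and must fail instead of reading past the end of the table. */
static int set_preset_frame_rate(Rational *out, unsigned idx)
{
    if (idx - 1 >= ARRAY_ELEMS(preset_frame_rates))
        return -1;      /* catches both idx == 0 and idx too large */
    *out = preset_frame_rates[idx - 1];
    return 0;
}
```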
> [...]
>> > [...]
>> >> static int inline coeff_quant_factor(int idx) {
>> >> uint64_t base;
>> >> idx = FFMAX(idx, 0);
>> >
>> > is <0 allowed? if no this should be checked for somewhere and lead to failure
>> > to decode the frame
>>
>> I think it is. At least this check is required by the specification.
>
> if its allowed then iam curious how it can become negative, except by
> a too large golomb_ue which happens to set the sign bit ...
> this sounds a little odd to be intended behavior ...
Yes, it seems impossible. I'll just remove FFMAX, if you do not mind.
> [...]
>> >>
>> >> if (! magnitude)
>> >> return 0;
>> >
>> > assuming qfactor is not 0 this check could be done before the multiply
>>
>> It seems that it can become 0, although this seems a bit weird to me.
>
> i dont see how it could become 0 ...
You are right, I was reading too fast and thought "1 << 0 = 0". I
will make this change.
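A sketch of the reordering being discussed: with qfactor known to be nonzero, a zero magnitude stays zero after scaling, so the early return can move before the multiply. The exact dequantisation formula below is illustrative, not copied from the spec.

```c
#include <assert.h>

static int coeff_dequant(int magnitude, int qfactor, int qoffset)
{
    if (!magnitude)
        return 0;                       /* skip the multiply entirely */
    return (magnitude * qfactor + qoffset) >> 2;
}
```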
> [...]
>> >>
>> >> /* Check if there is a zero to the left and top left of this
>> >> coefficient. */
>> >> if (v > 0 && ((data[x + (y - 1) * s->padded_width])
>> >> || ( h > 0 && data[x + (y - 1) * s->padded_width - 1])))
>> >> return 0;
>> >> else if (h > 0 && data[x + y * s->padded_width - 1])
>> >> return 0;
>> >>
>> >> return 1;
>> >
>> > if you make the data array slightly bigger and give it a 0 border
>> > (this might be too hard)
>> > or if you have separate functions for the border and middle(h>0,v>0)
>> > then this would be just a
>> > return !(data[-s->padded_width]|data[-1]);
>> > assuming data has already been corrected to point to x+y*padded_width
>>
>> In that case I would have to make two versions of the calling function,
>> right? Or I would have to introduce an if there, but that would be
>> moving the problem around, I think?
>
> there are many possible solutions ...
> 2 functions is one
>
> a
> static inline zero_neighbourhood(..., int border){
> }
>
> which is called as zero_neighbourhood(..., 0) or
> zero_neighbourhood(..., 1)
>
> would be another one, here gcc should with some luck inline the function
> and optimize the part out which isnt needed ...
Ok, I'll try that.
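A sketch of that suggestion: one inline function with `border` passed as a literal 0 or 1 at each call site, so the compiler can discard the dead branch after inlining. The real predicate distinguishes separate horizontal and vertical border cases; this collapses them for brevity, and the names are illustrative.

```c
#include <assert.h>

static inline int zero_neighbourhood(const short *data, int stride,
                                     int border)
{
    if (border)                     /* leftmost column: no x-1 sample */
        return !data[-stride];
    /* middle: both the top and the left neighbour exist */
    return !(data[-stride] | data[-1]);
}
```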
> [...]
>> > [...]
>> >> /**
>> >> * Predict the motion vector
>> >> *
>> >> * @param x horizontal position of the MC block
>> >> * @param y vertical position of the MC block
>> >> * @param ref reference frame
>> >> * @param dir direction horizontal=0, vertical=1
>> >> */
>> >> static int motion_vector_prediction(DiracContext *s, int x, int y,
>> >> int ref, int dir) {
>> >> int cnt = 0;
>> >> int left = 0, top = 0, lefttop = 0;
>> >>
>> >> if (x > 0) {
>> >> /* Test if the block to the left has a motion vector for this
>> >> reference frame. */
>> >> if (!s->blmotion[y * s->blwidth + x - 1].use_global
>> >> && s->blmotion[y * s->blwidth + x - 1].use_ref[ref]) {
>> >> left = s->blmotion[y * s->blwidth + x - 1].vect[ref][dir];
>> >> cnt++;
>> >> }
>> >>
>> >> /* This is the only reference, return it. */
>> >> if (y == 0)
>> >> return left;
>> >> }
>> >>
>> >> if (y > 0) {
>> >> /* Test if the block above the current one has a motion vector
>> >> for this reference frame. */
>> >> if (!s->blmotion[(y - 1) * s->blwidth + x].use_global
>> >> && s->blmotion[(y - 1) * s->blwidth + x].use_ref[ref]) {
>> >> top = s->blmotion[(y - 1) * s->blwidth + x].vect[ref][dir];
>> >> cnt++;
>> >> }
>> >>
>> >> /* This is the only reference, return it. */
>> >> if (x == 0)
>> >> return top;
>> >> }
>> >>
>> >> if (x > 0 && y > 0) {
>> >> /* Test if the block above the current one has a motion vector
>> >> for this reference frame. */
>> >> if (!s->blmotion[(y - 1) * s->blwidth + x - 1].use_global
>> >> && s->blmotion[(y - 1) * s->blwidth + x - 1].use_ref[ref]) {
>> >> lefttop = s->blmotion[(y - 1) * s->blwidth + x - 1].vect[ref][dir];
>> >> cnt++;
>> >> }
>> >> }
>> >>
>> >> /* No references for the prediction. */
>> >> if (cnt == 0)
>> >> return 0;
>> >>
>> >> if (cnt == 1)
>> >> return left + top + lefttop;
>> >>
>> >> /* Return the median of two motion vectors. */
>> >> if (cnt == 2)
>> >> return (left + top + lefttop + 1) >> 1;
>> >>
>> >> /* Return the median of three motion vectors. */
>> >> return mid_pred(left, top, lefttop);
>> >> }
>> >
>> > all the checks above are done twice, once with dir=0 then dir=1 maybe that
>> > can be avoided though it seems diracs design doesnt make that easy :(
>>
>> I am afraid that won't improve much and just makes the code unreadable
>> :-/. Would you mind if I leave the current code like it is?
>
> leave it, but flame the dirac devels for the crappy design they should
> store the dir=0 and dir=1 values nicely in pairs not oddly partitioned
:-)
> [...]
>> > [...]
>> >> /**
>> >> * IDWT transform (5,3) for a specific subband
>> >> *
>> >> * @param data coefficients to transform
>> >> * @param level level of the current transform
>> >> * @return 0 when successful, otherwise -1 is returned
>> >> */
>> >> static int dirac_subband_idwt_53(DiracContext *s,
>> >> int16_t *data, int level) {
>> >
>> > the whole (inverse) wavelet transform related code could be split into its
>> > own file
>>
>> What is the filename you propose? dirac_wavelet.[c]h or so? Or would
>> something more generic be more appropriate so snow and jpeg2000 can
>> share code or at least a file?
>
> its very easy to change the filename later in svn as well as git so it
> really doesnt matter dirac_wavelet.c/h sounds ok
ok
>>
>> >> int16_t *synth, *synthline;
>> >> int x, y;
>> >> int width = subband_width(s, level);
>> >> int height = subband_height(s, level);
>> >> int synth_width = width << 1;
>> >> int synth_height = height << 1;
>> >>
>> >> START_TIMER
>> >>
>> >> synth = av_malloc(synth_width * synth_height * sizeof(int16_t));
>> >> if (!synth) {
>> >> av_log(s->avctx, AV_LOG_ERROR, "av_malloc() failed\n");
>> >> return -1;
>> >> }
>> >
>> > theres no need to do a malloc() during idwt
>>
>> It is used for IDWT reordering. Do you mean that this can be done in
>> place?
>
> i meant that the malloc() can be done during init instead of during
> decode_frame()
ok
> and yes it can also be done in place too or the reordering can be
> done during the lifting ...
This sounds interesting, I will look into that.
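A sketch of the first suggestion, moving the scratch allocation out of the per-frame path: allocate once for the largest subband in init, reuse it in every decode_frame() call, free it in close. Struct and function names here are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct WaveletContext {
    int16_t *synth;       /* IDWT scratch buffer, allocated once */
    size_t   synth_size;
} WaveletContext;

static int wavelet_init(WaveletContext *w, int max_w, int max_h)
{
    w->synth_size = (size_t)max_w * max_h * sizeof(int16_t);
    w->synth = malloc(w->synth_size);
    return w->synth ? 0 : -1;
}

static void wavelet_close(WaveletContext *w)
{
    free(w->synth);
    w->synth = NULL;
}
```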
>>
>> >
>> >>
>> >> dirac_subband_idwt_reorder(s, data, synth, level);
>> >>
>> >> /* LeGall(5,3)
>> >> First lifting step)
>> >> Even, predict, s=5, t_{-1}=-1, t_0=9, t_1=9, t_2=-1:
>> >> A[2*n] -= (-A[2*n-1] + A[2*n+1] + 2) >> 2
>> >>
>> >> Second lifting step)
>> >> Odd, update, s=1, t_0=1, t_1=1:
>> >> A[2*n+1] += (A[2*n] + A[2*n+2] + 1) >> 1
>> >> */
>> >>
>> >> /* Vertical synthesis: Lifting stage 1. */
>> >> synthline = synth;
>> >> for (x = 0; x < synth_width; x++) {
>> >> synthline[x] -= (synthline[synth_width + x]
>> >> + synthline[synth_width + x]
>> >> + 2) >> 2;
>> >> }
>> >> synthline = synth + (synth_width << 1);
>> >> for (y = 1; y < height - 1; y++) {
>> >> for (x = 0; x < synth_width; x++) {
>> >> synthline[x] -= (synthline[x - synth_width]
>> >> + synthline[x + synth_width]
>> >> + 2) >> 2;
>> >> }
>> >> synthline += (synth_width << 1);
>> >> }
>> >> synthline = synth + (synth_height - 2) * synth_width;
>> >> for (x = 0; x < synth_width; x++) {
>> >> synthline[x] -= (synthline[x - synth_width]
>> >> + synthline[x + synth_width]
>> >> + 2) >> 2;
>> >> }
>> >>
>> >> /* Vertical synthesis: Lifting stage 2. */
>> >> synthline = synth + synth_width;
>> >> for (x = 0; x < synth_width; x++)
>> >> synthline[x] += (synthline[x - synth_width]
>> >> + synthline[x + synth_width]
>> >> + 1) >> 1;
>> >> synthline = synth + (synth_width << 1);
>> >> for (y = 1; y < height - 1; y++) {
>> >> for (x = 0; x < synth_width; x++) {
>> >> synthline[x + synth_width] += (synthline[x]
>> >> + synthline[x + synth_width * 2]
>> >> + 1) >> 1;
>> >> }
>> >> synthline += (synth_width << 1);
>> >> }
>> >> synthline = synth + (synth_height - 1) * synth_width;
>> >> for (x = 0; x < synth_width; x++)
>> >> synthline[x] += (synthline[x - synth_width]
>> >> + synthline[x - synth_width]
>> >> + 1) >> 1;
>> >>
>> >>
>> >> /* Horizontal synthesis. */
>> >> synthline = synth;
>> >> for (y = 0; y < synth_height; y++) {
>> >>
>> >> /* Lifting stage 1. */
>> >> synthline[0] -= (synthline[1]
>> >> + synthline[1]
>> >> + 2) >> 2;
>> >> for (x = 1; x < width - 1; x++) {
>> >> synthline[2*x] -= (synthline[2*x - 1]
>> >> + synthline[2*x + 1]
>> >> + 2) >> 2;
>> >> }
>> >> synthline[synth_width - 2] -= (synthline[synth_width - 3]
>> >> + synthline[synth_width - 1]
>> >> + 2) >> 2;
>> >>
>> >> /* Lifting stage 2. */
>> >> for (x = 0; x < width - 1; x++) {
>> >> synthline[2*x + 1] += (synthline[2*x]
>> >> + synthline[2*x + 2]
>> >> + 1) >> 1;
>> >> }
>> >> synthline[synth_width - 1] += (synthline[synth_width - 2]
>> >> + synthline[synth_width - 2]
>> >> + 1) >> 1;
>> >> synthline += synth_width;
>> >> }
>> >
>> > as already said the stages can be interleaved so that only a single pass
>> > is done over the data, this should significantly speed the code up
>> > as its likely limited by memory speed ...
>>
>> Yes... I now tried this (for 5/3), but I am afraid I did this the
>> wrong way. I assume I have to be very careful and apply the second
>> stage only for lines that will not be used by the first stage anymore?
>
> yes, the stuff must be done in order
Ok.
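A one-dimensional sketch of that ordering constraint, using the same (5,3) lifting formulas as the horizontal pass quoted above: the stage-2 update of odd sample 2n-1 only needs even samples 2n-2 and 2n, so it can run as soon as even sample 2n has been predicted, giving a single pass over the data.

```c
#include <assert.h>

/* width = number of even/odd sample pairs; A holds 2*width samples.
 * Borders are mirrored as in the code quoted above. */
static void idwt53_1d(short *A, int width)
{
    int n;
    A[0] -= (2 * A[1] + 2) >> 2;                        /* stage 1, x = 0 */
    for (n = 1; n < width; n++) {
        A[2*n]     -= (A[2*n - 1] + A[2*n + 1] + 2) >> 2; /* stage 1 */
        A[2*n - 1] += (A[2*n - 2] + A[2*n] + 1) >> 1;     /* stage 2, now safe */
    }
    A[2*width - 1] += (2 * A[2*width - 2] + 1) >> 1;    /* last odd sample */
}
```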
> [...]
>> >
>> > this is VERY inefficient, most of what is done here does not need to be done
>> > per pixel or can be precalculated
>>
>> Absolutely! gprof agrees with you ;-)
>>
>> Actually, this comment applies to all code above, I assume.
>>
>> time seconds seconds calls ms/call ms/call name
>> 40.85 2.41 2.41 88939320 0.00 0.00 upconvert
>> 29.49 4.15 1.74 50 34.80 113.59 dirac_decode_frame
>> 10.34 4.76 0.61 417867 0.00 0.00 motion_comp_block1ref
>> 6.78 5.16 0.40 540 0.74 0.74 dirac_subband_idwt_53
>> 4.24 5.41 0.25 8689089 0.00 0.00 coeff_unpack
>> 4.07 5.65 0.24 8707719 0.00 0.00 dirac_arith_read_uint
>>
>> Luca and I talked about if I would concentrate on further
>> optimizations or on the encoder and he said he preferred having me
>> working on the encoder for now, because I know Dirac quite well by now
>> (Luca, please tell me if I am wrong in any way).
>>
>> So there is quite some room for optimizations. In the specification,
>> the used pseudo code looped over the pixels. My current code at least
>> doesn't do that. But I agree, the MC code is where a lot of
>> optimizations can be made:
>>
>> - The qpel/eightpel interpolation code is very inefficient and perhaps
>> has to be merged with motion_comp_block1ref and
>> motion_comp_block2ref. In that case it can be optimized. We
>> especially have to take care of border cases to avoid clipping, more
>> or less like I did for halfpel interpolation.
>
> simply making upconvert() work not with 1 pixel but with a block of pixels
> should make it a lot more efficient
I moved this into motion_comp_block1ref/motion_comp_block2ref. This
works quite well and boosts the speed of the decoder by 50% and makes
further optimizations easier. Hopefully you do not have problems with
this approach.
>> Actually, this is the code which I dislike the most. But we had to
>> make a choice on how to spend the time.  Is this a showstopper for
>> you?
>
> no, though its something which should be optimized its likely the worst
> part of the dirac code ATM
The latest commits make this code work at a reasonable speed.  There is
still a lot of room for improvement, but it is not as terrible as it
used to be.  I will improve this further after SoC.
>> I hope people will be interested in optimizing my code, vectorizing
>> it, etc. after summer of code. I certainly am as much as time permits
>> (I have to graduate soon...). So after FFmpeg, feel free to add me to
>> the MAINTAINERS list for Dirac, if you think I am up to it and if
>> you'd like that.
>
> add yourself now to maintainers, you have svn write access :)
Sure, but first we need something to maintain in svn, so I'll do this
after SoC.
Thanks,
Marco