[FFmpeg-devel] [PATCH] h264 parallelized, (was: Parallelized h264 proof-of-concept)

Sat Jun 16 23:03:22 CEST 2007

Hi

Michael Niedermayer wrote:
> Hi
> 
> On Fri, Jun 15, 2007 at 10:10:54PM +0200, Andreas ?man wrote:
>> Andreas ?man wrote:
>>> Hi
>>>
>>> Michael Niedermayer wrote:
>>>> Hi
>>>>
>>>>
>>>> av_free() + av_malloc() or pass an argument to MPV_common_init()
>>> I'll extend MPV_common_init() with an additional argument then.
>>>
>>>>> static void filter_mb_fast( H264Context *h, int mb_x, int mb_y, uint8_t 
>>>>> *img_y, uint8_t *img_cb, uint8_t *img_cr, unsigned int linesize, 
>>>>> unsigned int uvlinesize);
>>>>> +static void execute_decode_slices(H264Context *h, int reset);
>>>> cant you order the new functions so as to avoid that?
>>> Hm, not really. decode_slice_header() needs to be able to
>>> fire off any pending slices in case the deblocking-type changes
>>> within a frame. (AFAIK this is valid according to the specs,
>>> perhaps I'm wrong?)
>>>
>> Okay, here are the finalized patches.
>>
>> #1 - Extend MPV_common_init() with an addition arg for context size
>>      when doing multi threading.
> 
> hmm, seeing the patch, i think i would prefer some simpler solution,

I tend to agree, this is a bit too intrusive. But i thought
you were okay with it.

> maybe adding the size to MpegEncContext? or even better adding
> thread_context[] to H264Context, this would also avoid the casts to
> H264Context

Personally, i like the idea of having mpegvideo.c do most of the
thread context setup. In the long run i guess it will be less
code duplication and such things.

How about just letting MPV_common_init use ctx->codec->priv_data_size
for malloc'ing the contexts? Or is that too inflexible, in that case
I tend to favor adding "priv_data_size" to MpegEncContext

>> #4 - Slice level parallelism for deblocking type 0 and 2
>>
>> regression tests passes
> 
> regression tests ? theres no h.264 regression test ...
> i assume you mean you tested this on several h.264 streams and the
> output is binary identical ...

It was more to make sure that i didn't screw anything else up when
doing the MPV_common_init() thingie.

I've tested with all the h264 samples from mphq and they
are all md5sum identical. I've also got a few off-list replies
with people testing the code whom reported back various issues
which I've fixed (Thanks to all of you).

> [...]
>> @@ -3022,8 +3067,18 @@
>>      MpegEncContext * const s = &h->s;
>>      int temp8, i;
>>      uint64_t temp64;
>> -    int deblock_left = (s->mb_x > 0);
>> -    int deblock_top  = (s->mb_y > 0);
>> +    int deblock_left;
>> +    int deblock_top;
>> +    int mb_xy;
>> +
>> +    if(h->deblocking_filter == 2) {
>> +        mb_xy = s->mb_x + s->mb_y*s->mb_stride;
>> +        deblock_left = h->slice_table[mb_xy] == h->slice_table[mb_xy - 1];
>> +        deblock_top  = h->slice_table[mb_xy] == h->slice_table[h->top_mb_xy];
>> +    } else {
>> +        deblock_left = (s->mb_x > 0);
>> +        deblock_top = (s->mb_y > 0);
>> +    }
> 
> is this multitrheading specific? or a deblocking_filter == 2 fix? in the later
> case it should be in a seperate patch

It's required for multi-threading to work correctly. Otherwise,
the xchg-code might incorrectly swap in invalid data since
no deblocking has yet completed for the previous row
(at least, thats how I understand it)

Having said that, it does give a slight performance improvement
for deblocking_type == 2, so i guess it could go in as a separate patch.

> [...]
>>      if(!FRAME_MBAFF){
>>          int qp_thresh = 15 - h->slice_alpha_c0_offset - FFMAX(0, h->pps.chroma_qp_index_offset);
>>          int qp = s->current_picture.qscale_table[mb_xy];
>> -        if(qp <= qp_thresh
>> -           && (mb_x == 0 || ((qp + s->current_picture.qscale_table[mb_xy-1] + 1)>>1) <= qp_thresh)
>> -           && (mb_y == 0 || ((qp + s->current_picture.qscale_table[h->top_mb_xy] + 1)>>1) <= qp_thresh)){
>> -            return;
>> +
>> +        if(qp <= qp_thresh) {
>> +            if(h->deblocking_filter == 1) {
>> +                if((mb_x == 0 || ((qp + s->current_picture.qscale_table[mb_xy-1]      + 1)>>1) <= qp_thresh) &&
>> +                   (mb_y == 0 || ((qp + s->current_picture.qscale_table[h->top_mb_xy] + 1)>>1) <= qp_thresh))
>> +                    return;
>> +            } else {
>> +                int mb_slice = h->slice_table[mb_xy];
>> +                int left_qp, top_qp;
>> +
>> +                left_qp = h->slice_table[mb_xy - 1]    == mb_slice ? ((qp + s->current_picture.qscale_table[mb_xy-1]      + 1)>>1) : 0;
>> +                top_qp  = h->slice_table[h->top_mb_xy] == mb_slice ? ((qp + s->current_picture.qscale_table[h->top_mb_xy] + 1)>>1) : 0;
>> +                if(left_qp <= qp_thresh &&
>> +                   top_qp  <= qp_thresh)
>> +                    return;
>> +            }
>>          }
>>      }
> 
> is this specific to threads?

Hm .. now when i come to think of it, it might not be needed at all.

The only time when the current code would incorrectly leave out a
deblocking round is when it happens on a slice boundary (the previous
slice may not be complete yet, and thus
qscale_table[mb_xy - <somevalue>] may be incorrect (it's still the
value from the previous frame).

These cases will be ruled out anyway, a bit further down
in the code. I'll rip this change out.

>> +
>> +    h->max_contexts = avctx->thread_count > 0 ? avctx->thread_count : 1;
> 
> i think thread_count must be >0

Hmm, set by whom? The application certainly doesn't need to and AFAIK
lavc doesn't peg it up to 1 either.

> 
> 
> [...]
>> +        hx = (H264Context *)s->thread_context[h->current_context] ? (H264Context *)s->thread_context[h->current_context] : h;
> 
> what about s->thread_context[0] == h, i think that would avoid this check?
> or does that cause other parts of the code to become more complex?

s->thread_context[0] is initialized during MPV_common_init(),
which is not called until the first slice is processed.
That's why I added this check. But you are right, i'm gonna add

s->thread_context[0] = h;

to decode_init() to avoid this check.

Good point.

> 
> 
> [...]
>>          case NAL_DPA:
>> -            init_get_bits(&s->gb, ptr, bit_length);
>> -            h->intra_gb_ptr=
>> -            h->inter_gb_ptr= NULL;
>> -            s->data_partitioning = 1;
>> +             init_get_bits(&hx->s.gb, ptr, bit_length);
>> +             hx->intra_gb_ptr=
>> +             hx->inter_gb_ptr= NULL;
>> +             hx->s.data_partitioning = 1;
>>  
>> -            if(decode_slice_header(h) < 0){
>> +            if(decode_slice_header(hx, h) < 0){
>>                  av_log(h->s.avctx, AV_LOG_ERROR, "decode_slice_header error\n");
>>              }
> 
> indention is wrong here

I'll look into it. btw parts of decode_nal_units() in svn is incorrectly
indented, wouldn't it be wise to fix that up? Should i send a patch?

> 
> 
> [...]
>> Index: libavcodec/h264.h
>> ===================================================================
>> --- libavcodec/h264.h	(revision 9281)
>> +++ libavcodec/h264.h	(working copy)
>> @@ -381,6 +381,16 @@
>>      const uint8_t *field_scan8x8_cavlc_q0;
>>  
>>      int x264_build;
>> +
>> +    /* Slice-based multi threading members.
>> +     * These are only used in the "master" context */
> 
> doxygen supports comments on groups of variables, this should be used
> here

Ok