[FFmpeg-devel] [PATCH] Bink Video decoder for FFmpeg 0.6

Tue Feb 16 20:45:31 CET 2010

I'll try my hand at some patch review.

+    do {
+        if (!get_bits1(gb)) {
+            *dst++ = *src++;
+            size--;
+        } else {
+            *dst++ = *src2++;
+            size2--;
+        }
+    } while (size && size2);

Maybe dst++ should be moved out of the if?  I don't trust gcc to
optimize away the duplicated increment that would otherwise be in each
branch.

+    while (size--)
+        *dst++ = *src++;
+    while (size2--)
+        *dst++ = *src2++;

This looks like a memcpy to me.

+            v = -v;

Look at some of the existing decoders for better ways to branchlessly
negate a value based on a get_bits1(gb) call.

+            if (v & 0x80)
+                v = -(v & 0x7F);

Same here.

+            for (j = 0; j < len2; j++)
+                *dst++ = v;

This looks like a memset to me.

+                    mode_list[list_pos] = ((ccoef + 4) << 2) | 1;

+1 I would guess is better here than |1 because it can be done in an
lea instruction (due to the shift).  On any non-x86 it'll be neutral.
Same with all similar lines.

+            switch (mode) {
+            case 0:
+            case 2:
+                if (!mode) {
+                    mode_list[list_pos] = ((ccoef + 4) << 2) | 1;
+                } else {
+                    mode_list[list_pos++] = 0;
+                }

Yes, this was a good joke, I laughed too, but the if statement can be
probably simplified slightly if you move the first if above "case 2:"
;)  Same with residuals section.

+                        if (!bits) {
+                            t = get_bits1(gb) ? -1 : 1;
+                        } else {
+                            t = get_bits(gb, bits) | mask;
+                            if (get_bits1(gb))
+                                t = -t;
+                        }

Is get_bits valid with bits=0?  If so, this can be simplified.

Also, again with the ? -1 : 1 and t = -t and so forth: make it
branchless if gcc doesn't already.

+#define PUT2x2(dst, stride, x, y, pix) \
+    dst[(x)*2 +     (y)*2       * stride] = \
+    dst[(x)*2 + 1 + (y)*2       * stride] = \
+    dst[(x)*2 +     ((y)*2 + 1) * stride] = \
+    dst[(x)*2 + 1 + ((y)*2 + 1) * stride] = pix;
+

Write-combining would be appropriate here with intreadwrite macros,
depending on whether or not the block is unaligned.

+{
+    c->dsp.get_pixels(block, src, stride);
+    c->dsp.put_pixels_clamped(block, dst, stride);
+}

Does this even make any sense?  Wouldn't the pixels already be clamped
if you're copying them?  If you may in some cases be adding on top of
existing pixels, maybe split that case out?  There's huge inefficiency
going on here if only because of the constant and pointless conversion
to DCTELEM.

+        bw = plane ? (avctx->width  + 15) >> 4 : (avctx->width  + 7) >> 3;
+        bh = plane ? (avctx->height + 15) >> 4 : (avctx->height + 7) >> 3;

(avctx->height + (16>>!!plane)-1) >> (4-!!plane) ?

+        if (!plane || !c->swap_planes)
+            plane_idx = plane;
+        else
+            plane_idx = plane ^ 3;

I'd ternary this.

if ((by & 1) && blk == 1) {
+                    bx++;
+                    dst  += 8;
+                    prev += 8;
+                    continue;
+                }

What is "1"?  Right after this, you have a switch that uses enums for
blk, so it isn't clear.  Why isn't this part of the switch?

+                                    PUT2x2(dst, stride, pos & 7, pos >> 3, v);

Is it really efficient code-wise to add this extra &7/>>3 logic just
to make the scan array a little bit smaller?

+                        for (j = 0; j < 8; j++)
+                            for (i = 0; i < 8; i++)
+                                PUT2x2(dst, stride, i, j, block[i + j*8]);

OK, this is seriously inefficient now.  The run stuff was bad enough:
I think we need a DSP function that takes an 8x8 array and outputs it
to 16x16.  This would be so much faster in asm.  Split it out, put it
in a DSP function, port all the code to use it (including run stuff),
and then we can write asm later.

+                        for (j = 0; j < 16; j++)
+                            memset(dst + j*stride, v, 16);

Similarly here.  If we don't have a DSP function to fill a block with
a constant value, we should make one.

+                    default:
+                        av_log(avctx, AV_LOG_ERROR, "Incorrect 16x16
block type %d\n", blk);
+                        return -1;
+                    }

Is it even possible for this code to trigger given the limited values
that blk can get when reading from even an invalid bitstream?

+                    copy_block(c, block, ref, dst, stride);
+                    c->dsp.clear_block(block);
+                    v = get_bits(&gb, 7);
+                    read_residue(&gb, block, v);
+                    c->dsp.add_pixels_clamped(block, dst, stride);

This seems wasteful as well.  You're treating pixels like DCT
coefficients and using DCT coefficient functions on them, which seems
silly.

+                    for (i = 0; i < 64; i++)
+                        block[i] = (block[i] * quant[i]) >> 11;

Why isn't this done as part of the decode residual function?  You're
dequanting all 64 coefficients, even though most will be zero.

Pass done.

Dark Shikari