[MPlayer-dev-eng] [PATCH] VF Overlay

Fri Aug 7 10:18:11 CEST 2009

On Thu, Aug 06, 2009 at 11:51:57PM +0200, Benjamin Zores wrote:
> +/**
> + * \brief Array of all vf_overlay instances private data.
> + *
> + * Keep track of filter instances private data because the overlay buffer
> + * should be able to survive a loadfile or loop, so when the filter is
> + * initialized, we first check to see if we have an existing filter
> + * associated with the specified shared memory key and use that instead.
> + *
> + * As a result, vf_overlay instances are "persistent" (i.e., they don't get
> + * uninitialized). Consequently, the global variables below apply to all
> + * vf_overlay instances.
> + */
> +static struct vf_priv_s **vf_overlay_priv = NULL;
> +/// Number of vf_overlay instances
> +static int num_instances = 0;

I guess this causes all kind of issues when someone adds multiple -vf
overlay etc.?

> +#if HAVE_MMX
> +static uint64_t attribute_used __attribute__((aligned(8))) MM_global_alpha;
> +static uint64_t attribute_used __attribute__((aligned(8))) MM_ROUND = C64(0x80);
> +#endif

IMO please use the stuff from ffmpeg like DECLARE_ALIGNED, and I think
the second one should be const.

> +#define alloc(buf, size) \
> +        buf = (uint8_t *)memalign(16, size); \

Useless (or actually hurtful) cast

> +        if (!buf) return 0

I don't think this kind of thing is ever good. It hides the error
check, it is questionable if return 0 really is always the best thing to
do and it doesn't print any kind of message.
Actually, that return 0 will cause the memory from all previous
allocations to leak, so it clearly is wrong, so this indeed is yet
another case of "make broken code look good with a macro".

> +#define dealloc(buf) \
> +        if (buf) { \
> +            free(buf); \
> +            buf = 0; \
> +        }

Anyway I think those macros are extremely ugly, and the dealloc could be
a function anyway (just like av_freep).

> +static int alloc_overlay_data(struct vf_priv_s *priv)
> +{
> +    int w = priv->mpi_stride;
> +    int h = priv->mpi_h;
> +
> +    alloc(priv->y,       w * h    );
> +    alloc(priv->u,       w * h / 4);
> +    alloc(priv->v,       w * h / 4);
> +    alloc(priv->a,       w * h    );
> +    alloc(priv->uva,     w * h / 4);
> +
> +    /* Buffers for alpha-multiplied pixels */
> +    alloc(priv->pre_y,   w * h    );
> +    alloc(priv->pre_u,   w * h / 4);
> +    alloc(priv->pre_v,   w * h / 4);
> +    alloc(priv->pre_a,   w * h    );
> +    alloc(priv->pre_uva, w * h / 4);
> +
> +    alloc(priv->alpha_imgbuf, priv->w * priv->h);
> +
> +    /* Holds BGR24 version of the image buffer. We hold one extra byte
> +     * because in convert_bgra_to_yv12a() we copy 4 bytes at a time, but
> +     * offset only 3 bytes. This is faster than 3 (or 2) copies, but means
> +     * we need an extra byte so we're staying within the allocated buffer.
> +     */
> +    alloc(priv->bgr24_imgbuf, priv->w * priv->h * 3 + 1);

You could just do all
... = memalign(...)
... = memalign(...)

and then just check once
if (!priv->y || !priv->u || ...) {
  free everything
  return 0;
}

Though personally I'd favour
priv->y = memalign(...);
if (!priv->y) goto fail;
priv->u = memalign(...);
if (!priv->u) goto fail;
...
return 1;
fail:
free(priv->y);
priv->y = NULL;
...
return 0;

Actually you might be able to just call free_overlay_data in the fail
case, but you'd have to carefully verify that.

> +        shmdt((uint8_t *)priv->lockbyte);
> +        priv->lockbyte = 0;

Why the cast? And I am in favour of using NULL for pointers instead of
0.

> +/** \brief Free all buffers for all overlay filter instances.
> + *
> + * Because vf_overlay instances must survive a loadfile or loop, vf_uninit is
> + * not specified. Therefore, when the first vf_overlay instance is created,
> + * this function is registered with atexit(3), so that the shared memory
> + * segment allocated in vf_config is properly deleted. The overlay buffers are
> + * also freed in the call to free_overlay_data. (Although this is not strictly
> + * necessary since we are shutting down at this point, it is called for
> + * correctness.)

atexit is a horrible mess, "correctness" is not sufficient justification
to use it IMO.

> +    priv->w = (d_width  + 1) & ~1;
> +    priv->h = (d_height + 1) & ~1;
> +    priv->mpi_w      = (width  + 1) & ~1;
> +    priv->mpi_h      = (height + 1) & ~1;

FFALIGN(..., 2);

> +    char *accel_str;
> +    accel_str = "no acceleration";

The type is wrong, this should be causing a warning.
accel_str must be const char *

> +static inline void translate_coords(struct vf_priv_s *priv,
> +                                    int *x, int *y, int *w, int *h)
> +{
> +    float xdiff = (float)priv->w / priv->mpi_w;
> +    float ydiff = (float)priv->h / priv->mpi_h;
> +
> +    if (x)
> +        *x = (int)((float)*x / xdiff);
> +
> +    if (w) {
> +        *w = (int)((float)*w / xdiff);
> +        *w = FFMIN (*w, priv->mpi_w);
> +    }
> +
> +    if (y)
> +        *y = (int)((float)*y / ydiff);
> +
> +    if (h) {
> +        *h = (int)((float)*h / ydiff);
> +        *h = FFMIN (*h, priv->mpi_h);
> +    }

First, unless necessary it is bad style to do
c = a / b;
d = e / c;
when you can save one division by using
c = b /a;
d = e * c;

Second, there are quite a few places that have similar code, and MPlayer
I think usually uses just ints, e.g.:
*x = *x * priv->mpi_w / priv->w;

> +#define check_opaque(type) \
> +    if (*(type*)(p + x)) { \
> +        if (slice_y1 == -2) \
> +            slice_y1 = y; \
> +        else \
> +            slice_y2 = y; \
> +        x = row_stride; \
> +        break; \
> +    }
> +
> +    for (y = 0; y < h; y++) {
> +        for (x = 0; x < row_stride - 7; x += 8)
> +            check_opaque(uint64_t);
> +        for (; x < row_stride - 3; x += 4)
> +            check_opaque(uint32_t);
> +        for (; x < row_stride - 1; x += 2)
> +            check_opaque(uint16_t);

I find it very hard to imagine that the extra uint32_t case gains even the
slightest bit of speed.
Also the macro is again used to hide code duplication, not sure if there
is a nice way to avoid it, but it should at least be tried.

> +    i = priv->h != priv->mpi_h ? 4 : 0;
> +    do {
> +        dst_y = ry = av_clip((orig_y - i) & ~1, 0, priv->h);
> +        dst_h = rh = av_clip((orig_h + 1 + i*2) & ~1, 0, priv->h - ry);

Is FFALIGN appropriate for any of these? A comment what exactly it
calculates would be good IMO.

> +        translate_coords(priv, 0, &dst_y, 0, &dst_h);
> +        i += 2;
> +    } while (dst_y % 2 != 0 && i < ry);

dst_y & 1, don't tempt the compiler to actually do a division...

> +    for (i = priv->w * rh; i>0; i--, p_bgr32 += 4, p_alpha++, p_bgr24 += 3) {
> +        /* Moving 32 bits is faster than 3 separate assignments (or one 16
> +         * bit and and one 8 bit move). The BGR24 buffer has one extra byte
> +         * allocated to prevent an overrun.
> +         */
> +        *(uint32_t *)p_bgr24 = *(uint32_t *)p_bgr32;

You can't do that, p_bgr24 is not aligned.

> +    /* Dest is YV12 buffers offset to top of slice. */
> +    dst[0] = priv->y + (dst_y * priv->mpi_stride);
> +    dst[1] = priv->u + ((dst_y * priv->mpi_stride) >> 2);
> +    dst[2] = priv->v + ((dst_y * priv->mpi_stride) >> 2);

I think this should be (dst_y >> 1) * dst_strides[1]
because
1) dst_strides[1] and [2] are calculated anyway, so why not use it,
it gives the thing a proper name
2) While it is a given here, it removes the assumption of dst_y and
mpi_stride being divisible by two (which is necessary to merge the two
>> 1 into one >> 2) and fewer assumptions are always good.

> +    /* Source is overlay-sized alpha plane offset to top of slice. */
> +    src[0] = priv->alpha_imgbuf + (ry * priv->w);
> +    src[1] = src[2] = 0;
...
> +    dst[1] = dst[2] = 0;

NULL is nicer IMO.

> +    /* Round sizes up to multiples of 2. */
> +    r->w = (w + 1) & ~1;
> +    r->h = (h + 1) & ~1;

FFALIGN

> +    /* Ensure coordinates are within the overlay image boundaries */
> +    r->x = FFMAX (r->x, 0);
> +    r->x = FFMIN (r->x, priv->w);
> +
> +    r->y = FFMAX (r->y, 0);
> +    r->y = FFMIN (r->y, priv->h);
> +
> +    r->w = FFMAX (r->w, 0);
> +    r->w = FFMIN (r->w, priv->w - r->x);
> +
> +    r->h = FFMAX (r->h, 0);
> +    r->h = FFMIN (r->h, priv->h - r->y);

Why suddenly the spaces before the (?

> +/// Blends src on top of dst at the given alpha level.
> +#define blend_byte(dst, src, alpha) multiply_alpha(dst, alpha) + src;

can be a static inline function, too.

> +        premultiply_alpha_byte(*(byte++), *(alpha++),

The () are not necessary.

> +    asm volatile(
> +        "pxor %%mm7, %%mm7\n\t"                // zero out %mm7
> +        "pcmpeqb %%mm4, %%mm4\n\t"             // %mm4 = 255's
> +        "movq (%3), %%mm5\n\t"        // %mm5 = alpha
> +        "cmp $255, %4\n\t"           // don't apply layer alpha if it's 100% opaque
> +        "je 42f\n\t"

Just use 0, not 42, it's needlessly confusing.
I also think it would be far more readable if the operands were nicely
aligned.

> +        :  "+r" (dst_byte),             // %0
> +        "+r" (dst_alpha)             // %1

I think most likely these must be early-clobber (&)

> +        for (x = 0; x < (rw & ~7); x += 8)

Code above used - 7. It doesn't really matter if - 7 or & ~7, but it
should be the same everywhere.

> +            *(dst + x) = blend_byte(*(src+x), *(overlay+x), *(alpha+x));

I'd say it looks much nicer like this:

dst[x] = blend_byte(src[x], overlay[x], alpha[x]);

> +    int i, y, q = w / 8, r = w % 8;

Ugh, why use division?

> +    x86_reg wr = (x86_reg) w;

Useless cast?

> +                : "+r" (dst),         // %0
> +                "+r" (src),           // %1
> +                "+r" (overlay),       // %2
> +                "+r" (alpha)          // %3

Probably early-clobber?

> +                *(dst+i) = blend_byte(*(src+i), *(overlay+i), *(alpha+i));

See above for less-ugly...

And now I am too lazy to review the rest.