[MPlayer-dev-eng] [PATCH] VF Overlay
Reimar Döffinger
Reimar.Doeffinger at gmx.de
Fri Aug 7 10:18:11 CEST 2009
On Thu, Aug 06, 2009 at 11:51:57PM +0200, Benjamin Zores wrote:
> +/**
> + * \brief Array of all vf_overlay instances private data.
> + *
> + * Keep track of filter instances private data because the overlay buffer
> + * should be able to survive a loadfile or loop, so when the filter is
> + * initialized, we first check to see if we have an existing filter
> + * associated with the specified shared memory key and use that instead.
> + *
> + * As a result, vf_overlay instances are "persistent" (i.e., they don't get
> + * uninitialized). Consequently, the global variables below apply to all
> + * vf_overlay instances.
> + */
> +static struct vf_priv_s **vf_overlay_priv = NULL;
> +/// Number of vf_overlay instances
> +static int num_instances = 0;
I guess this causes all kind of issues when someone adds multiple -vf
overlay etc.?
> +#if HAVE_MMX
> +static uint64_t attribute_used __attribute__((aligned(8))) MM_global_alpha;
> +static uint64_t attribute_used __attribute__((aligned(8))) MM_ROUND = C64(0x80);
> +#endif
IMO please use the stuff from ffmpeg like DECLARE_ALIGNED, and I think
the second one should be const.
> +#define alloc(buf, size) \
> + buf = (uint8_t *)memalign(16, size); \
Useless (or actually hurtful) cast
> + if (!buf) return 0
I don't think this kind of thing is ever good. It hides the error
check, it is questionable if return 0 really is always the best thing to
do and it doesn't print any kind of message.
Actually, that return 0 will cause the memory from all previous
allocations to leak, so it clearly is wrong, so this indeed is yet
another case of "make broken code look good with a macro".
> +#define dealloc(buf) \
> + if (buf) { \
> + free(buf); \
> + buf = 0; \
> + }
Anyway I think those macros are extremely ugly, and the dealloc could be
a function anyway (just like av_freep).
> +static int alloc_overlay_data(struct vf_priv_s *priv)
> +{
> + int w = priv->mpi_stride;
> + int h = priv->mpi_h;
> +
> + alloc(priv->y, w * h );
> + alloc(priv->u, w * h / 4);
> + alloc(priv->v, w * h / 4);
> + alloc(priv->a, w * h );
> + alloc(priv->uva, w * h / 4);
> +
> + /* Buffers for alpha-multiplied pixels */
> + alloc(priv->pre_y, w * h );
> + alloc(priv->pre_u, w * h / 4);
> + alloc(priv->pre_v, w * h / 4);
> + alloc(priv->pre_a, w * h );
> + alloc(priv->pre_uva, w * h / 4);
> +
> + alloc(priv->alpha_imgbuf, priv->w * priv->h);
> +
> + /* Holds BGR24 version of the image buffer. We hold one extra byte
> + * because in convert_bgra_to_yv12a() we copy 4 bytes at a time, but
> + * offset only 3 bytes. This is faster than 3 (or 2) copies, but means
> + * we need an extra byte so we're staying within the allocated buffer.
> + */
> + alloc(priv->bgr24_imgbuf, priv->w * priv->h * 3 + 1);
You could just do all
... = memalign(...)
... = memalign(...)
and then just check once
if (!priv->y || !priv->u || ...) {
free everything
return 0;
}
Though personally I'd favour
priv->y = memalign(...);
if (!priv->y) goto fail;
priv->u = memalign(...);
if (!priv->u) goto fail;
...
return 1;
fail:
free(priv->y);
priv->y = NULL;
...
return 0;
Actually you might be able to just call free_overlay_data in the fail
case, but you'd have to carefully verify that.
> + shmdt((uint8_t *)priv->lockbyte);
> + priv->lockbyte = 0;
Why the cast? And I am in favour of using NULL for pointers instead of
0.
> +/** \brief Free all buffers for all overlay filter instances.
> + *
> + * Because vf_overlay instances must survive a loadfile or loop, vf_uninit is
> + * not specified. Therefore, when the first vf_overlay instance is created,
> + * this function is registered with atexit(3), so that the shared memory
> + * segment allocated in vf_config is properly deleted. The overlay buffers are
> + * also freed in the call to free_overlay_data. (Although this is not strictly
> + * necessary since we are shutting down at this point, it is called for
> + * correctness.)
atexit is a horrible mess, "correctness" is not sufficient justification
to use it IMO.
> + priv->w = (d_width + 1) & ~1;
> + priv->h = (d_height + 1) & ~1;
> + priv->mpi_w = (width + 1) & ~1;
> + priv->mpi_h = (height + 1) & ~1;
FFALIGN(..., 2);
> + char *accel_str;
> + accel_str = "no acceleration";
The type is wrong, this should be causing a warning.
accel_str must be const char *
> +static inline void translate_coords(struct vf_priv_s *priv,
> + int *x, int *y, int *w, int *h)
> +{
> + float xdiff = (float)priv->w / priv->mpi_w;
> + float ydiff = (float)priv->h / priv->mpi_h;
> +
> + if (x)
> + *x = (int)((float)*x / xdiff);
> +
> + if (w) {
> + *w = (int)((float)*w / xdiff);
> + *w = FFMIN (*w, priv->mpi_w);
> + }
> +
> + if (y)
> + *y = (int)((float)*y / ydiff);
> +
> + if (h) {
> + *h = (int)((float)*h / ydiff);
> + *h = FFMIN (*h, priv->mpi_h);
> + }
First, unless necessary it is bad style to do
c = a / b;
d = e / c;
when you can save one division by using
c = b /a;
d = e * c;
Second, there are quite a few places that have similar code, and MPlayer
I think usually uses just ints, e.g.:
*x = *x * priv->mpi_w / priv->w;
> +#define check_opaque(type) \
> + if (*(type*)(p + x)) { \
> + if (slice_y1 == -2) \
> + slice_y1 = y; \
> + else \
> + slice_y2 = y; \
> + x = row_stride; \
> + break; \
> + }
> +
> + for (y = 0; y < h; y++) {
> + for (x = 0; x < row_stride - 7; x += 8)
> + check_opaque(uint64_t);
> + for (; x < row_stride - 3; x += 4)
> + check_opaque(uint32_t);
> + for (; x < row_stride - 1; x += 2)
> + check_opaque(uint16_t);
I find it very hard to imagine that the extra uint32_t case gains even the
slightest bit of speed.
Also the macro is again used to hide code duplication, not sure if there
is a nice way to avoid it, but it should at least be tried.
> + i = priv->h != priv->mpi_h ? 4 : 0;
> + do {
> + dst_y = ry = av_clip((orig_y - i) & ~1, 0, priv->h);
> + dst_h = rh = av_clip((orig_h + 1 + i*2) & ~1, 0, priv->h - ry);
Is FFALIGN appropriate for any of these? A comment what exactly it
calculates would be good IMO.
> + translate_coords(priv, 0, &dst_y, 0, &dst_h);
> + i += 2;
> + } while (dst_y % 2 != 0 && i < ry);
dst_y & 1, don't tempt the compiler to actually do a division...
> + for (i = priv->w * rh; i>0; i--, p_bgr32 += 4, p_alpha++, p_bgr24 += 3) {
> + /* Moving 32 bits is faster than 3 separate assignments (or one 16
> + * bit and and one 8 bit move). The BGR24 buffer has one extra byte
> + * allocated to prevent an overrun.
> + */
> + *(uint32_t *)p_bgr24 = *(uint32_t *)p_bgr32;
You can't do that, p_bgr24 is not aligned.
> + /* Dest is YV12 buffers offset to top of slice. */
> + dst[0] = priv->y + (dst_y * priv->mpi_stride);
> + dst[1] = priv->u + ((dst_y * priv->mpi_stride) >> 2);
> + dst[2] = priv->v + ((dst_y * priv->mpi_stride) >> 2);
I think this should be (dst_y >> 1) * dst_strides[1]
because
1) dst_strides[1] and [2] are calculated anyway, so why not use it,
it gives the thing a proper name
2) While it is a given here, it removes the assumption of dst_y and
mpi_stride being divisible by two (which is necessary to merge the two
>> 1 into one >> 2) and fewer assumptions are always good.
> + /* Source is overlay-sized alpha plane offset to top of slice. */
> + src[0] = priv->alpha_imgbuf + (ry * priv->w);
> + src[1] = src[2] = 0;
...
> + dst[1] = dst[2] = 0;
NULL is nicer IMO.
> + /* Round sizes up to multiples of 2. */
> + r->w = (w + 1) & ~1;
> + r->h = (h + 1) & ~1;
FFALIGN
> + /* Ensure coordinates are within the overlay image boundaries */
> + r->x = FFMAX (r->x, 0);
> + r->x = FFMIN (r->x, priv->w);
> +
> + r->y = FFMAX (r->y, 0);
> + r->y = FFMIN (r->y, priv->h);
> +
> + r->w = FFMAX (r->w, 0);
> + r->w = FFMIN (r->w, priv->w - r->x);
> +
> + r->h = FFMAX (r->h, 0);
> + r->h = FFMIN (r->h, priv->h - r->y);
Why suddenly the spaces before the (?
> +/// Blends src on top of dst at the given alpha level.
> +#define blend_byte(dst, src, alpha) multiply_alpha(dst, alpha) + src;
can be a static inline function, too.
> + premultiply_alpha_byte(*(byte++), *(alpha++),
The () are not necessary.
> + asm volatile(
> + "pxor %%mm7, %%mm7\n\t" // zero out %mm7
> + "pcmpeqb %%mm4, %%mm4\n\t" // %mm4 = 255's
> + "movq (%3), %%mm5\n\t" // %mm5 = alpha
> + "cmp $255, %4\n\t" // don't apply layer alpha if it's 100% opaque
> + "je 42f\n\t"
Just use 0, not 42, it's needlessly confusing.
I also think it would be far more readable if the operands were nicely
aligned.
> + : "+r" (dst_byte), // %0
> + "+r" (dst_alpha) // %1
I think most likely these must be early-clobber (&)
> + for (x = 0; x < (rw & ~7); x += 8)
Code above used - 7. It doesn't really matter if - 7 or & ~7, but it
should be the same everywhere.
> + *(dst + x) = blend_byte(*(src+x), *(overlay+x), *(alpha+x));
I'd say it looks much nicer like this:
dst[x] = blend_byte(src[x], overlay[x], alpha[x]);
> + int i, y, q = w / 8, r = w % 8;
Ugh, why use division?
> + x86_reg wr = (x86_reg) w;
Useless cast?
> + : "+r" (dst), // %0
> + "+r" (src), // %1
> + "+r" (overlay), // %2
> + "+r" (alpha) // %3
Probably early-clobber?
> + *(dst+i) = blend_byte(*(src+i), *(overlay+i), *(alpha+i));
See above for less-ugly...
And now I am too lazy to review the rest.
More information about the MPlayer-dev-eng
mailing list