[MPlayer-dev-eng] [PATCH] VF Overlay

Sat Aug 8 01:56:07 CEST 2009

On Sat, 2009-08-08 at 00:25 +0200, Benjamin Zores wrote:
> Thoug, I'm not at all original patch author, hence some parts
> of it are black magic to me. All I did was removing some controversial
> parts and fixing some stuff.

I'll try to be as much help as possible.

> > I guess this causes all kind of issues when someone adds multiple -vf
> > overlay etc.?
> 
> After a few thoughts, I've removed this array of overlays feature.
> This was pretty useless.

Yeah, I'm also not sure what I was thinking at the time.  The code was
there specifically to support multiple overlays (but global references
kept to each instance, because having the overlay contents survive a
loadfile or loop was a design goal), but that's clearly not going to
work because slave commands would then get processed by all instances.

In that case, I guess the right thing to do is just error out if
multiple overlay instances are specified.

> >> +    for (y = 0; y < h; y++) {
> >> +        for (x = 0; x < row_stride - 7; x += 8)
> >> +            check_opaque(uint64_t);
> >> +        for (; x < row_stride - 3; x += 4)
> >> +            check_opaque(uint32_t);
> >> +        for (; x < row_stride - 1; x += 2)
> >> +            check_opaque(uint16_t);
> > 
> > I find it very hard to imagine that the extra uint32_t case gains even the
> > slightest bit of speed.
> 
> What does that mean ?

What he means is that looping over the line 8 bytes at a time and then
finishing up the remainder (which is not visible by 8) 2 bytes at a time
thereafter is probably not going to be measurably slower than doing 8,
then 4, then 2.

It might even be faster.  But it's probably not worth going through the
effort of measuring, so you can just remove the for loop with
check_opaque(uint32_t);

> >> +    i = priv->h != priv->mpi_h ? 4 : 0;
> >> +    do {
> >> +        dst_y = ry = av_clip((orig_y - i) & ~1, 0, priv->h);
> >> +        dst_h = rh = av_clip((orig_h + 1 + i*2) & ~1, 0, priv->h - ry);
> > 
> > Is FFALIGN appropriate for any of these? A comment what exactly it
> > calculates would be good IMO.
> 
> Can't see where ...
> Also, I unfortunately have no idea what it tries to calculate.

Is the comment that immediately precedes the loop insufficient?  

The idea of the loop is that, in the case where the overlay image and
mpi are different heights (and therefore the overlay will need to be
scaled to fit the mpi), grow the slice (by shrinking y, and extending h
accordingly) until we determine a y value that is divisible by 2 for
both the pre-scaled y (on the overlay) and post-scaled y (on the mpi).

Hopefully that makes sense.

FFALIGN would work, but you'd want to subtract 1 for dst_y since we want
to round down, rather than up.  Something like:

        dst_y = ry = av_clip(FFALIGN(orig_y - i - 1, 2), 0, priv->h);
        dst_h = rh = av_clip(FFALIGN(orig_h + 1 + i*2, 2), 0, priv->h - ry);

> >> +    /* Dest is YV12 buffers offset to top of slice. */
> >> +    dst[0] = priv->y + (dst_y * priv->mpi_stride);
> >> +    dst[1] = priv->u + ((dst_y * priv->mpi_stride) >> 2);
> >> +    dst[2] = priv->v + ((dst_y * priv->mpi_stride) >> 2);
> > 
> > I think this should be (dst_y >> 1) * dst_strides[1]
> > because
> > 1) dst_strides[1] and [2] are calculated anyway, so why not use it,
> > it gives the thing a proper name
> > 2) While it is a given here, it removes the assumption of dst_y and
> > mpi_stride being divisible by two (which is necessary to merge the two
> >>> 1 into one >> 2) and fewer assumptions are always good.
> 
> Can you explain/detail this a bit more.
> I'm afraid being lost with this hunk.

He means to move the dst[x] assignments below dst_strides (the next 3
lines of code, not quoted above) and calculate dst[x] like:

        dst[0] = priv->y + dst_y * dst_strides[0];
        dst[1] = priv->y + (dst_y >> 1) * dst_strides[1];
        dst[2] = priv->y + (dst_y >> 1) * dst_strides[2];

This is perhaps subtly cleaner.

> >> +    asm volatile(
> >> +        "pxor %%mm7, %%mm7\n\t"                // zero out %mm7
> >> +        "pcmpeqb %%mm4, %%mm4\n\t"             // %mm4 = 255's
> >> +        "movq (%3), %%mm5\n\t"        // %mm5 = alpha
> >> +        "cmp $255, %4\n\t"           // don't apply layer alpha if it's 100% opaque
> >> +        "je 42f\n\t"
> > 
> > Just use 0, not 42, it's needlessly confusing.
> 
> Note that I have absolutely no understanding of ASM, being MMX or not.
> I just tried to replace 42 by 0, as suggested and MPlayer just crashes.

Yeah, '42' was just my lame attempt at being cute, I guess. :)  You'll
also have to change the label, which is a few lines down that begins
with "42:".  Change it to "0:" in addition to "je 0f"

> > I also think it would be far more readable if the operands were nicely
> > aligned.
> > 
> >> +        :  "+r" (dst_byte),             // %0
> >> +        "+r" (dst_alpha)             // %1
> > 
> > I think most likely these must be early-clobber (&)
> 
> What does that mean ?

I'm treading into a territory I have only but the most tentative of
grasps (even though I wrote that asm, a lot of it was trial and error
combined with short-lived caffeine-induced epiphanies), I can sort of
see why dst_byte would need to be early clobber (so change "+r" to
"+&r", but dst_alpha is written after all the inputs are read.

Need some more guidance from Reimar, I think.

> >> +        for (x = 0; x < (rw & ~7); x += 8)
> > 
> > Code above used - 7. It doesn't really matter if - 7 or & ~7, but it
> > should be the same everywhere.
> 
> Where have you see - 7 ?

The code that uses the check_opaque macro.  You might as well change it
above to use & ~7 (and & ~1 for the uint16_t variant).

Or change this to use rw - 7, it's up to you. :)

> >> +    int i, y, q = w / 8, r = w % 8;
> > 
> > Ugh, why use division?
> 
> What'd you prefer ? w >> 3 ?

Doesn't the compiler optimize this trivial stuff anyway?  I used the
current code because it seemed more readable.

> >> +                : "+r" (dst),         // %0
> >> +                "+r" (src),           // %1
> >> +                "+r" (overlay),       // %2
> >> +                "+r" (alpha)          // %3
> > 
> > Probably early-clobber?
> 
> Same question.

AFAIK it's used to hint to the compiler that input arguments (%4 in this
case, not quoted above) may be read even after the last write to all the
output arguments.  So change all the "+r" to "+&r", presumably.

Thanks,
Jason.