[MPlayer-dev-eng] [PATCH] VF Overlay

Sat Aug 8 08:28:08 CEST 2009

On Sat, Aug 08, 2009 at 12:25:11AM +0200, Benjamin Zores wrote:
> >> +#define check_opaque(type) \
> >> +    if (*(type*)(p + x)) { \
> >> +        if (slice_y1 == -2) \
> >> +            slice_y1 = y; \
> >> +        else \
> >> +            slice_y2 = y; \
> >> +        x = row_stride; \
> >> +        break; \
> >> +    }
> >> +
> >> +    for (y = 0; y < h; y++) {
> >> +        for (x = 0; x < row_stride - 7; x += 8)
> >> +            check_opaque(uint64_t);
> >> +        for (; x < row_stride - 3; x += 4)
> >> +            check_opaque(uint32_t);
> >> +        for (; x < row_stride - 1; x += 2)
> >> +            check_opaque(uint16_t);
> > 
> > I find it very hard to imagine that the extra uint32_t case gains even the
> > slightest bit of speed.
> 
> What does that mean ?

Is it clear what the code does?
It checks if a certain line contains any alpha value != 0.
As an optimization, it does that not byte-by-byte but checks 8
values at once (uint64_t).
Of course, if the width is not divisible by 8, that leaves a few
pixels you can't check like that.
That's what the uint32_t and uint16_t cases are for, but which check
only 4 and 2 pixels at once.
But since the 8 pixels at once loop already checks all except at most
the 6 last pixels, the check_opaque(uint32_t) loop will be done at most
once, thus save only a few instructions over checking all of those last
pixels in the check_opaque(uint16_t) loop, but adding an extra check and
branch.
Or to make matters short, IMO the code should at most be

> >> +        for (x = 0; x < row_stride - 7; x += 8)
> >> +            check_opaque(uint64_t);
> >> +        for (; x < row_stride - 1; x += 2)
> >> +            check_opaque(uint16_t);

And to avoid the code duplication, I'd suggest something like (speed
should probably be tested, though I see no reason why it should be
significantly slower):
uint64_t v = 0;
for (x = 0; !v && row_stride - 7; x += 8)
    v = *(uint64_t *)(p + x);
for (; !v && row_stride - 1; x += 2)
    v = *(uint16_t *)(p + x);
if (v) {
    if (slice_y1 == -2)
        slice_y1 = y;
    else
        slice_y2 = y;
}

> > Also the macro is again used to hide code duplication, not sure if there
> > is a nice way to avoid it, but it should at least be tried.
> 
> Isn't that the goal of macros (avoiding code duplication) ?

I call it "hiding code duplication" because at least in the binary the
code is still duplicated, and also because as above code shows there
is no need for a macro to avoid the duplication.

> >> +    i = priv->h != priv->mpi_h ? 4 : 0;
> >> +    do {
> >> +        dst_y = ry = av_clip((orig_y - i) & ~1, 0, priv->h);
> >> +        dst_h = rh = av_clip((orig_h + 1 + i*2) & ~1, 0, priv->h - ry);
> > 
> > Is FFALIGN appropriate for any of these? A comment what exactly it
> > calculates would be good IMO.
> 
> Can't see where ...

Well,
(orig_y - i) & ~1
could be written as
FFALIGN(orig_y - i - 1, 2)
and
orig_h + 1 + i*2
as
FFALIGN(orig_h + i*2, 2)
If that makes sense depends on the meaning of that code...

> Also, I unfortunately have no idea what it tries to calculate.
> 
> >> +    for (i = priv->w * rh; i>0; i--, p_bgr32 += 4, p_alpha++, p_bgr24 += 3) {
> >> +        /* Moving 32 bits is faster than 3 separate assignments (or one 16
> >> +         * bit and and one 8 bit move). The BGR24 buffer has one extra byte
> >> +         * allocated to prevent an overrun.
> >> +         */
> >> +        *(uint32_t *)p_bgr24 = *(uint32_t *)p_bgr32;
> > 
> > You can't do that, p_bgr24 is not aligned.
> 
> So, am I suppose to declare p_bgr24 as:
> 
> DECLARE_ALIGNED(32, uint8_t, *p_bgr24);
> 
> instead of:
> 
> uint8_t *p_bgr24
> 
> Is that so ?

No, it doesn't help one bit, the loop does p_bgr24 += 3, so p_bgr24 will
point to a non-aligned memory location somewhen (actually every 3 out of
4 loop iterations) and you can't use a (uint32_t *) cast on it.
You can either copy the 3 bytes individually or use the AV_WN32 macro.

> >> +    /* Dest is YV12 buffers offset to top of slice. */
> >> +    dst[0] = priv->y + (dst_y * priv->mpi_stride);
> >> +    dst[1] = priv->u + ((dst_y * priv->mpi_stride) >> 2);
> >> +    dst[2] = priv->v + ((dst_y * priv->mpi_stride) >> 2);
> > 
> > I think this should be (dst_y >> 1) * dst_strides[1]
> > because
> > 1) dst_strides[1] and [2] are calculated anyway, so why not use it,
> > it gives the thing a proper name
> > 2) While it is a given here, it removes the assumption of dst_y and
> > mpi_stride being divisible by two (which is necessary to merge the two
> >>> 1 into one >> 2) and fewer assumptions are always good.
> 
> Can you explain/detail this a bit more.
> I'm afraid being lost with this hunk.

I mean you should reorder that code to the in this case equivalent but
IMO "better":
dststride[0] = priv->mpi_stride;
dststride[1] = priv->mpi_stride >> 1;
dststride[2] = priv->mpi_stride >> 1;

dst[0] = priv->y +  dst_y       * dststride[0];
dst[1] = priv->u + (dst_y >> 1) * dststride[1];
dst[2] = priv->u + (dst_y >> 1) * dststride[2];

> >> +    asm volatile(
> >> +        "pxor %%mm7, %%mm7\n\t"                // zero out %mm7
> >> +        "pcmpeqb %%mm4, %%mm4\n\t"             // %mm4 = 255's
> >> +        "movq (%3), %%mm5\n\t"        // %mm5 = alpha
> >> +        "cmp $255, %4\n\t"           // don't apply layer alpha if it's 100% opaque
> >> +        "je 42f\n\t"
> > 
> > Just use 0, not 42, it's needlessly confusing.
> 
> Note that I have absolutely no understanding of ASM, being MMX or not.
> I just tried to replace 42 by 0, as suggested and MPlayer just crashes.

The 42 here and the 42: label below? 42 is just a number that refers to
the label below, changing that and the label to 0 should not make any
difference at all, except being less confusing IMHO.

> >> +        :  "+r" (dst_byte),             // %0
> >> +        "+r" (dst_alpha)             // %1
> > 
> > I think most likely these must be early-clobber (&)
> 
> What does that mean ?

Whenever you are still using any of the input operands after having
modified any of the output operands you must mark the output as
early-clobber.
That is done by e.g. instead of
> "+r" (dst_byte)
writing instead
> "+&r" (dst_byte)
Probably just mindlessly changing all +r to +&r is ok, though it would
be slightly better to check if it is actually necessary.

> >> +        for (x = 0; x < (rw & ~7); x += 8)
> > 
> > Code above used - 7. It doesn't really matter if - 7 or & ~7, but it
> > should be the same everywhere.
> 
> Where have you see - 7 ?

In the check_opaque loop. Personally I think using - 7 is better, looks
nicer to me, is a character less and I think it is used most often e.g.
in FFmpeg. I think it needs x and rw to be signed though (no idea if
that is the case).

> >> +    int i, y, q = w / 8, r = w % 8;
> > 
> > Ugh, why use division?
> 
> What'd you prefer ? w >> 3 ?

Yes,
w / 8 -> w >> 3
w % 8 -> w & 7
(in the past gcc was even stupid enough to use two divisions to
calculate both w / 8 and w % 8, even though the CPU's div instruction
always calculates both in one go - note that even with only one division
instruction above shift and and-mask should be faster).