[MPlayer-dev-eng] [PATCH] VF Overlay
Reimar Döffinger
Reimar.Doeffinger at gmx.de
Sat Aug 8 08:28:08 CEST 2009
On Sat, Aug 08, 2009 at 12:25:11AM +0200, Benjamin Zores wrote:
> >> +#define check_opaque(type) \
> >> + if (*(type*)(p + x)) { \
> >> + if (slice_y1 == -2) \
> >> + slice_y1 = y; \
> >> + else \
> >> + slice_y2 = y; \
> >> + x = row_stride; \
> >> + break; \
> >> + }
> >> +
> >> + for (y = 0; y < h; y++) {
> >> + for (x = 0; x < row_stride - 7; x += 8)
> >> + check_opaque(uint64_t);
> >> + for (; x < row_stride - 3; x += 4)
> >> + check_opaque(uint32_t);
> >> + for (; x < row_stride - 1; x += 2)
> >> + check_opaque(uint16_t);
> >
> > I find it very hard to imagine that the extra uint32_t case gains even the
> > slightest bit of speed.
>
> What does that mean ?
Is it clear what the code does?
It checks if a certain line contains any alpha value != 0.
As an optimization, it does that not byte-by-byte but checks 8
values at once (uint64_t).
Of course, if the width is not divisible by 8, that leaves a few
pixels you can't check like that.
That's what the uint32_t and uint16_t cases are for, but which check
only 4 and 2 pixels at once.
But since the 8 pixels at once loop already checks all except at most
the 6 last pixels, the check_opaque(uint32_t) loop will be done at most
once, thus save only a few instructions over checking all of those last
pixels in the check_opaque(uint16_t) loop, but adding an extra check and
branch.
Or to make matters short, IMO the code should at most be
> >> + for (x = 0; x < row_stride - 7; x += 8)
> >> + check_opaque(uint64_t);
> >> + for (; x < row_stride - 1; x += 2)
> >> + check_opaque(uint16_t);
And to avoid the code duplication, I'd suggest something like (speed
should probably be tested, though I see no reason why it should be
significantly slower):
uint64_t v = 0;
for (x = 0; !v && row_stride - 7; x += 8)
v = *(uint64_t *)(p + x);
for (; !v && row_stride - 1; x += 2)
v = *(uint16_t *)(p + x);
if (v) {
if (slice_y1 == -2)
slice_y1 = y;
else
slice_y2 = y;
}
> > Also the macro is again used to hide code duplication, not sure if there
> > is a nice way to avoid it, but it should at least be tried.
>
> Isn't that the goal of macros (avoiding code duplication) ?
I call it "hiding code duplication" because at least in the binary the
code is still duplicated, and also because as above code shows there
is no need for a macro to avoid the duplication.
> >> + i = priv->h != priv->mpi_h ? 4 : 0;
> >> + do {
> >> + dst_y = ry = av_clip((orig_y - i) & ~1, 0, priv->h);
> >> + dst_h = rh = av_clip((orig_h + 1 + i*2) & ~1, 0, priv->h - ry);
> >
> > Is FFALIGN appropriate for any of these? A comment what exactly it
> > calculates would be good IMO.
>
> Can't see where ...
Well,
(orig_y - i) & ~1
could be written as
FFALIGN(orig_y - i - 1, 2)
and
orig_h + 1 + i*2
as
FFALIGN(orig_h + i*2, 2)
If that makes sense depends on the meaning of that code...
> Also, I unfortunately have no idea what it tries to calculate.
>
> >> + for (i = priv->w * rh; i>0; i--, p_bgr32 += 4, p_alpha++, p_bgr24 += 3) {
> >> + /* Moving 32 bits is faster than 3 separate assignments (or one 16
> >> + * bit and and one 8 bit move). The BGR24 buffer has one extra byte
> >> + * allocated to prevent an overrun.
> >> + */
> >> + *(uint32_t *)p_bgr24 = *(uint32_t *)p_bgr32;
> >
> > You can't do that, p_bgr24 is not aligned.
>
> So, am I suppose to declare p_bgr24 as:
>
> DECLARE_ALIGNED(32, uint8_t, *p_bgr24);
>
> instead of:
>
> uint8_t *p_bgr24
>
> Is that so ?
No, it doesn't help one bit, the loop does p_bgr24 += 3, so p_bgr24 will
point to a non-aligned memory location somewhen (actually every 3 out of
4 loop iterations) and you can't use a (uint32_t *) cast on it.
You can either copy the 3 bytes individually or use the AV_WN32 macro.
> >> + /* Dest is YV12 buffers offset to top of slice. */
> >> + dst[0] = priv->y + (dst_y * priv->mpi_stride);
> >> + dst[1] = priv->u + ((dst_y * priv->mpi_stride) >> 2);
> >> + dst[2] = priv->v + ((dst_y * priv->mpi_stride) >> 2);
> >
> > I think this should be (dst_y >> 1) * dst_strides[1]
> > because
> > 1) dst_strides[1] and [2] are calculated anyway, so why not use it,
> > it gives the thing a proper name
> > 2) While it is a given here, it removes the assumption of dst_y and
> > mpi_stride being divisible by two (which is necessary to merge the two
> >>> 1 into one >> 2) and fewer assumptions are always good.
>
> Can you explain/detail this a bit more.
> I'm afraid being lost with this hunk.
I mean you should reorder that code to the in this case equivalent but
IMO "better":
dststride[0] = priv->mpi_stride;
dststride[1] = priv->mpi_stride >> 1;
dststride[2] = priv->mpi_stride >> 1;
dst[0] = priv->y + dst_y * dststride[0];
dst[1] = priv->u + (dst_y >> 1) * dststride[1];
dst[2] = priv->u + (dst_y >> 1) * dststride[2];
> >> + asm volatile(
> >> + "pxor %%mm7, %%mm7\n\t" // zero out %mm7
> >> + "pcmpeqb %%mm4, %%mm4\n\t" // %mm4 = 255's
> >> + "movq (%3), %%mm5\n\t" // %mm5 = alpha
> >> + "cmp $255, %4\n\t" // don't apply layer alpha if it's 100% opaque
> >> + "je 42f\n\t"
> >
> > Just use 0, not 42, it's needlessly confusing.
>
> Note that I have absolutely no understanding of ASM, being MMX or not.
> I just tried to replace 42 by 0, as suggested and MPlayer just crashes.
The 42 here and the 42: label below? 42 is just a number that refers to
the label below, changing that and the label to 0 should not make any
difference at all, except being less confusing IMHO.
> >> + : "+r" (dst_byte), // %0
> >> + "+r" (dst_alpha) // %1
> >
> > I think most likely these must be early-clobber (&)
>
> What does that mean ?
Whenever you are still using any of the input operands after having
modified any of the output operands you must mark the output as
early-clobber.
That is done by e.g. instead of
> "+r" (dst_byte)
writing instead
> "+&r" (dst_byte)
Probably just mindlessly changing all +r to +&r is ok, though it would
be slightly better to check if it is actually necessary.
> >> + for (x = 0; x < (rw & ~7); x += 8)
> >
> > Code above used - 7. It doesn't really matter if - 7 or & ~7, but it
> > should be the same everywhere.
>
> Where have you see - 7 ?
In the check_opaque loop. Personally I think using - 7 is better, looks
nicer to me, is a character less and I think it is used most often e.g.
in FFmpeg. I think it needs x and rw to be signed though (no idea if
that is the case).
> >> + int i, y, q = w / 8, r = w % 8;
> >
> > Ugh, why use division?
>
> What'd you prefer ? w >> 3 ?
Yes,
w / 8 -> w >> 3
w % 8 -> w & 7
(in the past gcc was even stupid enough to use two divisions to
calculate both w / 8 and w % 8, even though the CPU's div instruction
always calculates both in one go - note that even with only one division
instruction above shift and and-mask should be faster).
More information about the MPlayer-dev-eng
mailing list