[MPlayer-dev-eng] possible bugs in vf_decimate filter

Tue Oct 17 01:54:59 CEST 2006

On Mon, 16 Oct 2006, Michael Niedermayer wrote:
>
>
>                 "decl %%ecx \n\t"
>
> id use a "cmp %%"REG_S", ...  here, some cpus have a dissike for inc/dec as
> inc/dec just change part of the flags which creates a dependancy to the
> previous flag value

The code is also using ecx without listing it as a clobbered register.  It
only works because there is nothing else in the function for gcc to put in
ecx, and gcc isn't smart enough to spill one of the hardcoded callee saved
registers, like edi, into ecx instead of to the stack.

Speaking of hard coded registers, why do that?  Why demand "S", "D", "b",
etc.  when one could just as easily use "r"?  And instead of demanding the
address of short out[4] in edx, why not use "m"(out[0]) instead of
"d"(out)?  That way gcc will give you something like "8(%esp)", which frees
up edx and doesn't require gcc to emit something like "lea 8(%esp), %edx"
to set the register up.

>                 "jnz 1b \n\t"
>                 "movq %%mm4, (%%"REG_d") \n\t"
>                 "emms \n\t"
>
> emms should be farther outside

That's what I thought too.  But when I did it, it became _much_ slower!  Take
a look at this code and see you notice what's wrong:

static int diff_to_drop_plane(int hi, int lo, float frac, unsigned char *old, unsigned char *new,
	int w, int h, int os, int ns)
{
        int x, y;
        int d, c=0, r=0;
        int t = (w/16)*(h/16)*frac;
        for (y = 0; y < h-7; y += 4) {
                for (x = 0; x < w-7; x += 4) {
                        d = diff(old+x+y*os, new+x+y*ns, os, ns);  /* 1 */
                        if (d > hi) goto done;
                        if (d > lo) {  c++; if (c > t) goto done; }
                }
        }
        r = 1;
done:
	asm volatile("emms");   /* 2 */
        return r;
}

I removed the "emms" from the asm block in the diff_MMX() function.  This code
ended up slowing the filter down by around 20%!  It has nothing do with the
extra variable 'r' or the goto.  If I add an extra emms right after the call to
diff at (1), it speeds up to around the original speed (even with the extra emms
as (2), the goto, and the variable 'r').

You can clearly see that there is no floating point between (1), where the
mmx diff function is called, and (2), where emms is run.  The variable 't'
is used, but it is an int, which was calculated before the loop using a
float, 'frac'.

There is no floating point in the C code, but look at the generated assembly!

        call    *diff
        cmpl    %eax, 212(%esp)
        flds    48(%esp)
        jge     .L112
        ffreep  %st(0)

You see the call to the diff_MMX function via the diff function pointer,
which is only used in that one place.  Then there is some floating point,
flds and ffreep, with NO emms!  The emms is in the asm code elsewhere, gcc
has just moved the floating point around so it's not in the right spot
anymore!

Using mmx in asm statements anywhere near floating point is very dangerous.
For example, this code will miscompile and print "nan" if you use -O3 (so
that mmxfunc() is inlined).  There is clearly not a missing emms here.  The
problem is that gcc leaves 'f' in a floating pointer register across the
inlined call to mmxfunc(), and the mmx code clears the fp regiters.

#include <math.h>
#include <stdio.h>
void mmxfunc(void) { asm("emms" : : : "st"); }
int main(void){
    float f = log(10.0);
    mmxfunc();
    printf("%f\n", f);
    return 0;
}