[MPlayer-dev-eng] [PATCH] x86_64 mmx/sse/3dnow optimisation support

Tue Oct 12 20:23:38 CEST 2004

On Tue, 12 Oct 2004 12:18:42 +0200
Reimar Döffinger <Reimar.Doeffinger at stud.uni-karlsruhe.de> wrote:

> Hi,
> 
> > -#ifdef ARCH_X86
> > +#if defined(ARCH_X86) || defined(ARCH_X86_64)
> 
> There are lots of these. From what I know, I find it incorrect to
> define X86_64 as a separate architecture, those are "only" 64bit
> extensions after all...
> Think you would need to change less if you put the x86_64 case to the 
> x86 case, and just adding something like:
> 
> if ("$host_arch" == x86_64); then
> _def_arch += "#define ARCH_X86_64 1\n$_def_arch"
> fi
> 
> so that both ARCH_X86 and ARCH_X86_64 are defined then.

Sure I could do this. But I thought It would be clearer to have
distinct ARCH.
I personnaly prefer considering x86_64 as a really different arch
(having 64 bits addressing instead of 32 bits, is a big enough
difference IMHO).
Any other one with an opinion about diffining or not ARCH_X86 for
x86_64 traget ?

> > -       unsigned char *u, unsigned char *v, int w, int us, int vs)
> > +       unsigned char *u, unsigned char *v, int w, long us, long vs)
> >  {
> >         asm volatile (""
> > -               "pushl %%ebp \n\t"
> > -               "movl 4(%%edx), %%ebp \n\t"
> > -               "movl (%%edx), %%edx \n\t"
> > +               "push %%"REG_BP" \n\t"
> > +#ifdef ARCH_X86_64
> > +               "mov %6, %%"REG_BP" \n\t"
> > +#else
> > +               "movl 4(%%"REG_d"), %%"REG_BP" \n\t"
> > +               "movl (%%"REG_d"), %%"REG_d" \n\t"
> > +#endif
> [...]
> > -               : "S" (y), "D" (dst), "a" (u), "b" (v), "d" (&us),
> > "c" (w/16)
> > +               : "S" (y), "D" (dst), "a" (u), "b" (v),
> > "c" (w/16),
> > +#ifdef ARCH_X86_64
> > +               "d" (us), "r" (vs)
> > +#else
> > +               "d" (&us)
> > +#endif
> 
> Weird. Why was this done this way? Because your X86_64 version works
> on my AMD Athlon, too... I think this difference shouldn't be there...

I thought it was done that way because all registers were allready used.
In fact esp and ebp was not used but I thought they couldn't be used.
I used anoter way on x86_64 because of the bigger number of GPR.
I will do some more tests on x86 and hopefully remove the old way to
do this.

> > @@ -48,18 +48,18 @@
> >  #undef HAVE_MMX
> >  #undef HAVE_MMX2
> >  #undef HAVE_3DNOW
> > -#undef ARCH_X86
> > +
> > +#ifndef CAN_COMPILE_X86_ASM
> > 
> >  #ifdef COMPILE_C
> >  #undef HAVE_MMX
> >  #undef HAVE_MMX2
> >  #undef HAVE_3DNOW
> > -#undef ARCH_X86
> >  #define RENAME(a) a ## _C
> >  #include "osd_template.c"
> >  #endif
> > 
> > -#ifdef CAN_COMPILE_X86_ASM
> > +#else
> > 
> >  //X86 noMMX versions
> >  #ifdef COMPILE_C
> > @@ -67,7 +67,6 @@
> >  #undef HAVE_MMX
> >  #undef HAVE_MMX2
> >  #undef HAVE_3DNOW
> > -#define ARCH_X86
> >  #define RENAME(a) a ## _X86
> >  #include "osd_template.c"
> >  #endif
> > @@ -78,7 +77,6 @@
> >  #define HAVE_MMX
> >  #undef HAVE_MMX2
> >  #undef HAVE_3DNOW
> > -#define ARCH_X86
> >  #define RENAME(a) a ## _MMX
> >  #include "osd_template.c"
> >  #endif
> 
> I don't think this is correct. Without #undef ARCH_X86 you won't get
> the pure C versions of some functions AFAIK... You should add #undef 
> ARCH_X86_64 instead.

No. The pure C version will only be built when CAN_COMPILE_X86_ASM is
not defined, which mean that neither ARCH_X86 nor ARCH_X86_64 is
defined at that point.

> Also, have you checked that
> +#ifndef CAN_COMPILE_X86_ASM doesn't break runtime cpudetection?

I tested runtime cpu detection without any problem.

> What are the changes in libvo/osd_template.c for?

In 64 bits mode, x86_64 can't access to upper halves (ah, ch...)
So I modified slightly the algorithm, but as it add 2 shift
instruction, I kept the original algorithm for x86.

> > @@ -2582,7 +2581,8 @@
> >                         int srcStride1, int srcStride2,
> >                         int srcStride3, int dstStride)
> >  {
> > -    unsigned y,x,w,h;
> > +    unsigned y,w,h;
> > +    unsigned long x;
> 
> Huh? I'd say either change them all or modify only

Only x is used for 64 bits addressing in asm code. The other
can still be 32 bits without problem.
Anyway, it's probably cleaner to change them all, so I modified
this.

> > : "+r" (x)
> 
> > +#undef REAL_MOVNTQ
> >  #undef MOVNTQ
> >  #undef PAVGB
> >  #undef PREFETCH
> > @@ -54,29 +55,30 @@
> >  #endif
> > 
> >  #ifdef HAVE_MMX2
> > -#define MOVNTQ(a,b) "movntq " #a ", " #b " \n\t"
> > +#define REAL_MOVNTQ(a,b) "movntq " #a ", " #b " \n\t"
> >  #else
> > -#define MOVNTQ(a,b) "movq " #a ", " #b " \n\t"
> > +#define REAL_MOVNTQ(a,b) "movq " #a ", " #b " \n\t"
> >  #endif
> > +#define MOVNTQ(a,b)  REAL_MOVNTQ(a,b)
> 
> You added a lot of such REAL_* defines to postproc/swscale_template.c,
> what are they there for? They don't seem to be used anywhere...

This is because I then use something like MOVNTQ(%%REGa,%%REGb).
REGa is also a macro, and it you try to use REAL_MOVNTQ(%%REGa,%%REGb),
the REGa macro won't be substituted. That's why one more indirection
level is required.
I used this because this is the simplest way I found, and because
something similar was already used in MPlayer.

> Reducing the number of changed code line as far as possible and maybe 

I know this, and I already tried to reduce it as much as possible.

> also adding comments on the non-obvious part/changes will greatly 
> increase the probability of getting this applied.

Unfortunatly, after writting this code, everything seem obvious ;-)
Sorry, but I don't really see what I could comment.

> Btw: I'd expect there was some reason why "addl"/"popl" etc. were used
> in the original code and not "add"/"pop" etc. Can somebody explain?

They were there to explicitly show that this was a 32 bits operation,
But 'as' is smart enough to deduce the operation size from the type
of operand (eax => 32 bits, rax => 64 bits).
Moreover this modification (addl => add) was suggested by Michael on
ffmpeg-devel, and he accepted my patch for ffmpeg, so I can't see how
it could be wrong ;-)

Here is an updated version of my patch. It also fix a small bug
there was in my first version (an int to long cast was missing in
swscale causing a segfault).

Aurel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: MPlayer-x86_64.diff.bz2
Type: application/octet-stream
Size: 18304 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/attachments/20041012/563db092/attachment.obj>