[MPlayer-dev-eng] [PATCH] SSE version of DCT64 for mp3lib

Thu Jun 22 15:43:20 CEST 2006

Hi,

Zuxy Meng wrote:

> Attached is dct64_sse, a replacement for dct64_MMX. About 60% faster
> on my Pentium III (not very exciting, though:-)). Good for Intel users
> because mp3lib only has dct64_3dnow and dct64_3dnowext.

60% faster is not bad!

> Don't know if it's faster than dct64_3dnowext on an AthlonXP or
> Athlon64. To be prudent, dct64_3dnowext is used when both 3DNow!Ext
> and SSE are supported.

I shall benchmark it before applying.

> ------------------------------------------------------------------------
> 
> --- mplayer/mp3lib/Makefile	2006-06-07 12:53:34.000000000 +0800
> +++ mplayer.new/mp3lib/Makefile	2006-06-21 16:03:38.000000000 +0800
> @@ -15,8 +15,8 @@
>  OBJS += decode_MMX.o dct64_MMX.o tabinit_MMX.o
>  SRCS += dct36_3dnow.c dct64_3dnow.c
>  OBJS += dct36_3dnow.o dct64_3dnow.o
> -SRCS += dct36_k7.c dct64_k7.c
> -OBJS += dct36_k7.o dct64_k7.o
> +SRCS += dct36_k7.c dct64_k7.c dct64_sse.c
> +OBJS += dct36_k7.o dct64_k7.o dct64_sse.o
>  endif
>  ifeq ($(TARGET_ARCH_POWERPC),yes)
>  ifeq ($(TARGET_ALTIVEC),yes)
> --- mplayer/mp3lib/sr1.c	2006-06-17 12:13:02.000000000 +0800
> +++ mplayer.new/mp3lib/sr1.c	2006-06-21 16:03:25.000000000 +0800
> @@ -392,6 +392,7 @@
>  extern void dct64_MMX(real *, real *, real *);
>  extern void dct64_MMX_3dnow(real *, real *, real *);
>  extern void dct64_MMX_3dnowex(real *, real *, real *);
> +extern void dct64_sse(real *, real *, real *);
>  void (*dct64_MMX_func)(real *, real *, real *);
>  
>  #include "cpudetect.h"
> @@ -434,6 +435,12 @@
>  	mp_msg(MSGT_DECAUDIO,MSGL_V,"mp3lib: using 3DNow! optimized decore!\n");
>      }
>      else
> +    if (gCpuCaps.hasSSE)
> +    {
> +	dct64_MMX_func = dct64_sse;
> +	mp_msg(MSGT_DECAUDIO,MSGL_V,"mp3lib: using SSE optimized decore!\n");
> +    }
> +    else
>      if (gCpuCaps.hasMMX)
>      {
>  	dct64_MMX_func = dct64_MMX;
> --- mplayer/mp3lib/decode_MMX.c	2006-06-07 12:53:34.000000000 +0800
> +++ mplayer.new/mp3lib/decode_MMX.c	2006-06-21 16:04:06.000000000 +0800
> @@ -13,7 +13,7 @@
>  
>  static unsigned long long attribute_used __attribute__((aligned(8))) null_one = 0x0000ffff0000ffffULL;
>  static unsigned long long attribute_used __attribute__((aligned(8))) one_null = 0xffff0000ffff0000ULL;
> -unsigned long __attribute__((aligned(8))) costab_mmx[] =
> +unsigned long __attribute__((aligned(16))) costab_mmx[] =
>  {
>  	1056974725,
>  	1057056395,
> --- mplayer/mp3lib/dct64_sse.c	1970-01-01 08:00:00.000000000 +0800
> +++ mplayer.new/mp3lib/dct64_sse.c	2006-06-21 15:57:53.000000000 +0800
> @@ -0,0 +1,437 @@
> +/*
> + * Discrete Cosine Tansform (DCT) for SSE
> + * Copyright (c) 2006 Zuxy MENG <zuxy.meng at gmail.com>
> + * based upon code from mp3lib/dct64.c, mp3lib/dct64_altivec.c
> + * and mp3lib/dct64_MMX.c
> + */
> +
> +/* NOTE: The following code is suboptimal! It can be improved (at least) by
> +
> +   1. Replace all movups by movaps. (Can Parameter c be always aligned on 
> +      a 16-byte boundary?)

According to GCC's documentation, it _looks_ like it's possible, and
that if you rely on default behaviour, you can count on the first
argument to be aligned. If you need more than one argument or smth
other than the first argument of the aligned, the only possibility I
know of is to do some padding (i.e. add some dummy arguments to
artificially align the parameters) It's AFAIK not possible to anything
other than that because it would break the ABI.

BTW: the pointer itself doesn't need to be aligned, it's just the
address it points to that needs to be aligned, which is the
responsibility of the caller.

from:
https://developer.apple.com/documentation/DeveloperTools/gcc-4.0.1/gcc/i386-and-x86_002d64-Options.html

-mpreferred-stack-boundary=num
    Attempt to keep the stack boundary aligned to a 2 raised to num
byte boundary. If -mpreferred-stack-boundary is not specified, the
default is 4 (16 bytes or 128 bits), except when optimizing for code
size (-Os), in which case the default is the minimum correct alignment
(4 bytes for x86, and 8 bytes for x86-64).

    On Pentium and PentiumPro, double and long double values should be
aligned to an 8 byte boundary (see -malign-double) or suffer
significant run time performance penalties. On Pentium III, the
Streaming SIMD Extension (SSE) data type __m128 suffers similar
penalties if it is not 16 byte aligned.

    To ensure proper alignment of this values on the stack, the stack
boundary must be as aligned as that required by any value stored on
the stack. Further, every function must be generated such that it
keeps the stack aligned. Thus calling a function compiled with a
higher preferred stack boundary from a function compiled with a lower
preferred stack boundary will most likely misalign the stack. It is
recommended that libraries that use callbacks always use the default
setting.

    This extra alignment does consume extra stack space, and generally
increases code size. Code that is sensitive to stack space usage, such
as embedded systems and operating system kernels, may want to reduce
the preferred alignment to -mpreferred-stack-boundary=2.

[..]

> +#if 0
> +    /* Reference C code */
> +
> +    /*
> +       Should run faster than x87 asm, given that the compiler is sane.
> +       However, the C code dosen't round with saturation (0x7fff for too
> +       large positive float, 0x8000 for too small negative float). You
> +       can hear the difference if you listen carefully.
> +    */

I'm surprised that current code doesn't have a C reference code....
;-) Good thinking!

Well, this would need some test on AMD64 to make sure it works on
these machines.

Unless someone disagrees, and provided that it works, I'll probably
apply this in a few days...

Guillaume