[MPlayer-dev-eng] [PATCH] replacement for internal mpg123 fork (mp3lib), what is performance?

Sun May 30 11:58:18 CEST 2010

On Sun, May 30, 2010 at 10:52:14AM +0200, Thomas Orgis wrote:
> Well, it's faster than plain C, for one. Now, with unlimited time and
> will, you could implement every possible combination of instructions.
> Yes, our assembler code is fixed to either 32 bit or 64 bit -- it uses
> the respective instruction set explicitly. It maximizes portability
> between build environments; mpg123 shall not just work with gcc, it
> works with other compilers, too -- including the assembly optimizations.

Since AFAICT you use the preprocessor for the asm code anyway, MPlayer's
64-bit compatible code should just work as well, as long as you add a
special 64-bit function prologue/epilogue.
Also, which compilers specifically does the mpg123 code work with that
MPlayer's does not?

> While porting over the SSE code to x86-64 is a logical thing to do, as
> the instruction set includes SSE2, it did not occur to me that one
> should also take over the other legacy instruction subsets.
> Did AMD further improve the 3DNow unit after dropping this technology
> in favour of adopting SSE? I see no reason to assume that it's worth it
> to make 3DNow work on x86-64, only to see that this architecture is
> optimized for SSE (2) and that the latter really works better.

I doubt there's a point speed-wise, but when you hit strange performance
cases like just now you could at least test all cases with the same build
on the same computer to establish a baseline.

> > Conclusion: You know that 32 bit sse is faster than 32 bit 3dnowext.
> > You don't know (you can only guess) how the 32 bit SSE and 3dnow code
> > would behave if compiled for 64 bit.
> 
> Well, then... let's stop guessing: Would you be so kind and provide
> numbers on how fast MMX/3DNow/3DNowExt/SSE are in MPlayer's mp3lib on
> x86-64? User CPU time for
> decoding the whole album http://www.jamendo.com/de/download/album/7328
> would be preferable.

time ./mplayer -nocache -ao pcm:file=/dev/null:fast *.mp3

Default:
real    0m5.263s
user    0m4.608s
sys     0m0.080s

After porting the 3dnow code:
real    0m5.928s
user    0m5.268s
sys     0m0.100s

So obviously the auto-selection is making a bad choice on modern x86 CPUs.
Starting from this bad auto-selection:

Forcing dct64_sse:
real    0m5.759s
user    0m5.084s
sys     0m0.100s

Forcing dct64_MMX:
real    0m6.188s
user    0m5.488s
sys     0m0.116s

Forcing dct64_MMX_3dnow:
real    0m6.026s
user    0m5.332s
sys     0m0.112s

Forcing dct64_MMX_3dnowex:
real    0m5.914s
user    0m5.244s
sys     0m0.108s
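
For readers not familiar with how such a choice is made and overridden,
here is a minimal sketch of capability-based selection with a forced
override. It is not MPlayer's actual code: the capability flags, the
table, select_dct64() and the "dct64_c" fallback name are hypothetical,
the function bodies are stubs, and the preference order is only an
assumption for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical capability bits, a stand-in for real CPU detection. */
enum { CAP_MMX = 1, CAP_3DNOW = 2, CAP_3DNOWEXT = 4, CAP_SSE = 8 };

typedef void (*dct64_fn)(float *out, const float *in);

/* Stub bodies: each only tags out[0] so the chosen variant is observable. */
static void dct64_c_stub(float *out, const float *in)       { (void)in; out[0] = 0.0f; }
static void dct64_mmx_stub(float *out, const float *in)     { (void)in; out[0] = 1.0f; }
static void dct64_3dnow_stub(float *out, const float *in)   { (void)in; out[0] = 2.0f; }
static void dct64_3dnowex_stub(float *out, const float *in) { (void)in; out[0] = 3.0f; }
static void dct64_sse_stub(float *out, const float *in)     { (void)in; out[0] = 4.0f; }

typedef struct {
    const char *name;   /* label used for forcing a variant */
    dct64_fn    fn;     /* implementation to call */
    unsigned    needs;  /* capability bits the variant requires */
} dct64_variant_t;

/* Assumed preference order (illustrative only): most preferred first. */
static const dct64_variant_t variants[] = {
    { "dct64_sse",         dct64_sse_stub,     CAP_SSE      },
    { "dct64_MMX_3dnowex", dct64_3dnowex_stub, CAP_3DNOWEXT },
    { "dct64_MMX_3dnow",   dct64_3dnow_stub,   CAP_3DNOW    },
    { "dct64_MMX",         dct64_mmx_stub,     CAP_MMX      },
    { "dct64_c",           dct64_c_stub,       0            },
};

/* Return the forced variant if it exists and the CPU supports it,
 * otherwise the first supported entry in the preference table. */
static const dct64_variant_t *select_dct64(unsigned caps, const char *forced)
{
    size_t i, n = sizeof(variants) / sizeof(variants[0]);

    if (forced) {
        for (i = 0; i < n; i++)
            if (strcmp(variants[i].name, forced) == 0 &&
                (variants[i].needs & ~caps) == 0)
                return &variants[i];
        /* Unknown or unsupported name: fall back to auto-selection. */
    }
    for (i = 0; i < n; i++)
        if ((variants[i].needs & ~caps) == 0)
            return &variants[i];
    return &variants[n - 1]; /* not reached: the last entry needs nothing */
}
```

With a table like this, testing every variant with the same build on the
same machine only requires passing a different name to select_dct64().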

Now staying with dct64_sse and changing the dct36_func function:

dct36:
real    0m5.308s
user    0m4.568s
sys     0m0.164s

dct36_3dnow:
real    0m5.797s
user    0m5.104s
sys     0m0.120s

dct36_3dnowex:
real    0m5.801s
user    0m5.160s
sys     0m0.076s

WTF? The asm functions are vastly slower than pure C?
Do we have that issue on 32 bit as well?
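
One way to check whether the routines themselves are slower, rather than
some side effect of the whole-program run, would be to time each
candidate in isolation. A minimal sketch, assuming a POSIX system with
getrusage(); kernel_fn, dummy_kernel(), user_seconds() and time_kernel()
are hypothetical names, and dummy_kernel() is only a placeholder for the
real dct36 variants:

```c
#include <sys/resource.h>
#include <sys/time.h>

typedef void (*kernel_fn)(float *out, const float *in);

/* Placeholder workload standing in for dct36 / dct36_3dnow etc. */
static void dummy_kernel(float *out, const float *in)
{
    int i;
    for (i = 0; i < 36; i++)
        out[i] = in[i] * 0.5f + in[35 - i];
}

/* User CPU time consumed by this process so far, in seconds
 * (the same quantity that "user" in time(1) reports). */
static double user_seconds(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
}

/* Call fn repeatedly and return the user CPU time spent doing so. */
static double time_kernel(kernel_fn fn, long iterations)
{
    float in[36], out[36];
    volatile float sink = 0.0f; /* keeps the compiler from dropping the calls */
    double start, end;
    long n;
    int i;

    for (i = 0; i < 36; i++)
        in[i] = (float)i;

    start = user_seconds();
    for (n = 0; n < iterations; n++) {
        fn(out, in);
        sink += out[n % 36];
    }
    end = user_seconds();
    (void)sink;
    return end - start;
}
```

Calling time_kernel() once per variant in the same process gives numbers
that are at least comparable with each other, although it will not show
cache or stack effects that only appear inside the full decoder.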

> > The standalone assembler also has the disadvantage that you can't inline
> > the functions even when compiling for a fixed CPU
> 
> Since we aren't talking about small functions that are called
> repeatedly in some inner loop, but about significant pieces of work
> being done en bloc, I would like to see some proof that inlining
> actually would give significant improvement.

Unlikely; still, it is going to be a disadvantage.
It might make more of a difference on Win64, since you have to push a whole
lot of SSE2 registers onto the stack and restore them again.

> Apart from valid and debatable points on mpg123's approach on
> optimizations... do you have an idea about the anomalies I observed
> regarding variance of mp3lib performance, and subtle effects of
> unwitting changes in code in general?

Not much, but I'd really start with cleaning up the stack issues
with the 3dnow code; they might cause cache issues that can easily
have that kind of effect.