[MPlayer-dev-eng] [PATCH] replacement for internal mpg123 fork (mp3lib), what is performance?

Sun May 30 08:29:16 CEST 2010

On Sun, May 30, 2010 at 02:08:52AM +0200, Thomas Orgis wrote:
> Are you talking only about the dct64 or did you also look at the synth?

I am talking about the dct36, but dct64 has similar things. I do not know
about synth.

> > Argh! And it doesn't even compile on x86_64 (./configure --with-cpu=3dnow).
> 
> It was never intended to build on x86-64. We have improved SSE code by
> Taihei Monma for that platform, including special variants for mono/stereo,
> accurate rounding, different sample formats.

How do you maintain the 3DNow code if it doesn't even compile (at least
it doesn't on any of my standard systems)?

> Do you suggest that 3DNow would be a good choice for x86-64? I mean, I
> could somehow understand that one might want to use 3DnowExt, but our
> SSE code works on any x86-64 CPU, and it's enabling mpg123 (the
> console app) to decode faster than mplayer anyway

If the MPlayer code didn't exist, how would you ever find out?
If your x86_64 code was 10% slower than the other code you have, would
you even find out or would you never notice because the 32 bit code is
slower for other reasons and you simply can't do a fair 1:1 comparison
because the code does not even compile?

> mpg123-32 --cpu 3dnowext: 9.8 seconds
> mpg123-32 --cpu sse:      9.2 seconds
> mpg123-64 (--cpu sse):    8.6 seconds
> mplayer-64:               9.2 seconds
> 
> I see 64 bit SSE as a clear winner here... I wouldn't bet on 64bit
> 3DNowExt beating that in mpg123.

Conclusion: You know that 32 bit SSE is faster than 32 bit 3DNowExt.
You don't know (you can only guess) how the 32 bit SSE and 3dnow code
would behave if compiled for 64 bit.
And the 32 and 64 bit SSE code isn't the same either, so you can't even
use that to establish a baseline for what kind of speedup (or even slowdown)
just 32 vs. 64 bit gives on its own.
The standalone assembler also has the disadvantage that you can't inline
the functions even when compiling for a fixed CPU (and in contrast to the
yasm code in FFmpeg it doesn't even give you features like a kind of
automatic register allocation and automatic entry/exit code generation).