[MPlayer-dev-eng] [PATCH] replacement for internal mpg123 fork (mp3lib), what is performance?

Sun May 30 10:52:14 CEST 2010

On Sun, 30 May 2010 08:29:16 +0200,
Reimar Döffinger <Reimar.Doeffinger at gmx.de> wrote:

> On Sun, May 30, 2010 at 02:08:52AM +0200, Thomas Orgis wrote:
> > Are you talking only about the dct64 or did you also look at the synth?
> 
> I am talking about the dct36, but dct64 has similar things, I do not know
> about synth.

OK, both functions come from Syuuhei Kashiyama, so not surprising that
there is a common pattern. But apart from stylistic critique: Do we
have numbers on how efficient the dct36/64 of mpg123 and mp3lib are in
comparison? Perhaps there isn't a significant difference after all.
I don't have a setup right now where I can profile the 3DNow code with
mplayer (need to install the Sun compilers on some AMD machines, or
figure out how oprofile works...) -- you do have a proper profiling
setup for MPlayer? But then, apparently not on a 32 bit system...

> > > Argh! And it doesn't even compile on x86_64 (./configure --with-cpu=3dnow).
> > 
> > It was never intended to build on x86-64. We have improved SSE code by
> > Taihei Monma for that platform, including special variants for mono/stereo,
> > accurate rounding, different sample formats.
> 
> How do you maintain the 3DNow code if it doesn't even compile (at least
> it doesn't on any of my standard systems)?

It compiles on the platform it's intended for. See the ISO MPEG
compliance block on http://mpg123.org/ ? That's from a daily build run
on an Athlon XP.

> If the MPlayer code didn't exist, how would you ever find out?
> If your x86_64 code was 10% slower than the other code you have, would
> you even find out or would you never notice because the 32 bit code is
> slower for other reasons and you simply can't do a fair 1:1 comparison
> because the code does not even compile?

Well, it's faster than plain C, for one. Now, with unlimited time and
will, you could implement every possible combination of instructions.
Yes, our assembler code is fixed to either 32 bit or 64 bit -- it uses
the respective instruction set explicitly. That maximizes portability
between build environments: mpg123 should not just work with gcc, it
works with other compilers, too -- including the assembly optimizations.

While porting over the SSE code to x86-64 is a logical thing to do, as
the instruction set includes SSE2, it did not occur to me that one
should also take over the other legacy instruction subsets.
Did AMD further improve the 3DNow unit after dropping this technology
in favour of adopting SSE? I see no reason to assume that it's worth it
to make 3DNow work on x86-64, only to see that this architecture is
optimized for SSE (2) and that the latter really works better.

> Conclusion: You know that 32 bit sse is faster than 32 bit 3dnowext.
> You don't know (you can only guess) how the 32 bit SSE and 3dnow code
> would behave if compiled for 64 bit.

Well, then... let's stop guessing: Would you be so kind as to provide
numbers on how fast MMX/3DNow/3DNowExt/SSE are in MPlayer's mp3lib on
x86-64? User CPU time for decoding the whole album
http://www.jamendo.com/de/download/album/7328 would be preferable.
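
The measurement asked for here, user CPU time of one complete decoding
run, could be scripted roughly as follows. This is only a sketch for a
Unix-like system: `user_cpu_seconds` is a hypothetical helper name, and the
command to run (some MPlayer invocation over the album files) is left to
the caller.

```python
import resource
import subprocess


def user_cpu_seconds(cmd):
    """Run cmd to completion and return the user CPU time it consumed.

    Hypothetical helper (Unix only). It reads the accumulated user time of
    all waited-for child processes before and after the run, so it assumes
    that no other child process finishes in the meantime.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    # check=True raises CalledProcessError if the decoder exits with an error.
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    return after - before
```

A call such as
`user_cpu_seconds(["mplayer", "-benchmark", "-ao", "null", "-vo", "null", "track.mp3"])`
would time one file; the file name is a placeholder and the choice of
options is an assumption about what a suitable benchmark invocation
looks like.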

> And the 32 and 64 bit SSE code isn't the same either, so you can't even
> use that to establish a baseline for what kind of speedup (or even slowdown)
> just 32 vs. 64 bit gives on its own.

Well, we had times where the 32 bit code was faster than 64 bit, I'm
not so sure if that includes the availability of the new x86-64 SSE
routines... Ah, no. Archives are your friend:
http://sourceforge.net/mailarchive/forum.php?thread_name=20090403101557.3c8ab796%40sunscreen.local&forum_name=mpg123-devel

> The standalone assembler also has the disadvantage that you can't inline
> the functions even when compiling for a fixed CPU

Since we aren't talking about small functions that are called
repeatedly in some inner loop, but about significant pieces of work
being done en bloc, I would like to see some proof that inlining
actually would give significant improvement.

> (and in contrast to the
> yasm code in FFmpeg it doesn't even give you features like a kind of
> automatic register allocation and automatic entry/exit code generation).

Yes, there might be convenience in using a smart assembler ... one that
also gets the difference between x86 and x86-64 in SSE2 code and
automagically makes use of the extra SSE registers (in addition to the
normal ones). But then... there is the dream of compilers eventually
getting the hang of it, too, and starting to generate meaningful SSE
code. Some at least try.

Apart from valid and debatable points on mpg123's approach to
optimizations... do you have an idea about the anomalies I observed
regarding variance of mp3lib performance, and subtle effects of
unwitting changes in code in general?
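
One way to put numbers on such run-to-run variance before comparing two
builds is to repeat each measurement several times and set the difference
of the means against the spread. A minimal sketch: the function names and
the threshold `factor` are illustrative assumptions, and the check is a
rough rule of thumb rather than a formal significance test.

```python
import statistics


def summarize(times):
    """Return (mean, sample standard deviation) of repeated timings."""
    return statistics.mean(times), statistics.stdev(times)


def clearly_different(times_a, times_b, factor=2.0):
    """Crude check whether two sets of timings really differ.

    True only if the means are further apart than `factor` times the
    larger of the two run-to-run standard deviations.
    """
    mean_a, sd_a = summarize(times_a)
    mean_b, sd_b = summarize(times_b)
    return abs(mean_a - mean_b) > factor * max(sd_a, sd_b)
```

Each list would hold the CPU times of several runs of the same build on
the same input (at least two runs per list, since the sample standard
deviation is undefined for a single value).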

Alrighty then,

Thomas.