[MPlayer-dev-eng] [PATCH] replacement for internal mpg123 fork (mp3lib), what is performance?
Reimar Döffinger
Reimar.Doeffinger at gmx.de
Sun May 30 11:58:18 CEST 2010
On Sun, May 30, 2010 at 10:52:14AM +0200, Thomas Orgis wrote:
> Well, it's faster than plain C, for one. Now, with unlimited time and
> will, you could implement every possible combination of instructions.
> Yes, our assembler code is fixed to either 32 bit or 64 bit -- it uses
> the respective instruction set explicitly. It maximizes portability
> between build environments; mpg123 shall not just work with gcc, it
> works with other compilers, too -- including the assembly optimizations.
Since AFAICT you use the preprocessor for the asm code anyway, MPlayer's
64-bit compatible code should just work as well, as long as you add a
special 64-bit function prologue/epilogue.
Also, which compilers specifically does the mpg123 code work with that
MPlayer's does not?
> While porting over the SSE code to x86-64 is a logical thing to do, as
> the instruction set includes SSE2, it did not occur to me that one
> should also take over the other legacy instruction subsets.
> Did AMD further improve the 3DNow unit after dropping this technology
> in favour of adopting SSE? I see no reason to assume that it's worth it
> to make 3DNow work on x86-64, only to see that this architecture is
> optimized for SSE (2) and that the latter really works better.
I doubt there's a point speed-wise, but when you hit strange performance
cases like just now you could at least test all cases with the same build
on the same computer to establish a baseline.
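
To make "same build, same computer" concrete, here is a minimal sketch of a
name-based override on top of a function-pointer dispatch, so every variant
can be forced at run time without recompiling. The signature, the variant
table and the helper names (select_dct36, run_selected) are assumptions for
illustration only, not mp3lib's actual interface, and the function bodies are
stand-ins for the real C and asm routines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature; the real mp3lib dct36 takes different arguments. */
typedef void (*dct36_fn)(const float *in, float *out);

/* Stand-in bodies: a real build would link the C and asm variants here. */
static void dct36_c(const float *in, float *out)       { out[0] = in[0]; }
static void dct36_3dnow(const float *in, float *out)   { out[0] = in[0]; }
static void dct36_3dnowex(const float *in, float *out) { out[0] = in[0]; }

struct variant { const char *name; dct36_fn fn; };

static const struct variant dct36_variants[] = {
    { "c",       dct36_c },
    { "3dnow",   dct36_3dnow },
    { "3dnowex", dct36_3dnowex },
};

/* Return the variant called `name`. Fall back to plain C when name is
 * NULL or unknown, so a typo never selects an arbitrary routine. */
static dct36_fn select_dct36(const char *name)
{
    size_t i;
    if (name != NULL)
        for (i = 0; i < sizeof(dct36_variants) / sizeof(dct36_variants[0]); i++)
            if (strcmp(name, dct36_variants[i].name) == 0)
                return dct36_variants[i].fn;
    return dct36_c;
}

/* Run the selected variant once. The name would typically come from a
 * command-line switch or environment variable (hypothetical). */
static void run_selected(const char *name, const float *in, float *out)
{
    dct36_fn fn = select_dct36(name);
    assert(fn != NULL);
    fn(in, out);
}
```

A real override would additionally have to check the CPU flags before handing
out a 3DNow! routine; the sketch leaves that out.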
> > Conclusion: You know that 32 bit sse is faster than 32 bit 3dnowext.
> > You don't know (you can only guess) how the 32 bit SSE and 3dnow code
> > would behave if compiled for 64 bit.
>
> Well, then... let's stop guessing: Would you be so kind and provide
> numbers on how fast MMX/3DNow/3DNowExt/SSE are in MPlayer's mp3lib on
> x86-64? User CPU time for
> decoding the whole album http://www.jamendo.com/de/download/album/7328
> would be preferable.
time ./mplayer -nocache -ao pcm:file=/dev/null:fast *.mp3
Default:
real 0m5.263s
user 0m4.608s
sys 0m0.080s
After porting the 3dnow code:
real 0m5.928s
user 0m5.268s
sys 0m0.100s
So obviously the auto-selection is making a bad choice on modern x86 CPUs.
Starting from this bad auto-selection:
Forcing dct64_sse:
real 0m5.759s
user 0m5.084s
sys 0m0.100s
Forcing dct64_MMX:
real 0m6.188s
user 0m5.488s
sys 0m0.116s
Forcing dct64_MMX_3dnow:
real 0m6.026s
user 0m5.332s
sys 0m0.112s
Forcing dct64_MMX_3dnowex:
real 0m5.914s
user 0m5.244s
sys 0m0.108s
Now staying with dct64_sse and changing the dct36_func function:
dct36:
real 0m5.308s
user 0m4.568s
sys 0m0.164s
dct36_3dnow:
real 0m5.797s
user 0m5.104s
sys 0m0.120s
dct36_3dnowex:
real 0m5.801s
user 0m5.160s
sys 0m0.076s
WTF? The asm functions are vastly slower than pure C?
Do we have that issue on 32 bit as well?
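
To put a number on "vastly", the user times of the three dct36 runs can be
turned into relative slowdowns against the plain C version. The sketch below
only redoes the arithmetic on the figures already quoted; the helper names
(slowdown_percent, print_slowdowns) are made up for illustration. Each figure
is a single run, so differences of a percent or two may well be noise.

```c
#include <assert.h>
#include <stdio.h>

/* Relative slowdown of `t` against `baseline`, in percent. */
static double slowdown_percent(double baseline, double t)
{
    assert(baseline > 0.0);
    return (t - baseline) / baseline * 100.0;
}

struct run { const char *name; double user_s; };

/* User times copied from the dct36 runs above (dct64_sse forced). */
static const struct run dct36_runs[] = {
    { "dct36",         4.568 },
    { "dct36_3dnow",   5.104 },
    { "dct36_3dnowex", 5.160 },
};

/* Print each run's slowdown relative to the first entry (plain C). */
static void print_slowdowns(void)
{
    size_t i;
    for (i = 0; i < sizeof(dct36_runs) / sizeof(dct36_runs[0]); i++)
        printf("%-14s %.3fs  %+5.1f%%\n", dct36_runs[i].name,
               dct36_runs[i].user_s,
               slowdown_percent(dct36_runs[0].user_s, dct36_runs[i].user_s));
}
```

With these inputs the two asm variants come out roughly 12 to 13 percent
slower in user time than the C dct36.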
> > The standalone assembler also has the disadvantage that you can't inline
> > the functions even when compiling for a fixed CPU
>
> Since we aren't talking about small functions that are called
> repeatedly in some inner loop, but about significant pieces of work
> being done en bloc, I would like to see some proof that inlining
> actually would give significant improvement.
Unlikely, but it is still a disadvantage.
It might make more of a difference on Win64, since there xmm6-xmm15 are
callee-saved, so you have to push a whole lot of SSE2 registers onto the
stack and restore them again.
> Apart from valid and debatable points on mpg123's approach on
> optimizations... do you have an idea about the anomalies I observed
> regarding variance of mp3lib performance, and subtle effects of
> unwitting changes in code in general?
Not much, but I'd really start with cleaning up the stack issues
with the 3dnow code; they might cause cache issues that can easily
have that kind of effect.