[MPlayer-dev-eng] [PATCH] replacement for internal mpg123 fork (mp3lib), what is performance?

Thomas Orgis thomas-forum at orgis.org
Sun May 30 02:08:52 CEST 2010

On Sat, 29 May 2010 18:44:37 +0200,
Reimar Döffinger <Reimar.Doeffinger at gmx.de> wrote:

> > I'd like to add that part of the issue may be that... Well...
> > The mpg123 3dnow code simply is going to be slower, because
> > 1) it calls femms _twice_ per dct function call
> > 2) it generates a stack frame (i.e. saves/changes ebp). MPlayer by default does not.
> > 3) it reserves 120 bytes on stack for local variables even though it never
> >    uses any on-stack variables.
> > 4) worse, it pushes the ebx and esi registers _after_ the stack increase,
> >    thus increasing cache pressure for no good reason at all.

... Thanks for noting these points. I did not work on the 3DNow code
apart from keeping it working -- seems there are some old sins left
over from its inception.
Are you talking only about the dct64 or did you also look at the synth?

> > Maybe I am missing some good reason for this, but so far I think that
> > the code, honestly spoken, is simply crap.

Well, you shouldn't judge it too harshly: It's what the MPlayer code
is derived from;-)

> Argh! And it doesn't even compile on x86_64 (./configure --with-cpu=3dnow).

It was never intended to build on x86-64. We have improved SSE code by
Taihei Monma for that platform, including special variants for mono/stereo,
accurate rounding, different sample formats.
Do you suggest that 3DNow would be a good choice for x86-64? I mean, I
could somehow understand that one might want to use 3DnowExt, but our
SSE code works on any x86-64 CPU, and it's enabling mpg123 (the
console app) to decode faster than mplayer anyway -- although the
margin can be slim (on my K6-3+ with 3DNowExt, it's 55.0 seconds
against 55.6 seconds user CPU time for decoding the test album I
indicated). I also compared a 32 bit mpg123 build (for Athlon XP) with a
64 bit build on an Opteron 2210, and with mplayer+mp3lib:

mpg123-32 --cpu 3dnowext: 9.8 seconds
mpg123-32 --cpu sse:      9.2 seconds
mpg123-64 (--cpu sse):    8.6 seconds
mplayer-64:               9.2 seconds

I see 64 bit SSE as a clear winner here... I wouldn't bet on 64bit
3DNowExt beating that in mpg123.
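
To put such timing gaps into proportion, a minimal sketch of the
arithmetic follows. The function name is made up for this illustration
and is not part of mpg123 or MPlayer; the inputs in the note below are
the user CPU times quoted above.

```c
/* Relative change of `candidate` against `baseline`, in percent.
 * Negative values mean the candidate needed less CPU time.
 * Hypothetical helper, for illustration only. */
double relative_change_percent(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}
```

With the Opteron numbers, relative_change_percent(9.2, 8.6) gives
roughly -6.5, so the 64 bit SSE build needs about 6.5 % less time than
mplayer-64. The K6-3+ pair, relative_change_percent(55.6, 55.0), comes
out at only about -1.1.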

But I have trouble diagnosing the performance behaviour of MPlayer
with regards to ad_mpg123. We already diagnosed the issue of inefficient
memory handling, which is improved in the mpg123 trunk / snapshots.
But there is still Diego's K6-3+, in final agreement with mine, that
doesn't really like the new ad_mpg123, decoding significantly slower
than stand-alone mpg123.

At the moment, I am still rather clueless about this... I did observe,
though, that the K6-3+ seems to be rather sensitive to code layout.
I observed a huge effect of a change in mpg123 1.8.0, which only
shuffled a bit on the storage of the function pointers used to select
optimized routines at runtime. It effectively makes the dynamic code as
fast as the static (--with-cpu=x86 against
--with-cpu=3dnowext_alone). But then, have a look at this comparison
of mpg123 1.6.4 and the current snapshot (where I had to fix the
3dnowext_alone build again, sorry). This is decoding the second track
of the Convergence album, timing results on the K6-3+ -- always using
3DNowExt decoding, but either via runtime or build-time choice:


real    0m6.045s
user    0m5.970s
sys     0m0.080s

real    0m4.922s
user    0m4.860s
sys     0m0.060s

real    0m5.016s
user    0m4.920s
sys     0m0.090s

real    0m5.010s
user    0m4.910s
sys     0m0.090s

Observe the insignificant difference between the two build variants of the
current version, and the gross hit the dynamic code takes with 1.6.4.
Then, compare to the deal on an Athlon XP:


real	0m3.921s
user	0m1.110s
sys	0m0.050s

real	0m2.404s
user	0m1.190s
sys	0m0.040s

OK, that one is a bit tight... but a tendency is appearing. Let's do the
whole album to drive the point home:


real	0m12.285s
user	0m11.750s
sys	0m0.360s

real	0m12.782s
user	0m12.430s
sys	0m0.350s

So here, the dynamic code wins over the build-time optimization! The
SSE build of 1.6.4 is as fast as 3DNowExt, the improved SSE code of
current mpg123 shaves off another half second for the whole album.
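
For readers who have not seen the two approaches side by side: the
runtime ("dynamic") choice goes through a function pointer that is set
once, while the build-time choice hardwires one routine. The following
is only a sketch under stated assumptions. All names are hypothetical,
CPU detection is replaced by an enum argument, and the "optimized"
routine is a plain C stand-in; it is not the code of mpg123 or MPlayer.

```c
#include <stddef.h>

/* All names are hypothetical; this is not code from mpg123 or MPlayer. */
typedef int (*synth_fn)(const float *in, short *out, size_t n);

/* Scale one sample to 16 bit and clip it, counting clipped samples. */
static short clip_sample(float s, int *clipped)
{
    s *= 32767.0f;
    if (s > 32767.0f)  { ++*clipped; return 32767; }
    if (s < -32768.0f) { ++*clipped; return -32768; }
    return (short)s;
}

/* Plain reference routine. */
static int synth_generic(const float *in, short *out, size_t n)
{
    int clipped = 0;
    for (size_t i = 0; i < n; ++i)
        out[i] = clip_sample(in[i], &clipped);
    return clipped;
}

/* Stand-in for a CPU-specific routine: same result, two samples per step. */
static int synth_unrolled(const float *in, short *out, size_t n)
{
    int clipped = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        out[i]     = clip_sample(in[i], &clipped);
        out[i + 1] = clip_sample(in[i + 1], &clipped);
    }
    if (i < n)
        out[i] = clip_sample(in[i], &clipped);
    return clipped;
}

enum cpu_kind { CPU_GENERIC, CPU_UNROLLED };

struct decoder { synth_fn synth; };

/* Runtime choice: store a pointer to the selected routine once. */
void decoder_select(struct decoder *d, enum cpu_kind cpu)
{
    d->synth = (cpu == CPU_UNROLLED) ? synth_unrolled : synth_generic;
}

/* Every call pays for one indirect jump through the stored pointer. */
int decoder_run(const struct decoder *d, const float *in, short *out, size_t n)
{
    return d->synth(in, out, n);
}

/* Build-time choice: the routine is fixed when compiling. */
#ifdef DEMO_STATIC_UNROLLED
#define SYNTH_STATIC synth_unrolled
#else
#define SYNTH_STATIC synth_generic
#endif

int decode_static(const float *in, short *out, size_t n)
{
    return SYNTH_STATIC(in, out, n);
}
```

Both paths produce the same samples. Where the pointer is stored and
how the surrounding code is laid out is exactly the kind of detail that
seemed to matter on the K6-3+.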

Well... that is just to tell the story about how we are dealing with
subtle effects besides any hand-crafting of assembly instructions. This
might not be about the CPU as such, but instead the glibc or gcc
version (my test machines have different Linux systems...), but in
effect, it's about the current system setup, not just about the code.

Now for something really freakish: On my Thinkpad X200 (Core2Duo, 64
bit system), I observe a change of the _mp3lib_ performance in mplayer
depending on a change in ad_mpg123.c. I kid you not, that is what I

Please see the attached new version of the patch. It includes
preprocessor action to select different configurations and ways to do
I/O to libmpg123 (using the latest snapshot of mpg123).
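
As a generic illustration of what such preprocessor switches look like,
here is a small sketch. The DEMO_* macros, the struct and the functions
are invented for this example; what the AD_MPG123_* macros in the patch
really select may well differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical switch, named for this sketch only. Comment it out to
 * get the other code path. */
#define DEMO_IO_PACKET

#define DEMO_PACKET_SIZE 4   /* tiny on purpose, to keep the example readable */

/* A trivial in-memory input source standing in for a demuxer or file. */
struct demo_source {
    const unsigned char *data;
    size_t size;
    size_t pos;
};

/* Copy up to `cap` bytes into `dst`; the preprocessor decides how much
 * is handed over per call. Returns the number of bytes copied. */
size_t demo_fetch(struct demo_source *src, unsigned char *dst, size_t cap)
{
    size_t left = src->size - src->pos;
    size_t want = cap < left ? cap : left;
#if defined(DEMO_IO_PACKET)
    /* Packet mode: never hand over more than one fixed-size packet. */
    if (want > DEMO_PACKET_SIZE)
        want = DEMO_PACKET_SIZE;
#endif
    memcpy(dst, src->data + src->pos, want);
    src->pos += want;
    return want;
}

/* Report which variant was compiled in. */
const char *demo_io_mode(void)
{
#if defined(DEMO_IO_PACKET)
    return "packet";
#else
    return "stream";
#endif
}
```

The point of such a layout is that both variants live in one source
file and can be compared by flipping a single #define and rebuilding.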

Specifically, the setup

#define AD_MPG123_CALLBACK
#define AD_MPG123_PACKET
/* #define AD_MPG123_SEEKBUFFER */

leads to the following measurement (on the Thinkpad X200):

mplayer-svn$ for i in mpg123 mp3; do echo $i;  time ./mplayer -ao pcm:file=/dev/null  -quiet -ac $i ../../convergence_-_points_of_view/*.mp3 > /dev/null ; done

real	0m6.170s
user	0m6.099s
sys	0m0.054s

real	0m6.107s
user	0m6.060s
sys	0m0.029s


#define AD_MPG123_CALLBACK
/* #define AD_MPG123_PACKET */
/* #define AD_MPG123_SEEKBUFFER */

gives that:

mplayer-svn$ for i in mpg123 mp3; do echo $i;  time ./mplayer -ao pcm:file=/dev/null  -quiet -ac $i ../../convergence_-_points_of_view/*.mp3 > /dev/null ; done

real	0m6.204s
user	0m6.131s
sys	0m0.048s

real	0m6.505s
user	0m6.448s
sys	0m0.042s

So... my intention was to investigate what slows down mpg123
decoding in mplayer... and now I managed to significantly slow down
mp3lib without touching its code! Can someone reproduce that (with gcc
4.3.3)? I consider any tuning of ad_mpg123 futile as long as we have
unexplained effects of such scale.
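
For anyone trying to reproduce this: single runs can be noisy, so one
common approach is to repeat a measurement and keep the best CPU time.
The sketch below only illustrates that idea with the standard clock()
function. The function names and the dummy workload are assumptions for
the example; the numbers in this mail come from plain `time` runs, not
from this code.

```c
#include <time.h>

typedef void (*workload_fn)(void);

/* Run `work` `runs` times and return the smallest CPU time in seconds
 * as reported by clock(). Returns -1.0 if `runs` is not positive. */
double best_cpu_seconds(workload_fn work, int runs)
{
    double best = -1.0;
    for (int i = 0; i < runs; ++i) {
        clock_t start = clock();
        work();
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Dummy workload for the example; a real test would decode audio here. */
void demo_workload(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 100000UL; ++i)
        sink += i;
}
```

A call like best_cpu_seconds(demo_workload, 5) then gives one number
per configuration that is less sensitive to a single unlucky run.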

I close with some profile data (collected via 'collect' of Sun Studio,
builds are done with gcc, though):

The first setup, mpg123, using packets:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
6,940     6,940      <Total>
3,035     3,035      III_dequantize_sample
1,925     1,925      <static>@0x33e79
0,913     0,913      dct36
0,440     4,553      do_layer3
0,132     0,132      memcpy
0,055     0,055      III_get_scale_factors_1
0,055     0,055      mxf_probe
0,055     0,066      synth_1to1_stereo_x86_64
0,044     0,044      fast_memcpy
0,033     0,033      dct12
0,033     0,033      __read_nocancel
0,022     0,022      ff_ac3_parse_header
0,011     0,033      ac3_eac3_probe
0,011     0,011      compute_bpf
0,011     0,011      dts_probe
0,011     0,011      dv_probe
0,011     0,077      generic_head_read
0,011     0,055      generic_read_frame_body
0,011     0,011      h261_probe
0,011     0,011      h263_probe
0,011     0,011      _int_free
0,011     0,011      ipmovie_probe
0,011     0,011      memset
0,011     0,011      mpegps_probe
0,011     0,011      __mul
0,011     0,011      nut_probe
0,011     0,110      plain_fullread
0,011     0,011      __select_nocancel
0,011     0,011      strcmp
0,011     0,011      __write_nocancel
0,        0,165      av_probe_input_format2
0,        4,784      decode_audio
0,        4,751      decode_audio
0,        4,531      decode_the_frame
0,        0,044      demux_audio_fill_buffer
0,        0,011      demux_info_print
0,        0,165      demux_open
0,        0,165      demux_open_stream
0,        0,055      ds_fill_buffer
0,        0,055      ds_get_packet
0,        0,055      ds_get_packet_pts
0,        0,011      __dvd
0,        0,033      fill_buffer
0,        0,011      fputs
0,        0,011      free
0,        0,143      get_next_frame
0,        0,011      init_audio
0,        0,011      init_best_audio_codec
0,        0,011      init_layer3
0,        0,011      _IO_new_do_write
0,        0,011      _IO_new_file_write
0,        0,011      _IO_new_file_xsputn
0,        0,165      lavf_check_file
0,        0,165      lavf_check_preferred_file
0,        0,011      __libc_start_main
0,        4,982      main
0,        4,751      mpg123_decode
0,        0,011      mpg123_init
0,        0,011      mp_input_get_cmd
0,        0,011      mp_msg
0,        0,011      __mptan
0,        0,011      new_do_write
0,        0,011      preinit
0,        0,099      read_callback
0,        0,143      read_frame
0,        0,011      reinit_audio_chain
0,        0,011      set_pointer
0,        0,033      stream_fill_buffer
0,        0,044      stream_read
0,        0,011      tan
0,        0,011      tanMp

...and this is mp3lib:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
6,764     6,764      <Total>
3,332     3,332      III_dequantize_sample
1,034     1,276      synth_1to1_MMX
0,935     0,935      dct36
0,627     0,627      dct64_sse
0,429     4,762      do_layer3
0,077     0,077      memcpy
0,055     0,055      ff_ac3_parse_header
0,044     0,044      dct12
0,033     0,033      mxf_probe
0,022     0,022      fast_memcpy
0,022     0,022      ipmovie_probe
0,022     0,022      __mul
0,011     0,011      analyze
0,011     0,055      ds_fill_buffer
0,011     0,011      __exp1
0,011     0,011      III_get_scale_factors_1
0,011     0,011      _int_free
0,011     0,011      malloc
0,011     4,850      MP3_DecodeFrame
0,011     0,011      mpegps_probe
0,011     0,011      nsv_probe
0,011     0,011      parse_codec_cfg
0,011     0,011      stream_fill_buffer
0,011     0,011      synth_1to1
0,        0,055      ac3_eac3_probe
0,        0,143      av_probe_input_format2
0,        0,022      __c32
0,        4,916      decode_audio
0,        0,044      demux_audio_fill_buffer
0,        0,143      demux_open
0,        0,143      demux_open_stream
0,        0,077      demux_read_data
0,        0,011      __ieee754_pow
0,        0,011      _int_realloc
0,        0,143      lavf_check_file
0,        0,143      lavf_check_preferred_file
0,        5,070      main
0,        0,033      MP3_Init
0,        0,011      mpegts_probe
0,        0,022      __mptan
0,        0,011      pow
0,        0,011      realloc
0,        0,022      stream_read
0,        0,022      tan
0,        0,022      tanMp

Now the second variant:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
6,929     6,929      <Total>
3,244     3,244      III_dequantize_sample
1,848     1,848      <static>@0x33e79
0,891     0,891      dct36
0,242     4,597      do_layer3
0,110     0,121      synth_1to1_stereo_x86_64
0,088     0,088      memcpy
0,077     0,077      III_get_scale_factors_1
0,066     0,066      mxf_probe
0,055     0,055      ff_ac3_parse_header
0,044     0,044      fast_memcpy
0,044     0,044      _int_malloc
0,033     0,033      __read_nocancel
0,033     0,033      <Unknown>
0,022     0,022      dct12
0,022     0,022      ipmovie_probe
0,011     0,066      ac3_eac3_probe
0,011     0,011      compute_bpf
0,011     0,121      demux_audio_fill_buffer
0,011     0,176      demux_read_data
0,011     0,011      __ieee754_pow
0,011     4,806      mpg123_decode
0,011     0,198      read_frame
0,011     0,011      strcmp
0,011     0,011      unrar_exec_get
0,011     0,011      __write_nocancel
0,        0,154      av_probe_input_format2
0,        4,850      decode_audio
0,        4,806      decode_audio
0,        4,586      decode_the_frame
0,        0,011      decode_update
0,        0,154      demux_open
0,        0,154      demux_open_stream
0,        0,121      ds_fill_buffer
0,        0,033      fill_buffer
0,        0,011      fwrite
0,        0,132      generic_head_read
0,        0,044      generic_read_frame_body
0,        0,209      get_next_frame
0,        0,033      gettimeofday
0,        0,022      GetTimer
0,        0,011      GetTimerMS
0,        0,011      init
0,        0,011      init_audio
0,        0,011      init_best_audio_codec
0,        0,011      init_layer3_gainpow2
0,        0,011      init_layer3_stuff
0,        0,011      _IO_new_do_write
0,        0,011      _IO_new_file_write
0,        0,011      _IO_new_file_xsputn
0,        0,154      lavf_check_file
0,        0,154      lavf_check_preferred_file
0,        0,011      __libc_start_main
0,        5,070      main
0,        0,044      malloc
0,        0,011      mpg123_getformat
0,        0,011      new_do_write
0,        0,176      plain_fullread
0,        0,011      play
0,        0,011      pow
0,        0,011      rar_open
0,        0,176      read_callback
0,        0,011      reinit_audio_chain
0,        0,011      reopen_stream
0,        0,011      set_synth_functions
0,        0,033      stream_fill_buffer
0,        0,066      stream_read
0,        0,011      update_osd_msg
0,        0,011      vobsub_open
0,        0,011      vobsub_parse_ifo


Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
7,204     7,204      <Total>
3,409     3,409      III_dequantize_sample
1,474     1,782      synth_1to1_MMX
0,726     0,726      dct36
0,682     0,682      dct64_sse
0,374     4,597      do_layer3
0,066     0,066      fast_memcpy
0,044     0,044      ff_ac3_parse_header
0,044     0,044      _int_malloc
0,044     0,044      memcpy
0,044     0,044      __read_nocancel
0,044     0,055      synth_1to1
0,033     0,033      h261_probe
0,033     0,033      mxf_probe
0,022     0,022      III_get_scale_factors_1
0,022     0,022      _int_free
0,022     0,022      ipmovie_probe
0,022     0,022      memset
0,022     0,022      __select_nocancel
0,011     0,011      af_fix_parameters
0,011     0,011      analyze
0,011     0,011      dct12
0,011     0,187      demux_read_data
0,011     0,011      __ieee754_pow
0,011     0,011      nsv_probe
0,011     0,011      strcmp
0,        0,044      ac3_eac3_probe
0,        0,154      av_probe_input_format2
0,        4,839      decode_audio
0,        0,099      demux_audio_fill_buffer
0,        0,154      demux_open
0,        0,154      demux_open_stream
0,        0,110      ds_fill_buffer
0,        0,044      fill_buffer
0,        0,011      free
0,        0,011      _int_realloc
0,        0,154      lavf_check_file
0,        0,154      lavf_check_preferred_file
0,        5,026      main
0,        0,044      malloc
0,        4,784      MP3_DecodeFrame
0,        0,011      MP3_Init
0,        0,011      mpegts_probe
0,        0,022      mp_input_get_cmd
0,        0,011      parse_codec_cfg
0,        0,011      pow
0,        0,011      realloc
0,        0,044      stream_fill_buffer
0,        0,044      stream_read

Now... I'd be thankful for explanations...
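
To compare the profiles, it helps to turn the exclusive times into
shares of the total. The tables use a decimal comma, so the sketch
below converts that first. Function names are hypothetical, and the
code assumes the default "C" locale, where strtod() expects a decimal
point.

```c
#include <stdlib.h>
#include <string.h>

/* Parse a number written with a decimal comma, e.g. "3,035" -> 3.035.
 * Returns -1.0 if the text is empty, too long or not a number. */
double parse_comma_decimal(const char *text)
{
    char buf[32];
    char *end;
    size_t len = strlen(text);
    double value;

    if (len == 0 || len >= sizeof buf)
        return -1.0;
    memcpy(buf, text, len + 1);          /* includes the terminating NUL */
    for (size_t i = 0; i < len; ++i)
        if (buf[i] == ',')
            buf[i] = '.';
    value = strtod(buf, &end);
    return (*end == '\0') ? value : -1.0;
}

/* Share of `part` in `total`, in percent; -1.0 on bad input. */
double share_percent(const char *part, const char *total)
{
    double p = parse_comma_decimal(part);
    double t = parse_comma_decimal(total);
    if (p < 0.0 || t <= 0.0)
        return -1.0;
    return p / t * 100.0;
}
```

For the first setup, share_percent("3,035", "6,940") puts
III_dequantize_sample at roughly 43.7 % of the mpg123 profile, while
share_percent("3,332", "6,764") gives roughly 49.3 % for mp3lib.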

Oh, I did an SVN update to make the patch current again... after that,
the two cases look like that:


real	0m6.179s
user	0m6.107s
sys	0m0.052s

real	0m5.806s
user	0m5.754s
sys	0m0.035s



real	0m6.205s
user	0m6.142s
sys	0m0.045s

real	0m6.372s
user	0m6.313s
sys	0m0.044s

If only I knew what gives mp3lib that erratic performance boost in
the first case. This latest number is in the same range as, if not a tick
better than, plain mpg123. I am confident that I'd get totally
differing numbers when I start trying out different compilers... but I
need some time to waste on other things than this hobby. Sleep, for

Alrighty then,


PS: I'm BCC'ing the mpg123 devel list, mainly as an attention raiser for
mpg123 folks and reminder for myself. No need to copy the whole
discussion, though.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mplayer-libmpg123.diff
Type: text/x-patch
Size: 25445 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/attachments/20100530/34a3a9fd/attachment.bin>