[MPlayer-dev-eng] [PATCH] replacement for internal mpg123 fork (mp3lib), what is performance?
Thomas Orgis
thomas-forum at orgis.org
Sun May 30 02:08:52 CEST 2010
On Sat, 29 May 2010 18:44:37 +0200,
Reimar Döffinger <Reimar.Doeffinger at gmx.de> wrote:
> > I'd like to add that part of the issue may be that... Well...
> > The mpg123 3dnow code simply is going to be slower, because
> > 1) it calls femms _twice_ per dct function call
> > 2) it generates a stack frame (i.e. saves/changes ebp). MPlayer by default does not.
> > 3) it reserves 120 bytes on stack for local variables even though it never
> > uses any on-stack variables.
> > 4) worse, it pushes the ebx and esi registers _after_ the stack increase,
> > thus increasing cache pressure for no good reason at all.
... Thanks for noting these points. I did not work on the 3DNow code
apart from keeping it working -- seems there are some old sins left
over from its inception.
Are you talking only about the dct64 or did you also look at the synth?
> > Maybe I am missing some good reason for this, but so far I think that
> > the code, honestly spoken, is simply crap.
Well, you shouldn't judge it too harshly: It's what the MPlayer code
is derived from;-)
> Argh! And it doesn't even compile on x86_64 (./configure --with-cpu=3dnow).
It was never intended to build on x86-64. We have improved SSE code by
Taihei Monma for that platform, including special variants for mono/stereo,
accurate rounding, different sample formats.
Do you suggest that 3DNow would be a good choice for x86-64? I mean, I
could somehow understand that one might want to use 3DNowExt, but our
SSE code works on any x86-64 CPU, and it enables mpg123 (the
console app) to decode faster than mplayer anyway -- although the
margin can be slim (on my K6-3+ with 3DNowExt, it's 55.0 seconds
against 55.6 seconds user CPU time for decoding the test album I
indicated). I also compared a 32 bit mpg123 build (for Athlon XP) with a
64 bit build on an Opteron 2210, along with mplayer+mp3lib:
mpg123-32 --cpu 3dnowext: 9.8 seconds
mpg123-32 --cpu sse: 9.2 seconds
mpg123-64 (--cpu sse): 8.6 seconds
mplayer-64: 9.2 seconds
I see 64 bit SSE as a clear winner here... I wouldn't bet on 64bit
3DNowExt beating that in mpg123.
But I have trouble diagnosing the performance behaviour of MPlayer
with regards to ad_mpg123. We already diagnosed the issue of inefficient
memory handling, which is improved in the mpg123 trunk / snapshots.
But there is still Diego's K6-3+, in final agreement with mine, that
doesn't really like the new ad_mpg123, decoding significantly slower
than stand-alone mpg123.
At the moment, I am still rather clueless about this... I did observe,
though, that the K6-3+ seems to be rather sensitive to code layout.
I observed a huge effect of a change in mpg123 1.8.0, which only
slightly rearranged the storage of the function pointers used to select
optimized routines at runtime. It effectively makes the dynamic code as
fast as the static (--with-cpu=x86 against
--with-cpu=3dnowext_alone). But then, have a look at this comparison
of mpg123 1.6.4 and the current snapshot (where I had to fix the
3dnowext_alone build again, sorry). This is decoding the second track
of the Convergence album, timing results on the K6-3+ -- always using
3DNowExt decoding, but either via runtime or build-time choice:
mpg123-1.6.4-x86
real 0m6.045s
user 0m5.970s
sys 0m0.080s
mpg123-1.6.4-3dnowext_alone
real 0m4.922s
user 0m4.860s
sys 0m0.060s
mpg123-20100524000000-x86
real 0m5.016s
user 0m4.920s
sys 0m0.090s
mpg123-20100524000000-3dnowext_alone
real 0m5.010s
user 0m4.910s
sys 0m0.090s
Observe the insignificant difference between the two build variants of the
current version, and the gross hit the dynamic code takes with 1.6.4.
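For readers not familiar with the two build variants, here is a minimal
C sketch of the difference between runtime and build-time selection of an
optimized routine. It is an illustration only: all names (sum_fn,
sum_generic, sum_unrolled, struct decoder, decoder_select, USE_FAST_PATH)
are hypothetical and not taken from mpg123, and the toy summation merely
stands in for a real dct64 or synth routine. The point is that the
dynamic build reaches the routine through a pointer stored in memory, so
where that pointer lives can matter, while the static build binds the
call at compile time.

```c
#include <stddef.h>

/* Hypothetical routine type; stands in for an optimized decoder kernel. */
typedef long (*sum_fn)(const int *v, size_t n);

/* Generic variant. */
long sum_generic(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

/* "Optimized" variant: same result, two accumulators. */
long sum_unrolled(const int *v, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)          /* odd element left over */
        s0 += v[i];
    return s0 + s1;
}

/* Dynamic build: the choice is stored as a pointer in the decoder state. */
struct decoder {
    sum_fn sum;
};

void decoder_select(struct decoder *d, int cpu_has_fast_path)
{
    d->sum = cpu_has_fast_path ? sum_unrolled : sum_generic;
}

long decode_dynamic(const struct decoder *d, const int *v, size_t n)
{
    return d->sum(v, n);    /* indirect call through the stored pointer */
}

/* Static build: the choice is fixed by the preprocessor at build time. */
#ifdef USE_FAST_PATH
#define SUM_STATIC sum_unrolled
#else
#define SUM_STATIC sum_generic
#endif

long decode_static(const int *v, size_t n)
{
    return SUM_STATIC(v, n);    /* direct call */
}
```

Both paths compute the same thing; only the way the call target is found
differs, which is the kind of detail the timings above appear to be
sensitive to on the K6-3+.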
Then, compare to the deal on an Athlon XP:
mpg123-1.6.4-x86
real 0m3.921s
user 0m1.110s
sys 0m0.050s
mpg123-1.6.4-3dnowext_alone
real 0m2.404s
user 0m1.190s
sys 0m0.040s
OK, that one is a bit tight... but a tendency is appearing. Let's do the
whole album to drive the point home:
mpg123-1.6.4-x86
real 0m12.285s
user 0m11.750s
sys 0m0.360s
mpg123-1.6.4-3dnowext_alone
real 0m12.782s
user 0m12.430s
sys 0m0.350s
So here, the dynamic code wins over the build-time optimization! The
SSE build of 1.6.4 is as fast as 3DNowExt, the improved SSE code of
current mpg123 shaves off another half second for the whole album.
Well... that is just to tell the story of how we are dealing with
subtle effects beyond any hand-crafting of assembly instructions. This
might not be about the CPU as such, but instead about the glibc or gcc
version (my test machines have different Linux systems...), but in
effect, it's about the current system setup, not just about the code.
Now for something really freakish: On my Thinkpad X200 (Core2Duo, 64
bit system), I observe a change in the _mp3lib_ performance in mplayer
depending on a change in ad_mpg123.c. I kid you not, that is what I
see.
Please see the attached new version of the patch. It includes
preprocessor action to select different configurations and ways to do
I/O to libmpg123 (using the latest snapshot of mpg123).
Specifically, the setup
#define AD_MPG123_CALLBACK
#define AD_MPG123_PACKET
/* #define AD_MPG123_SEEKBUFFER */
leads to the following measurement (on the Thinkpad X200):
mplayer-svn$ for i in mpg123 mp3; do echo $i; time ./mplayer -ao pcm:file=/dev/null -quiet -ac $i ../../convergence_-_points_of_view/*.mp3 > /dev/null ; done
mpg123
real 0m6.170s
user 0m6.099s
sys 0m0.054s
mp3
real 0m6.107s
user 0m6.060s
sys 0m0.029s
While
#define AD_MPG123_CALLBACK
/* #define AD_MPG123_PACKET */
/* #define AD_MPG123_SEEKBUFFER */
gives that:
mplayer-svn$ for i in mpg123 mp3; do echo $i; time ./mplayer -ao pcm:file=/dev/null -quiet -ac $i ../../convergence_-_points_of_view/*.mp3 > /dev/null ; done
mpg123
real 0m6.204s
user 0m6.131s
sys 0m0.048s
mp3
real 0m6.505s
user 0m6.448s
sys 0m0.042s
So... my intention was to investigate what slows down mpg123
decoding in mplayer... and now I have managed to significantly slow down
mp3lib without touching its code! Can someone reproduce that (with gcc
4.3.3)? I consider any tuning of ad_mpg123 futile as long as we have
unexplained effects of such scale.
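To put a number on the scale of the effect, the relative change between
two timings can be computed with a small helper. This is only a sketch
for the arithmetic; percent_change is a hypothetical name and not part of
the patch or of either decoder.

```c
/* Relative change of t_new against t_ref, in percent.
   Positive values mean t_new is slower than t_ref. */
double percent_change(double t_ref, double t_new)
{
    return (t_new - t_ref) / t_ref * 100.0;
}
```

With the user times from the two runs above, percent_change(6.060, 6.448)
gives about 6.4 for mp3lib, while percent_change(6.099, 6.131) gives about
0.5 for ad_mpg123. So the change in ad_mpg123.c moves the mp3lib timing
roughly ten times as much as it moves the timing of ad_mpg123 itself.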
I close with some profile data (collected via 'collect' of Sun Studio,
builds are done with gcc, though):
The first setup, mpg123, using packets:
Excl. User CPU (sec.)  Incl. User CPU (sec.)  Name
6,940 6,940 <Total>
3,035 3,035 III_dequantize_sample
1,925 1,925 <static>@0x33e79
0,913 0,913 dct36
0,440 4,553 do_layer3
0,132 0,132 memcpy
0,055 0,055 III_get_scale_factors_1
0,055 0,055 mxf_probe
0,055 0,066 synth_1to1_stereo_x86_64
0,044 0,044 fast_memcpy
0,033 0,033 dct12
0,033 0,033 __read_nocancel
0,022 0,022 ff_ac3_parse_header
0,011 0,033 ac3_eac3_probe
0,011 0,011 compute_bpf
0,011 0,011 dts_probe
0,011 0,011 dv_probe
0,011 0,077 generic_head_read
0,011 0,055 generic_read_frame_body
0,011 0,011 h261_probe
0,011 0,011 h263_probe
0,011 0,011 _int_free
0,011 0,011 ipmovie_probe
0,011 0,011 memset
0,011 0,011 mpegps_probe
0,011 0,011 __mul
0,011 0,011 nut_probe
0,011 0,110 plain_fullread
0,011 0,011 __select_nocancel
0,011 0,011 strcmp
0,011 0,011 __write_nocancel
0, 0,165 av_probe_input_format2
0, 4,784 decode_audio
0, 4,751 decode_audio
0, 4,531 decode_the_frame
0, 0,044 demux_audio_fill_buffer
0, 0,011 demux_info_print
0, 0,165 demux_open
0, 0,165 demux_open_stream
0, 0,055 ds_fill_buffer
0, 0,055 ds_get_packet
0, 0,055 ds_get_packet_pts
0, 0,011 __dvd
0, 0,033 fill_buffer
0, 0,011 fputs
0, 0,011 free
0, 0,143 get_next_frame
0, 0,011 init_audio
0, 0,011 init_best_audio_codec
0, 0,011 init_layer3
0, 0,011 _IO_new_do_write
0, 0,011 _IO_new_file_write
0, 0,011 _IO_new_file_xsputn
0, 0,165 lavf_check_file
0, 0,165 lavf_check_preferred_file
0, 0,011 __libc_start_main
0, 4,982 main
0, 4,751 mpg123_decode
0, 0,011 mpg123_init
0, 0,011 mp_input_get_cmd
0, 0,011 mp_msg
0, 0,011 __mptan
0, 0,011 new_do_write
0, 0,011 preinit
0, 0,099 read_callback
0, 0,143 read_frame
0, 0,011 reinit_audio_chain
0, 0,011 set_pointer
0, 0,033 stream_fill_buffer
0, 0,044 stream_read
0, 0,011 tan
0, 0,011 tanMp
...and this is mp3lib:
Excl. User CPU (sec.)  Incl. User CPU (sec.)  Name
6,764 6,764 <Total>
3,332 3,332 III_dequantize_sample
1,034 1,276 synth_1to1_MMX
0,935 0,935 dct36
0,627 0,627 dct64_sse
0,429 4,762 do_layer3
0,077 0,077 memcpy
0,055 0,055 ff_ac3_parse_header
0,044 0,044 dct12
0,033 0,033 mxf_probe
0,022 0,022 fast_memcpy
0,022 0,022 ipmovie_probe
0,022 0,022 __mul
0,011 0,011 analyze
0,011 0,055 ds_fill_buffer
0,011 0,011 __exp1
0,011 0,011 III_get_scale_factors_1
0,011 0,011 _int_free
0,011 0,011 malloc
0,011 4,850 MP3_DecodeFrame
0,011 0,011 mpegps_probe
0,011 0,011 nsv_probe
0,011 0,011 parse_codec_cfg
0,011 0,011 stream_fill_buffer
0,011 0,011 synth_1to1
0, 0,055 ac3_eac3_probe
0, 0,143 av_probe_input_format2
0, 0,022 __c32
0, 4,916 decode_audio
0, 0,044 demux_audio_fill_buffer
0, 0,143 demux_open
0, 0,143 demux_open_stream
0, 0,077 demux_read_data
0, 0,011 __ieee754_pow
0, 0,011 _int_realloc
0, 0,143 lavf_check_file
0, 0,143 lavf_check_preferred_file
0, 5,070 main
0, 0,033 MP3_Init
0, 0,011 mpegts_probe
0, 0,022 __mptan
0, 0,011 pow
0, 0,011 realloc
0, 0,022 stream_read
0, 0,022 tan
0, 0,022 tanMp
Now the second variant (mpg123, without packets):
Excl. User CPU (sec.)  Incl. User CPU (sec.)  Name
6,929 6,929 <Total>
3,244 3,244 III_dequantize_sample
1,848 1,848 <static>@0x33e79
0,891 0,891 dct36
0,242 4,597 do_layer3
0,110 0,121 synth_1to1_stereo_x86_64
0,088 0,088 memcpy
0,077 0,077 III_get_scale_factors_1
0,066 0,066 mxf_probe
0,055 0,055 ff_ac3_parse_header
0,044 0,044 fast_memcpy
0,044 0,044 _int_malloc
0,033 0,033 __read_nocancel
0,033 0,033 <Unknown>
0,022 0,022 dct12
0,022 0,022 ipmovie_probe
0,011 0,066 ac3_eac3_probe
0,011 0,011 compute_bpf
0,011 0,121 demux_audio_fill_buffer
0,011 0,176 demux_read_data
0,011 0,011 __ieee754_pow
0,011 4,806 mpg123_decode
0,011 0,198 read_frame
0,011 0,011 strcmp
0,011 0,011 unrar_exec_get
0,011 0,011 __write_nocancel
0, 0,154 av_probe_input_format2
0, 4,850 decode_audio
0, 4,806 decode_audio
0, 4,586 decode_the_frame
0, 0,011 decode_update
0, 0,154 demux_open
0, 0,154 demux_open_stream
0, 0,121 ds_fill_buffer
0, 0,033 fill_buffer
0, 0,011 fwrite
0, 0,132 generic_head_read
0, 0,044 generic_read_frame_body
0, 0,209 get_next_frame
0, 0,033 gettimeofday
0, 0,022 GetTimer
0, 0,011 GetTimerMS
0, 0,011 init
0, 0,011 init_audio
0, 0,011 init_best_audio_codec
0, 0,011 init_layer3_gainpow2
0, 0,011 init_layer3_stuff
0, 0,011 _IO_new_do_write
0, 0,011 _IO_new_file_write
0, 0,011 _IO_new_file_xsputn
0, 0,154 lavf_check_file
0, 0,154 lavf_check_preferred_file
0, 0,011 __libc_start_main
0, 5,070 main
0, 0,044 malloc
0, 0,011 mpg123_getformat
0, 0,011 new_do_write
0, 0,176 plain_fullread
0, 0,011 play
0, 0,011 pow
0, 0,011 rar_open
0, 0,176 read_callback
0, 0,011 reinit_audio_chain
0, 0,011 reopen_stream
0, 0,011 set_synth_functions
0, 0,033 stream_fill_buffer
0, 0,066 stream_read
0, 0,011 update_osd_msg
0, 0,011 vobsub_open
0, 0,011 vobsub_parse_ifo
mp3lib:
Excl. User CPU (sec.)  Incl. User CPU (sec.)  Name
7,204 7,204 <Total>
3,409 3,409 III_dequantize_sample
1,474 1,782 synth_1to1_MMX
0,726 0,726 dct36
0,682 0,682 dct64_sse
0,374 4,597 do_layer3
0,066 0,066 fast_memcpy
0,044 0,044 ff_ac3_parse_header
0,044 0,044 _int_malloc
0,044 0,044 memcpy
0,044 0,044 __read_nocancel
0,044 0,055 synth_1to1
0,033 0,033 h261_probe
0,033 0,033 mxf_probe
0,022 0,022 III_get_scale_factors_1
0,022 0,022 _int_free
0,022 0,022 ipmovie_probe
0,022 0,022 memset
0,022 0,022 __select_nocancel
0,011 0,011 af_fix_parameters
0,011 0,011 analyze
0,011 0,011 dct12
0,011 0,187 demux_read_data
0,011 0,011 __ieee754_pow
0,011 0,011 nsv_probe
0,011 0,011 strcmp
0, 0,044 ac3_eac3_probe
0, 0,154 av_probe_input_format2
0, 4,839 decode_audio
0, 0,099 demux_audio_fill_buffer
0, 0,154 demux_open
0, 0,154 demux_open_stream
0, 0,110 ds_fill_buffer
0, 0,044 fill_buffer
0, 0,011 free
0, 0,011 _int_realloc
0, 0,154 lavf_check_file
0, 0,154 lavf_check_preferred_file
0, 5,026 main
0, 0,044 malloc
0, 4,784 MP3_DecodeFrame
0, 0,011 MP3_Init
0, 0,011 mpegts_probe
0, 0,022 mp_input_get_cmd
0, 0,011 parse_codec_cfg
0, 0,011 pow
0, 0,011 realloc
0, 0,044 stream_fill_buffer
0, 0,044 stream_read
Now... I'd be thankful for explanations...
Oh, I did an SVN update to make the patch current again... after that,
the two cases look like this:
mpg123
real 0m6.179s
user 0m6.107s
sys 0m0.052s
mp3
real 0m5.806s
user 0m5.754s
sys 0m0.035s
versus
mpg123
real 0m6.205s
user 0m6.142s
sys 0m0.045s
mp3
real 0m6.372s
user 0m6.313s
sys 0m0.044s
If only I knew what gives mp3lib that erratic performance boost in
the first case. This latest number is in the same range as plain mpg123,
if not even a tick better. I am confident that I'd get totally
different numbers if I started trying out different compilers... but I
need some time to waste on things other than this hobby. Sleep, for
example.
Alrighty then,
Thomas.
PS: I'm BCC'ing the mpg123 devel list, mainly as an attention raiser for
mpg123 folks and reminder for myself. No need to copy the whole
discussion, though.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mplayer-libmpg123.diff
Type: text/x-patch
Size: 25445 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/attachments/20100530/34a3a9fd/attachment.bin>