[MPlayer-dev-eng] Adding threaded functionality to mplayer NODAEMON

Zoltan Hidvegi mplayer at hzoli.2y.net
Thu Sep 9 22:58:52 CEST 2004


> > Also note that context switching is very expensive, it varies depending
> > on the CPU and operating system between 100s and 1000s of cycles.
> > Nothing that you want to do too often.
> 
> What you wrote is very reasonable, so I tried a little test to see how
> much speed is lost when running a lot of processes simultaneously.
> 
> My test is:
> create ten files
> 
>   for i in `seq 0 9`; do cp /var/log/messages $i;done
> 
> now compress them (bzip2 -1 has a sorting stage operating on 100000 bytes,
> which can stay inside the L2 cache on my Athlon)
> 
>   time sh -c 'for i in `seq 0 9`; do bzip2 -1 <$i >/dev/null ; done ;wait'
> 
> ("wait" is useless in this case)
> now try again running the ten processes in parallel
> 
>   time sh -c 'for i in `seq 0 9`; do bzip2 -1 <$i >/dev/null & done ;wait'
> 
> I was expecting a significant difference, but the results are very close.
> Indeed, running the test multiple times, it can easily happen that the parallel
> version takes less time.

Your argument is flawed here.  You are running independent jobs which
do not have to interact with each other.  And they are
non-interactive jobs without any timing constraints.  The kernel
allocates relatively large time slices for such jobs, e.g. when you
run 10 bzip2 jobs, each one may run for half a second alone before the
kernel switches to the next one.  So you will have very few context
switches, and cache pollution effects will be negligible.  But if the
threads have to communicate, as in the case of mplayer, you will
have to handle possibly hundreds of context switches per second, and
on a single-CPU machine, that usually involves kernel overhead.
Lightweight threads can eliminate the kernel overhead, and a few
hundred context switches per second is still not too high, so you may
still be right that with very careful design, you can have a good
multi-threaded media player.  But you really have to know what you are
doing, and debugging multi-threaded code can be extremely difficult.

And if you have multiple processors, then you can have interaction
between real threads without kernel overhead, but then you have to be
careful about two threads accessing the same memory, because that will
create cache coherency traffic.  You also have to be careful to be
consistent about which thread operates on which memory areas, because
if you switch work between threads, you will thrash the caches.

> I'm not an expert, please correct me, but the multi threaded approach
> should give a nice performance gain on SMP machines, right?

True, IF you know what you are doing, and the job can be partitioned
relatively evenly between the processors in a way that the threads do
not fight with each other over the same memory areas.

> There are not many SMP machines around (among normal users) yet, but
> CPU companies are moving toward multicore chips and maybe even HT
> processors may gain some speed.

HT is different, because there is only one cache shared by both
logical CPUs.  You may have a program that runs well on HT and badly
on real SMP.

> Your idea is that having 3 or 4 threads makes things difficult (and it
> may be true), but I would love to see what happens to the performance
> when mplayer is split into, let's say, 10 threads, by not only dividing
> the work between read/video/audio, but using a "data flow" or "pipeline"
> approach to each of them.
> For example video decoding, video postprocessing and video output can be
> done with three different threads passing frames between them in an
> asynchronous way (and with some FIFO buffers between them, maybe).
> (by passing pointers and avoiding copying data around, of course)

And this means that the contents of the caches will be copied between
CPUs.  If a frame can fit into your cache, then running each filter on
a different CPU can actually slow you down.  A DVD frame is 691200
bytes, which can fit into the cache.  I think using slices can help to
reduce the cache burden, so the filter chain can finish a complete
slice before the next slice is decoded.

> If I have two CPU and a multithreaded mplayer, I can play an HDTV
> stream with heavy postprocessing (say, 80% CPU decoding, 60% CPU
> postprocessing, 30% CPU audio). A monothread mplayer would be unable
> to play unprocessed video and audio (110%) because the second
> CPU is idle. 

Yes, an HDTV frame is too big for most processor caches, so in that
case, it will probably help.

Zoli



