[MPlayer-dev-eng] Adding threaded functionality to mplayer NODAEMON

Thu Sep 9 22:23:57 CEST 2004

On Wed, 8 Sep 2004 21:48:27 +0900
Attila Kinali <attila at kinali.ch> wrote:

> Having multiple processes using large amounts of
> memory (where large is more than a few cache lines) and running
> concurrently is a huge decrease in performance compared to running them in a
> non-preempting, round-robin manner. Search for cache efficiency and
> related topics if you want to know more.
> Also note that context switching is very expensive, it varies depending
> on the CPU and operating system between 100s and 1000s of cycles.
> Nothing that you want to do too often.

What you wrote is very reasonable, so I tried a little test to see how
much speed is lost when running a lot of processes simultaneously.

My test is:
create ten files

  for i in `seq 0 9`; do cp /var/log/messages $i;done

now compress them (bzip2 -1 has a sorting stage operating on 100000 bytes,
which can stay inside the L2 cache on my Athlon)

  time sh -c 'for i in `seq 0 9`; do bzip2 -1 <$i >/dev/null ; done ;wait'

("wait" is useless in this case)
now try again running the ten processes in parallel

  time sh -c 'for i in `seq 0 9`; do bzip2 -1 <$i >/dev/null & done ;wait'

I was expecting a significant difference, but the results are very close.
Indeed, running the test multiple times, it can easily happen that the parallel
version takes less time.

This is on a Linux 2.6.7 kernel; I suppose the scheduler is doing a good job
of avoiding too many context switches.
In fact, vmstat reports about 130 cs/s when idle, about 130 cs/s
when executing serially and about 140 cs/s when executing in parallel.
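
For anyone who wants to repeat the comparison without a shell, here is a
rough Python sketch of the same idea. It is not the test I ran: it uses
threads instead of separate bzip2 processes, the input data is synthetic,
and all the function names are made up for the example. It also assumes
that CPython's bz2 module releases the interpreter lock while compressing,
so that the threads really overlap.

```python
import bz2
import os
import time
from concurrent.futures import ThreadPoolExecutor


def make_inputs(count=10, size=200_000, seed=b"log line\n"):
    """Build `count` byte blobs standing in for the ten copied log files."""
    # Repetitive text plus a little random data, so bzip2 has real work to do.
    return [seed * (size // len(seed)) + os.urandom(1024) for _ in range(count)]


def _bzip2_level1(blob):
    # Equivalent in spirit to "bzip2 -1": smallest block size.
    return bz2.compress(blob, compresslevel=1)


def compress_serial(blobs):
    """Compress one blob after another, like the first shell loop."""
    return [_bzip2_level1(b) for b in blobs]


def compress_parallel(blobs):
    """Start all compressions at once, like the loop using '&'."""
    with ThreadPoolExecutor(max_workers=len(blobs)) as pool:
        return list(pool.map(_bzip2_level1, blobs))


def timed(fn, blobs):
    """Return (elapsed seconds, result) for one run of fn(blobs)."""
    start = time.perf_counter()
    out = fn(blobs)
    return time.perf_counter() - start, out


if __name__ == "__main__":
    blobs = make_inputs()
    t_serial, _ = timed(compress_serial, blobs)
    t_parallel, _ = timed(compress_parallel, blobs)
    print(f"serial   {t_serial:.3f}s")
    print(f"parallel {t_parallel:.3f}s")
```

The timings will depend on the machine and the number of CPUs, so run it
several times before drawing conclusions.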

I was trying to prove your point and actually disproved it. Do you have an
easy test showing a real performance loss?

I'm not an expert, please correct me, but the multithreaded approach
should give a nice performance gain on SMP machines, right?
There are not many SMP machines around (among normal users) yet, but
CPU companies are moving toward multicore chips and maybe even HT
processors may gain some speed.

If it is confirmed by tests that performance is not seriously ruined by
the multithread approach (on good operating systems, at least),
the idea of splitting mplayer has to be taken into consideration.

Your idea is that having 3 or 4 threads makes things difficult (and it
may be true), but I would love to see what happens to the performance
when mplayer is split into, let's say, 10 threads, by not only dividing
the work between read/video/audio, but using a "data flow" or "pipeline"
approach to each of them.
For example, video decoding, video postprocessing and video output can be
done by three different threads passing frames between them
asynchronously (maybe with some FIFO buffers between them),
passing pointers and avoiding copying data around, of course.
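
To make the idea concrete, here is a minimal sketch of such a pipeline in
Python (not mplayer code). Each stage is a thread, the stages are joined by
bounded FIFO queues, and frames are passed by reference, never copied. The
names run_pipeline, stage and STOP and the FIFO size are invented for the
example.

```python
import queue
import threading

STOP = object()  # sentinel passed down the pipeline to shut every stage down


def stage(work, inbox, outbox):
    """Run `work` on each item taken from inbox and pass the result on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(work(item))


def run_pipeline(frames, decode, postprocess, fifo_size=4):
    """Push frames through decode -> postprocess -> output, one thread each."""
    q_in, q_dec, q_pp = (queue.Queue(maxsize=fifo_size) for _ in range(3))
    shown = []

    def output():
        while True:
            frame = q_pp.get()
            if frame is STOP:
                return
            shown.append(frame)  # a real player would display the frame here

    threads = [
        threading.Thread(target=stage, args=(decode, q_in, q_dec)),
        threading.Thread(target=stage, args=(postprocess, q_dec, q_pp)),
        threading.Thread(target=output),
    ]
    for t in threads:
        t.start()
    for f in frames:
        q_in.put(f)      # blocks when the first FIFO is full
    q_in.put(STOP)
    for t in threads:
        t.join()
    return shown
```

The bounded queues give back-pressure for free: a fast decoder blocks once
it is fifo_size frames ahead of the postprocessor instead of eating memory.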

If I have two CPUs and a multithreaded mplayer, I can play an HDTV
stream with heavy postprocessing (say, 80% CPU decoding, 60% CPU
postprocessing, 30% CPU audio). A single-threaded mplayer would be unable
to play even unprocessed video and audio (80% + 30% = 110%) because it
cannot use the second CPU, which sits idle.
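
The arithmetic can be checked with a few lines of Python. This is only a
back-of-the-envelope model: it assumes each thread's load is a fixed
percentage of one CPU and that a thread cannot be split across CPUs. The
function name fits is invented here.

```python
from itertools import product


def fits(loads, cpus):
    """Return True if the per-thread loads (percent of one CPU) can be
    spread over `cpus` processors with no processor exceeding 100%."""
    # Brute force over every thread-to-CPU assignment; fine for a few threads.
    for assignment in product(range(cpus), repeat=len(loads)):
        totals = [0] * cpus
        for load, cpu in zip(loads, assignment):
            totals[cpu] += load
        if max(totals) <= 100:
            return True
    return False
```

With three threads, fits([80, 60, 30], 2) is true (80 on one CPU, 60 + 30
on the other). A single thread doing only decoding and audio is one load of
110, and fits([110], 2) is false no matter how many CPUs are idle.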

> How does the audio thread know how long the video thread has to wait?
> What happens if the audio thread needs more than one frame time to
> decode _and_ display it? Also you would need to make a
> file reader/demuxer thread, because both audio and video need it
> independently, thus you have already 3 threads to synchronize, not 2.
> How do you handle the case of a subtitle file? Or a separate audio
> file?

I think the right solution is to have a global reference clock, with all
the other threads trying to be synchronous to this "metronome" thread.
This means that audio is not imposing the timing as it does now, but trying
to catch up with the metronome (via resampling, skipping samples or
manipulating the hardware clock).

Then we have the choice of where the metronome takes its reference from.
Some possibilities: RTC (accurate playing of a local file), PTS
(for accurate playing of broadcast material), sound output clock
(to avoid resampling), cache buffer usage (for realtime streams
from the net or DVB), video output clock (to avoid skipped and duplicated
frames when the output is, say, PAL interlaced)...
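
A minimal sketch of the metronome idea, in Python for brevity. The class
and function names, the 20ms tolerance and the returned action strings are
all assumptions made for the example. The point is only that the time
source is pluggable and that every output thread compares its own position
against the same clock.

```python
import time


class Metronome:
    """Reference clock: turns a pluggable time source into media time."""

    def __init__(self, source=time.monotonic):
        # The source could be the RTC, the sound card clock, a PTS feed, ...
        self._source = source
        self._origin = source()

    def now(self):
        """Seconds of media time elapsed since the clock was created."""
        return self._source() - self._origin


def sync_action(clock, stream_pos, tolerance=0.02):
    """Tell an output thread how to get back in step with the metronome.

    stream_pos is the media time (in seconds) the thread has output so far.
    """
    drift = stream_pos - clock.now()
    if drift > tolerance:
        return "slow_down"   # ahead of the reference: repeat or stretch
    if drift < -tolerance:
        return "speed_up"    # behind the reference: skip or compress
    return "on_time"
```

Swapping the reference is then a matter of passing a different source
callable; the audio and video threads do not need to know which one is in
use.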

Mplayer is a wonderful tool as it is now, but I found a big inconvenience
when playing real time streams (DVB, in my case). Mplayer is always either
too fast, emptying the buffers, or too slow, overflowing them.

I hacked a simple solution here: I monitor the fill level of the cache buffers
and skip audio samples to go faster or play audio samples twice to go slower,
keeping the cache at a nearly constant level. The implementation is
a hack, the quality of audio is compromised and it is conceptually wrong
(think about VBR streams), but I solved my problem with a few lines of code
and I'm able to play from /dev/dvb/adapter0/dvr0 reliably.
(the recent patch involving faster/slower playback may help solve
this in a better way)
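
The hack boils down to something like the following Python sketch. It is
not the actual patch: the thresholds, the "every 100th sample" step and the
name pace_audio are invented to show the principle.

```python
def pace_audio(samples, fill, low=0.4, high=0.6, every=100):
    """Return `samples` adjusted to nudge the cache fill back into [low, high].

    fill is the cache fill level as a fraction between 0.0 and 1.0.
    """
    if fill > high:
        # Cache is filling up: play faster by dropping every `every`-th sample.
        return [s for i, s in enumerate(samples) if (i + 1) % every != 0]
    if fill < low:
        # Cache is draining: play slower by repeating every `every`-th sample.
        out = []
        for i, s in enumerate(samples):
            out.append(s)
            if (i + 1) % every == 0:
                out.append(s)
        return out
    # Fill level is inside the target band: leave the audio alone.
    return list(samples)
```

With every=100 the speed changes by about 1% in either direction, which is
crude but enough to keep a live stream from draining or overflowing.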

Finally, having many fully decoupled functional blocks passing messages
between them could permit some heavy manipulation of the video, e.g.
frame rate changing on the fly (so inverse telecine and deinterlacing in
real time), good OSD support, good pause/play support (with the OSD working
even in paused mode because the display thread can go on while decoding
is stopped), analyzing or dumping blocks everywhere and lots of
crazy features (multi input support with PIP and audio mixing, with
one source paused and another playing at double speed)...

The "use sound as time base" approach simplifies the basic functionality
but creates nasty problems (if my sound board plays at about 48600Hz instead
of 48000Hz, the video runs about 1% faster, emptying my buffers).
Let me also say that we have the "-nosound" option, so we have to suddenly
abandon the sound time base and switch to an RTC mode.
But then we have -autosync, so we're doing a hybrid of two clock sources
and trying to keep the A-V delta under control.
Compare this to the alternative approach: everyone syncs to the reference.
How the reference is generated is not important.
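
To put a number on the sound board example: 48600/48000 = 1.0125, so
playback runs 1.25% fast. The small Python calculation below shows how
quickly that empties a buffer. It assumes a live source delivering data in
exact real time, and the function name is made up.

```python
def seconds_until_empty(buffered_s, actual_hz, nominal_hz=48000):
    """Wall-clock seconds until `buffered_s` seconds of buffered media are
    used up, when a live source delivers in real time but the sound card
    consumes at actual_hz instead of nominal_hz."""
    speedup = actual_hz / nominal_hz      # e.g. 48600 / 48000 = 1.0125
    if speedup <= 1.0:
        return float("inf")               # not draining (may overflow instead)
    # Each wall-clock second consumes (speedup - 1) seconds more than arrive.
    return buffered_s / (speedup - 1.0)
```

With 2 seconds of stream buffered, seconds_until_empty(2, 48600) gives
about 160, i.e. the buffer runs dry in under three minutes.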

The recent discussions about tuning the delays and loop spinning speed,
and the risk of messing everything up, are a clear consequence of a
basic design which is being pushed beyond its limits.

> I still think it's not true. I agree that the current state machine (an
> event loop is something different) is not easy to understand and
> contains a lot of obfuscated code.

IMHO both a state machine and an event loop work in this way: you
decide what to do next based on program flow or a global state
or a table of events. When you decide to do "X", you do "X", then
you go back to another decision. You are just implementing a (hopefully
smart) scheduler, but it is just a cooperative multitasking scheduler.
If "X" takes 200ms, bye bye synchronization.
With multiple threads (on a single CPU) you're doing a similar thing,
but it is now preemptive multitasking. I can stop the decoding of a frame
to write some audio to the sound card immediately.
For example: I'm displaying frame 100 and playing sound 100, I have
frame 101 and sound 101 already decoded in my buffers, but they have to
go out 10ms and 12ms from now. I can consider decoding frame 102
now, but if 10ms are not enough I will miss the 101 deadline.
If decoding is done by a thread, I go on decoding frame 102 and after
10ms the video output thread can preempt the decoder and show frame 101,
then let the decoder continue to work on frame 102; the same happens
for the sound 2ms later.
With a good priority strategy (output thread maximum priority, input
thread minimum) you can get a very low latency or use big FIFO
buffers to handle sudden CPU unavailability caused by other processes.

Another example: decoding video takes 50% CPU, decoding sound 10%
and postprocessing video 60%. With a non-threaded approach there is
only one solution: no postprocessing. With threads I can enable postprocessing
at a low priority: when I am displaying frame 100, have frames 101
and 102 already decoded (and can't decode 103 yet) and have 10ms before
displaying frame 101, all the other threads are idle, so the postprocessing
thread can work on frame 101. After 10ms we have to display frame 101, and at
that time we take 101proc if it's done, or else plain 101, aborting the
postproc if 101proc is not ready yet. With a clever strategy (the postprocessor
can decide that with only 10ms left maybe it should try postprocessing frame
102 instead) you can postprocess all the frames the current CPU load permits,
without any delay.
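
The "take 101proc if it's done, else 101" decision is small enough to
sketch. This Python version assumes the postprocessing job is represented
by a concurrent.futures.Future (for example one returned by
ThreadPoolExecutor.submit); the function name is invented.

```python
from concurrent.futures import Future


def frame_to_display(raw_frame, postproc):
    """At the display deadline, pick the postprocessed frame if it is ready.

    postproc is a Future for the postprocessed version of raw_frame.
    """
    finished_ok = (
        postproc.done()
        and not postproc.cancelled()
        and postproc.exception() is None
    )
    if finished_ok:
        return postproc.result()
    # Not ready in time: give up on it and show the plain decoded frame.
    postproc.cancel()  # note: this only stops work that has not started yet
    return raw_frame
```

Future.cancel() cannot interrupt a job that is already running, so a real
postprocessor would also need to check an abort flag between slices of
work.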

And don't forget what happens when many CPUs are available...

Am I describing DirectX Graph Flows? Am I describing gstreamer? I don't
know, it's just the way I see it.

Sorry, I'm realizing I've written an essay without meaning to.
In conclusion, the threaded approach has great potential, but two
great disadvantages too:

1) less performance in basic situations
2) difficult coding

As my stupid bzip2 test seems to say that 1) is not serious (or simply
nonexistent), 2) remains. But 2) is just a challenge waiting for someone
to accept it.

I expect the answer to what I wrote is "so try it yourself and tell us
if it works", so let me say I have no time to experiment
with these matters, but I would be interested in helping with some aspects
(all the sync stuff looks cool to work on).

I hope to receive more insightful comments than flames :-)

-- 
   Roberto Ragusa    mail at robertoragusa.it