[MPlayer-dev-eng] Adding threaded functionality to mplayer NODAEMON

Attila Kinali attila at kinali.ch
Sun Sep 12 02:32:49 CEST 2004


DO NOT SEND MAILS DIRECTLY TO DEVELOPERS!
We all read this list!

On Thu, Sep 09, 2004 at 10:23:57PM +0200, Roberto Ragusa wrote:
> On Wed, 8 Sep 2004 21:48:27 +0900
> Attila Kinali <attila at kinali.ch> wrote:
 
> I was expecting a significant difference, but the results are very close.
> Indeed, running the test multiple times, it can easily happen that the parallel
> version takes less time.

A reason was already given.

> I'm not an expert, please correct me, but the multi threaded approach
> should give a nice performance gain on SMP machines, right?
> There are not many SMP machines around (among normal users) yet, but
> CPU companies are moving toward multicore chips and maybe even HT
> processors may gain some speed.

Not really. SMP machines have a hard time keeping their caches consistent
across processors. If you use different CPUs to operate on the same memory
regions at the same time, you will either lose cache consistency (if it's a
bad SMP design) or the traffic between the cache modules will stall the CPUs.
Also, splitting MPlayer into one thread per filter stage will not help
much, but will worsen the context-switch problem. MPlayer writes
slices of a picture (where possible, of course), meaning that it passes on
just-decoded stripes instead of the whole picture. This allows MPlayer to
process data while it is still in the L2 cache (or, for really small picture
regions, even in the L1 cache) and thus gives a huge performance improvement.
Even on SMP machines this would generate a lot of context switches (a
stripe is 8 pixels high, and a run-of-the-mill movie is 30fps at something
around 640x480 -> 480/8 * 30 = 1800 stripes/s) and you would lose the
advantage of the L2 cache. Processing images as a whole also has a
huge impact on performance, as the image gets evicted from the cache.
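To make the slice point concrete, here is a minimal C sketch (hypothetical
names, not MPlayer's actual draw_slice API): each just-decoded 8-pixel-high
stripe is filtered immediately, while it is still hot in the cache.

/* Minimal sketch of slice-based processing (hypothetical names, not
 * MPlayer's real video filter API). The decoder hands each 8-pixel-high
 * stripe to the next stage as soon as it is decoded, so the stripe is
 * filtered while still in L1/L2 cache instead of after the whole frame
 * has been written out. */
#define SLICE_H 8

struct frame {
    unsigned char *data;
    int width, height, stride;
};

static void filter_slice(struct frame *f, int y0, int h)
{
    /* toy "filter": touch every pixel of the stripe while it is cached */
    for (int y = y0; y < y0 + h; y++)
        for (int x = 0; x < f->width; x++)
            f->data[y * f->stride + x] ^= 0xff;  /* placeholder: invert */
}

/* called by the decoder once per decoded stripe */
void draw_slice_cb(struct frame *f, int y0)
{
    filter_slice(f, y0, SLICE_H);   /* data still hot in cache */
    /* ...pass the same stripe on to the video-out stage here... */
}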

Even HT will not help here. HT is designed for the case where you have two
processes that both fit into the L2 cache, but one suddenly has to wait
for data from RAM. I actually doubt that HT will gain any performance here
at all.

> If it is confirmed by tests that performance is not seriously ruined by
> the multithread approach (on good operating systems, at least),
> the idea of splitting mplayer has to be taken into consideration.

Only if the complexity isn't increased.

> Your idea is that having 3 or 4 threads makes things difficult (and it
> may be true), but I would love to see what happens to the performance
> when mplayer is split into, let's say, 10 threads, by not only dividing
> the work between read/video/audio, but using a "data flow" or "pipeline"
> approach to each of them.
> For example video decoding, video postprocessing and video output can be
> done with three different threads passing frames between them in
> asynchronous way (and with some FIFO buffers between them, maybe).
> (by passing pointers and avoiding copying data around, of course)

On a single-CPU machine this will have a huge impact on cache
efficiency. On an SMP machine passing pointers does not help, as the data
has to be copied between the CPUs' caches anyway.

> If I have two CPU and a multithreaded mplayer, I can play an HDTV
> stream with heavy postprocessing (say, 80% CPU decoding, 60% CPU
> postprocessing, 30% CPU audio). A monothread mplayer would be unable
> to play unprocessed video and audio (110%) because the second
> CPU is idle. 

If, and only if, you can utilize both CPUs without too much traffic between
them. Read the performance analysis papers on parallel computers and
clusters: all of them are limited by the traffic between the nodes.
A better approach would be to use the second CPU's slot as an
expansion slot and put in a HW decoder, i.e. put in an FPGA and do all the
CPU-intensive stuff (like DCT and colour space conversion) in hardware.
Alternatively you could use a RAM slot with caching switched off.

> > How does the audio thread know how long the video thread has to wait ?
> > What happends if the audio thread needs more than one frame time to
> > decode _and_ display it ? Also you would need to make a 
> > file reader/demuxer thread, because both audio and video need it
> > independently, thus you have already 3 threads to synchronize, not 2.
> > How do you handle the case of a subtitle file ? Or an seperate audio
> > file ?
> 
> I think the right solution is to have a global reference clock, with all
> the other threads trying to be syncronous to this "metronome" thread.
> This means that audio is not imposing the timing as it is now, but trying
> to catchup with the metronome (via resampling, skipping samples or
> manipulating hardware clock).

This is _not_ that easy. Also, a reference clock that is not 100% stable
will make the sound wobble (keep in mind here that even a stable
reference quartz doesn't mean that the clock a process sees is
stable).
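As a toy illustration (hypothetical numbers, plain C): if the audio path
resamples to follow the reference, any jitter in the reference shows up
one-to-one as pitch modulation.

#include <stdio.h>

int main(void)
{
    const double nominal_hz = 48000.0;
    /* per-tick error of the reference as seen by the audio thread; even
     * a stable quartz can look jittery through scheduling delays */
    const double jitter[] = { 0.0, +0.002, -0.001, +0.003, 0.0 };

    for (int i = 0; i < 5; i++) {
        double ratio = 1.0 + jitter[i];   /* resampling ratio tracking the ref */
        printf("tick %d: ratio %.4f -> effective rate %.0f Hz (%+.2f%% pitch)\n",
               i, ratio, nominal_hz * ratio, jitter[i] * 100.0);
    }
    return 0;
}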

> Then we have the choice of where the metronome takes its reference from.
> Some possibilities: RTC (accurate playing of a local file), PTS
> (for accurate playing of broadcast material), sound output clock
> (to avoid resampling), cache buffers usage (for realtime streams
> from the net or DVB), video output clock (to avoid skipped and duplicated
> frames when the output is, say, PAL interlaced)...

The only sources that are stable enough come directly from a quartz, i.e.
only the RTC or the sound card can be used. All the others have too much
phase jitter. I also rule out the video output clock because of
its coarse resolution of ~33ms (at 30fps).

> Mplayer is a wonderful tool as is now, but I found a big inconvenience
> when playing real time streams (DVB, in my case). Mplayer is always
> too fast and empties the buffers or too slow and overflows the buffers.

Hmm.. now that's an interesting problem.

> I hacked a simple solution here: I monitor the cache buffer fill status
> and skip audio samples to go faster or play audio samples twice to go slower,
> keeping the cache at a near-constant level. The implementation is
> a hack, the quality of audio is compromised and it is conceptually wrong
> (think about VBR streams), but I solved my problem with a few lines of code
> and I'm able to play from /dev/dvb/adapter0/dvr0 reliably.
> (the recent patch involving faster/slower playback may help solving
> this in a better way)

Is it possible to extract a clock from the DVB stream, without relying on
the bandwidth? If so, that clock should be used to adjust MPlayer's
reference, i.e. by implementing a PLL in software.
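A minimal sketch of such a software PLL, assuming the demuxer can report
its buffer fill level (all names are hypothetical): a small PI controller
nudges the playback rate so the buffer stays near a target, instead of
skipping or doubling samples.

/* Minimal software-PLL sketch (hypothetical names). A PI controller
 * derives a playback-rate multiplier from the input buffer fill level;
 * for MPEG-TS/DVB the PCR timestamps could serve as the error source
 * instead. The rate would be fed into a resampler rather than realized
 * by dropping or doubling samples. */
struct spll {
    double target;   /* desired buffer fill, 0..1 */
    double kp, ki;   /* proportional / integral gains */
    double integ;    /* integral accumulator */
};

/* fill: current buffer fill 0..1; returns rate multiplier near 1.0 */
double spll_update(struct spll *p, double fill)
{
    double err = fill - p->target;   /* >0: data arriving faster than we play */
    p->integ += err;
    double rate = 1.0 + p->kp * err + p->ki * p->integ;

    /* clamp so a burst can never cause an audible pitch jump */
    if (rate > 1.005) rate = 1.005;
    if (rate < 0.995) rate = 0.995;
    return rate;                     /* feed into the audio resampler */
}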

> Finally, having many fully decoupled functional blocks passing messages
> between them could permit some heavy manipulation of the video, e.g.
> frame rate changing on fly (so itc and deinterlace in real time),
> a good OSD support, a good pause/play support (with OSD working even
> in paused mode because the display thread can go on while decoding
> is stopped), analyzing or dumping blocks everywhere and lots of
> crazy features (multi input support with PIP and audio mixing, with
> one source paused and another playing at double speed)...

None of these depends on multithreading, and none of them will get easier
by splitting the work into multiple threads.

> The "use sound as time base" approach simplifies the basic functionalities
> but creates nasty problems (if my sound board plays at about 48600Hz instead
> of 48000Hz, the video speed is about 1.25% faster, emptying my buffers).
> Let me also say that we have the "-nosound" option, so we have to suddenly
> abandon the sound base and switch to a RTC mode.
> But then we have -autosync, so we're doing a hybrid of two clock sources
> and trying to keep the A-V delta under control.
> Compare to alternative approach: everyone syncs to the reference.
> How the reference is generated is not important.

It is. The reference needs a certain phase stability: a
constant A/V desync is less noticeable than a wobbling one, not to mention
that even the slightest wobble in the audio is very annoying.

> The recent discussions about tuning the delays and loop spinning speed
> and the risk of messing everything up is a clear consequence of a
> basic design which is going beyond its possibilities.

I would rather say it's a clear sign that playing videos from different
sources on different hardware with only one application is not that
easy.

> 
> > I still think it's not true. I agree that the current state machine (an
> > event loop is something different) is not easy to understand and
> > contains a lot of obfuscated code.
> 
> IMHO both a state machine and an event loop work in this way: you
> decide what to do next based on program flow or a global state
> or a table of events. When you decide to do "X", you do "X", then

Nope, an event loop reacts to events that happen at random points in
time, while a state machine is triggered periodically and works on
its internal state.
OK, the state machine can be seen as a special case of an event loop
where only one type of event occurs, at evenly spaced intervals.
But they are clearly not the same thing.
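Schematically (a stubbed-out C sketch with hypothetical names, just to
contrast the two control structures):

/* Schematic contrast (hypothetical names, handlers stubbed out). An
 * event loop blocks for events that arrive at random points in time;
 * a state machine is ticked periodically and advances internal state. */
typedef enum { EV_KEY, EV_TIMER, EV_QUIT } event_t;
typedef enum { ST_DEMUX, ST_DECODE, ST_DISPLAY } state_t;

event_t wait_for_event(void);            /* blocks indefinitely */
void    dispatch(event_t ev);
void    demux_packet(void);
void    decode_frame(void);
void    show_frame(void);
void    sleep_until_next_tick(void);

void event_loop(void)                    /* reacts to random events */
{
    for (;;)
        dispatch(wait_for_event());
}

void state_machine(void)                 /* periodic, internal state */
{
    state_t st = ST_DEMUX;
    for (;;) {
        switch (st) {
        case ST_DEMUX:   demux_packet(); st = ST_DECODE;  break;
        case ST_DECODE:  decode_frame(); st = ST_DISPLAY; break;
        case ST_DISPLAY: show_frame();   st = ST_DEMUX;   break;
        }
        sleep_until_next_tick();
    }
}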

> you go back to another decision. You are just implementing a (hopefully
> smart) scheduler, but it is just a cooperative multitasking scheduler.
> If "X" takes 200ms, bye bye syncronization.

Nope. Although schedulers are similar to state machines, they have one
big drawback in our case: they have no idea about the dataflow within
the "tasks" they run, whereas we know exactly what's going to happen and
thus can optimize for it.

> With multiple threads (on a single CPU) you're doing a similar thing,
> but it is now preemptive multitasking. I can stop the decoding of a frame
> to write some audio to the sound card immediately.

And destroy cache coherency completely.

> For example: I'm displaying frame 100 and playing sound 100, I have
> frame 101 and sound 101 already decoded in my buffers, but they have to
> go out 10ms and 12ms later than now, I can consider decoding frame 102
> now, but if 10ms are not enough I will miss the 101 deadline.
> If decoding is done by a thread I go on decoding frame 102 and after
> 10ms the video output thread can preempt the decoder and show frame 101,
> let the decoder continue to work on frame 102 and then the same happens
> for the sound after 2ms.

This approach was taken by MPlayerXP; see there for results.
(I don't really know why they stopped; either it didn't work out or they
lacked developers.)
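For the curious, the pipeline described above essentially reduces to a
small FIFO of frame pointers between a decoder thread and a
higher-priority output thread. A minimal POSIX-threads sketch
(hypothetical names; the mutex and condition variables would need
PTHREAD_*_INITIALIZER or pthread_*_init before use):

/* Minimal sketch of a frame FIFO between a decoder thread and a
 * higher-priority output thread (hypothetical names). Only pointers
 * are passed; note the caveat above that the FIFO traffic itself
 * already costs cache efficiency. */
#include <pthread.h>

#define FIFO_LEN 4

struct fifo {
    void *buf[FIFO_LEN];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
};

/* decoder thread: blocks while the FIFO is full */
void fifo_put(struct fifo *f, void *frame)
{
    pthread_mutex_lock(&f->lock);
    while (f->count == FIFO_LEN)
        pthread_cond_wait(&f->not_full, &f->lock);
    f->buf[f->head] = frame;                 /* pass the pointer only */
    f->head = (f->head + 1) % FIFO_LEN;
    f->count++;
    pthread_cond_signal(&f->not_empty);
    pthread_mutex_unlock(&f->lock);
}

/* output thread: blocks while the FIFO is empty */
void *fifo_get(struct fifo *f)
{
    pthread_mutex_lock(&f->lock);
    while (f->count == 0)
        pthread_cond_wait(&f->not_empty, &f->lock);
    void *frame = f->buf[f->tail];
    f->tail = (f->tail + 1) % FIFO_LEN;
    f->count--;
    pthread_cond_signal(&f->not_full);
    pthread_mutex_unlock(&f->lock);
    return frame;
}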

> With a good priority strategy (output thread maximum priority, input
> thread minimum) you can get a very low latency or use big FIFO
> buffers to handle sudden CPU unavailability caused by other processes.

FIFOs pollute your cache: performance loss.

 
> And don't forget what happens when many CPUs are available...

And don't forget that RAM is slow compared to L2 cache.

> Am I describing DirectX Graph Flows? Am I describing gstreamer? I don't
> know, it's just the way I see it.

Yes, it's somewhat similar. But DirectX (or rather DShow) sucks because
of its bad design, and gstreamer seems to be very slow.

> Sorry, I'm realizing I've written an essay without wanting to.
> In conclusion, the threaded approach has great potential, but two
> great disadvantages too:
> 
> 1) less performance in basic situations

Actually, less performance in nearly all situations.
Besides, the basic situation covers 99% of all cases
(most users don't even use filters).

> 2) difficult coding
> 
> As my stupid bzip2 test seems to say that 1) is not serious (or just not
> existent), it remains 2). But 2) is just a challenge waiting for someone
> to accept it.
> 
> I expect the answer to what I wrote is "so try it yourself and tell us
> if it works", so let me say I have no time to make experimentation
> on this matters, but I would be interested in helping on some aspects
> (all the sync stuff looks cool to work on).

You got it :)

> I hope to receive more insightful comments than flames :-)

Sorry, our master flamer got cured in Tibet ;)

				Attila Kinali



