[MPlayer-dev-eng] Adding threaded functionality to mplayer NODAEMON

Sun Sep 19 15:29:08 CEST 2004

On Fri, Sep 10, 2004 at 12:04:26PM +0200, Roberto Ragusa wrote:
> > The kernel
> > allocates relatively large time slices for such jobs, e.g. when you
> > run 10 bzip2 jobs, each one may run for half a second alone before the
> > kernel switches to the other one.
> According to vmstat there were 10 cs/s, so a timeslice of 100ms, which is
> still large, but not very far from a player requirement (10ms?).

Uhm... 10ms resolution is the minimum requirement for somewhat smooth
playback. 100ms is 0.1s, a duration which can clearly be seen (note,
we are not talking about a TV 5m away which glows for another half
second after the electron beam has passed).
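
To put the numbers above side by side, here is a small sketch. It
assumes the 10 cs/s figure from vmstat and the 25fps frame rate used
later in this mail; the function names are made up for illustration.

```python
def timeslice_ms(context_switches_per_sec):
    """Average time between context switches, in milliseconds."""
    return 1000.0 / context_switches_per_sec


def frame_period_ms(fps):
    """Time one frame stays on screen, in milliseconds."""
    return 1000.0 / fps


slice_ms = timeslice_ms(10)        # 10 cs/s, as reported by vmstat
period_ms = frame_period_ms(25)    # 25fps playback
frames_per_slice = slice_ms / period_ms

print(slice_ms, period_ms, frames_per_slice)
```

With these inputs a single timeslice spans more than two frame periods,
which is why a 100ms slice shows up as a visible stall.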

> > But if the
> > threads have to communicate, like in the case of mplayer, you will
> > have to handle possibly hundreds of context switches per second, and
> > on a single CPU machine, that usually requires kernel overhead.
> > Lightweight threads can eliminate the kernel overhead, and a few
> > hundred context switches per second is still not too high, so you may
> > still be right that with very careful design, you can have a good
> > multi-threaded media player.
> Suppose a frame is displayed after being passed through 10 different
> threads (audio/video effects...), we can estimate one context switch
> per frame per thread, so at 25fps it is 250 cs/s.
> (A clever assignment of priorities and a good kernel will avoid
> switching from thread to thread continuously)
> Assuming a 1000 cycles penalty for one context switch, we are "wasting"
> 250000 cycles, which is 0.01% at 2GHz.

A cleverly designed kernel will switch context when you pass data
between threads. And those 1000 cycles are the raw context switch time,
i.e. the time needed to save the registers, change the data in the OS
tables and restore the registers of the next process. This is not the
time really needed, as it's very unlikely that both threads are
completely in the L1 cache, i.e. you will have to wait until the code
is loaded from RAM.
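
The estimate being discussed can be written down as a short sketch.
The 250 cs/s, 1000 cycles and 2GHz figures are the ones from the quoted
text; the cache refill penalty is a purely hypothetical placeholder,
not a measurement, and is only there to show how strongly the result
depends on it.

```python
def switch_overhead(switches_per_sec, cycles_per_switch, clock_hz):
    """Fraction of CPU time spent on context switches."""
    return switches_per_sec * cycles_per_switch / clock_hz


CLOCK_HZ = 2e9            # 2GHz, from the quoted estimate
SWITCHES_PER_SEC = 250    # 10 threads * 25fps
RAW_CYCLES = 1000         # raw register save/restore cost

# ASSUMPTION: hypothetical extra cycles spent refilling the caches
# after each switch. Replace with a measured value.
REFILL_CYCLES = 100_000

raw = switch_overhead(SWITCHES_PER_SEC, RAW_CYCLES, CLOCK_HZ)
with_refill = switch_overhead(
    SWITCHES_PER_SEC, RAW_CYCLES + REFILL_CYCLES, CLOCK_HZ
)

print(f"raw: {raw:.4%}  with assumed refill: {with_refill:.4%}")
```

The raw figure comes out at about 0.0125%, matching the quoted 0.01%;
the second figure is only as good as the assumed refill cost.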

> My idea is that, yes, there is a cost, but it is smaller than what
> we think. 250cs/s or even 1000cs/s are easily manageable. My
> timer interrupt ticks at 1000Hz even when the system is idle.
> At 1000Hz, there are still one million cycles between ticks.

Attention! An IRQ does not do a context switch! An IRQ switches into
interrupt mode; only the current IP/PC is stored on the stack and,
depending on the architecture, a few important registers. Also keep in
mind that a timer IRQ only increments a counter and checks whether the
currently running process should be preempted, then does a RETI. This
is quite cheap (very few addresses in RAM are accessed).

> > And if you have multiple processors, then you can have interaction
> > between real threads without kernel overhead, but then you have to be
> > careful accessing the same memory by two threads, because that will
> > create cache coherency traffic.  You also have to be careful about
> > being consistent which thread operates on which memory areas, because
> > if you switch between threads, you will trash the caches.
> You have to use message passing.
> decoder: decode a frame writing in buffer 18, send_message(thread_postprocess,
> buffer[18]); from now on buffer 18 is property of the postprocess.
> I think v4l uses this approach with video buffers.

Whoa! Message passing is even more expensive than shared memory.
Note that we are passing data in the range of several MB/s around here,
not just a few bytes/s like in the most common message passing systems.
memcpy is not really something you want to do too often (note that the
PCI and AGP optimized memcpy versions gave me a 5% speed boost. Note
also that memcpy was already a highly optimized and CPU specific asm
construct at that time).
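
For reference, a minimal sketch of the buffer handoff described in the
quoted text. All names (run_pipeline, decoder, postprocess, the buffer
count) are hypothetical and this is not MPlayer code. Only the buffer
index travels through the queue, so the frame data itself is not
copied; the queue synchronization still has its own cost, which this
sketch does not measure.

```python
import queue
import threading

NUM_BUFFERS = 4
FRAME_BYTES = 691200   # DVD frame size quoted in this thread
STOP = None            # sentinel telling the consumer to finish


def decoder(buffers, free_q, ready_q, num_frames):
    for n in range(num_frames):
        idx = free_q.get()            # take ownership of a free buffer
        buffers[idx][0] = n % 256     # stand-in for real decoding
        ready_q.put(idx)              # buffer now belongs to postprocess
    ready_q.put(STOP)


def postprocess(buffers, free_q, ready_q, seen):
    while True:
        idx = ready_q.get()
        if idx is STOP:
            break
        seen.append(buffers[idx][0])  # stand-in for real filtering
        free_q.put(idx)               # give the buffer back to the pool


def run_pipeline(num_frames):
    # Preallocated buffers shared by both threads; only indices are sent.
    buffers = [bytearray(FRAME_BYTES) for _ in range(NUM_BUFFERS)]
    free_q, ready_q = queue.Queue(), queue.Queue()
    for i in range(NUM_BUFFERS):
        free_q.put(i)
    seen = []
    workers = [
        threading.Thread(target=decoder,
                         args=(buffers, free_q, ready_q, num_frames)),
        threading.Thread(target=postprocess,
                         args=(buffers, free_q, ready_q, seen)),
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return seen


print(run_pipeline(10))
```

A thread touches a buffer only while it holds that buffer's index,
which is what makes the handoff safe without copying.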

> "If you know what you are doing" is always implied when dealing with
> computers (and not only in that case!) :-)

Well, actually it applies to the whole of life, though most people seem
to forget that.... Like a certain president of a certain country who
thinks that you can bring peace by killing people.

> > And this means that the contents of the caches will be copied between
> > CPUs.  If a frame can fit into your cache, then running each filter on
> > a different CPU can actually slow you down.  A DVD frame is 691200
> > bytes, which can fit into the cache.  I think using slices can help to
> > reduce the cache burden, so the filter chain can finish a complete
> > slice before the next slice is decoded.
> 
> I doubt you can fit 691200 bytes in the (L2) cache on today's processors.

A slice of 640x16 pixels (at 2 bytes each, i.e. 20480 bytes) still fits
into the L2 cache. Note also that anything that fills more than 20-30%
of the L2 cache will start fighting for cache lines with other memory
users (i.e. other data and code).
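
The sizes involved can be checked with a short sketch. The slice
dimensions and the 2 bytes per pixel are from above; the L2 sizes
(64, 256 and 512 KiB) are the ones mentioned for Duron, Thorton and
Barton in this thread. The 720x480 frame size is an assumption chosen
because it reproduces the quoted 691200 bytes; the helper names are
made up.

```python
KIB = 1024


def buffer_bytes(width, height, bytes_per_pixel=2):
    """Size of a packed image region in bytes."""
    return width * height * bytes_per_pixel


def l2_fraction(nbytes, l2_kib):
    """Share of an L2 cache of l2_kib KiB taken by nbytes of data."""
    return nbytes / (l2_kib * KIB)


slice_bytes = buffer_bytes(640, 16)    # one 640x16 slice
frame_bytes = buffer_bytes(720, 480)   # assumed DVD frame dimensions

for name, l2_kib in (("Duron", 64), ("Thorton", 256), ("Barton", 512)):
    print(f"{name:8s} slice: {l2_fraction(slice_bytes, l2_kib):6.1%}  "
          f"frame: {l2_fraction(frame_bytes, l2_kib):7.1%}")
```

The slice stays well under the 20-30% mark on the 256 and 512 KiB
caches, while a whole frame exceeds all three.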

> My Athlon (Thorton) has 128KiB L1 and 256KiB L2, so (AMD doesn't
> duplicate cache content) just 384KiB. A Barton has 128+512=640, still
> less than your 675. But a Duron has just 128+64 and older CPUs are
> much more limited (then there is the code/data distinction).
> Is optimizing mplayer to use 3% CPU instead of 6% CPU when playing
> a DVD on a monster CPU a reasonable goal?

Playing a run-of-the-mill mpeg4 file on a 2GHz machine is not a
problem, even wmp can do that. Or let a CS freshman write a player app;
it will probably use 50% CPU on a 2GHz machine, but it still has 50% to
waste. We are talking about the cases where you don't have CPU and I/O
overkill, like on a 500MHz machine or when you are playing wavelet
based codecs like snow.

> Maybe it is better to assume the frame will overflow the cache
> and go to RAM, so one CPU writes and another then reads.
> Going from a DVD video to a HDTV video will not have a dramatic speed
> slowdown, and certainly the HDTV will not play slower than the current
> single thread approach (RAM is always involved, but we have 2 CPUs
> now).

If we do that, then we get a performance loss of a factor of 2 to 100
(I never actually measured it, but it will be huge). Try switching off
your L1 and L2 cache (which is about the same as overflowing it) and
run MPlayer. And while you are at it, please measure it, I'd like to
have some numbers.

> *If* all the processing could be done on slices, then you would
> be right. But only a few algorithms can be sliced (a simple unblur
> filter has edge issues) and, wait, the multithreaded approach could
> process different slices in parallel, right?
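
The edge issue mentioned in the quoted text can be shown with a toy
sketch, assuming a 3-tap vertical average as a stand-in for a real
filter. A slice filtered in isolation gets its boundary rows wrong; a
slice that also reads `halo` rows from its neighbours matches the
full-frame result. All names are hypothetical and the slices are
processed one after another here, not in separate threads.

```python
def vblur(rows):
    """3-tap vertical average, clamping at the first and last row."""
    n = len(rows)
    out = []
    for y in range(n):
        above = rows[max(y - 1, 0)]
        below = rows[min(y + 1, n - 1)]
        out.append([(a + c + b) // 3
                    for a, c, b in zip(above, rows[y], below)])
    return out


def vblur_sliced(rows, slice_height, halo):
    """Filter slice by slice, reading `halo` extra rows on each side."""
    n = len(rows)
    out = []
    for start in range(0, n, slice_height):
        end = min(start + slice_height, n)
        lo = max(start - halo, 0)
        hi = min(end + halo, n)
        filtered = vblur(rows[lo:hi])
        # Keep only the rows that belong to this slice.
        out.extend(filtered[start - lo:end - lo])
    return out


# Small test frame: 8 rows of 4 pixels with a non-linear gradient.
frame = [[y * y + x for x in range(4)] for y in range(8)]

full = vblur(frame)
print(vblur_sliced(frame, 2, halo=1) == full)   # slices with halo rows
print(vblur_sliced(frame, 2, halo=0) == full)   # isolated slices
```

Each slice depends only on its own rows plus the halo, so the slices
could in principle be handed to different threads.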

Note: 99.999% of people use MPlayer w/o any filters. Then not only do
slices work but also direct rendering (i.e. rendering directly into
video memory, if you don't know that already).

> And don't forget all the extra features which can be more easily
> implemented.
[...]
> Receive two channels from DVB, filter them, composite them in a PIP
> style with half transparency and output on CRT, TV and an MPEG file at
> the same time.

Already planned for G3, but with a much saner approach.

> Current mplayer can display 5 channels from DVB at the same time on my
> system.
> But you have to run 5 different copies of the player and position your
> windows manually and you are not able to switch which audio you want to
> listen to without closing and relaunching everything.

Architectural limitation because of non-existent design.
(no, that's not a joke, that's the sad truth)

			Attila Kinali