[MPlayer-dev-eng] Adding threaded functionality to mplayer NODAEMON
mail at robertoragusa.it
Fri Sep 10 12:04:26 CEST 2004
On Thu, 9 Sep 2004 15:58:52 -0500 (CDT)
Zoltan Hidvegi <mplayer at hzoli.2y.net> wrote:
> The kernel
> allocates relatively large time slices for such jobs, e.g. when you
> run 10 bzip2 jobs, each one may run for half a second alone before the
> kernel switches to the other one.
According to vmstat there were 10 cs/s, so a timeslice of 100ms, which is
still large, but not very far from a player requirement (10ms?).
> But if the
> threads have to communicate, like in the case of mplayer, you will
> have to handle possibly hundreds of context switches per second, and
> on a single CPU machine, that usually requires kernel overhead.
> Lightweight threads can eliminate the kernel overhead, and a few
> hundred context switches per second is still not too high, so you may
> still be right that with very careful design, you can have a good
> multi-threaded media player.
Suppose a frame is displayed after being passed through 10 different
threads (audio/video effects...); we can estimate one context switch
per frame per thread, so at 25fps that is 250 cs/s.
(A clever assignment of priorities and a good kernel will avoid
switching from thread to thread continuously)
Assuming a 1000-cycle penalty for one context switch, we are "wasting"
250000 cycles per second, which is about 0.01% at 2GHz.
My idea is that, yes, there is a cost, but it is smaller than what
we think. 250cs/s or even 1000cs/s are easily manageable. My
timer interrupt ticks at 1000Hz even when the system is idle.
At 1000Hz, there are still one million cycles between ticks.
> But you really have to know what you are
> doing, and debugging multi-threaded code can be extremely difficult.
Well, this is the "it's difficult" argument, and I agree.
I never said it is easy. I only say it is possible and worth trying.
> And if you have multiple processors, then you can have interaction
> between real threads without kernel overhead, but then you have to be
> careful accessing the same memory by two threads, because that will
> create cache coherency traffic. You also have to be careful about
> being consistent about which thread operates on which memory areas,
> because if you switch between threads, you will thrash the caches.
You have to use message passing. The decoder decodes a frame writing
into buffer 18, then calls send_message(thread_postprocess, buffer);
from then on, buffer 18 is the property of the postprocess thread.
I think v4l uses this approach with video buffers.
> > I'm not an expert, please correct me, but the multi threaded approach
> > should give a nice performance gain on SMP machines, right?
> True, IF you know what you are doing, and the job can be efficiently
> partitioned relatively evenly between the processors in a way that
> they do not fight with each other over the same memory areas.
"If you know what you are doing" is always implied when dealing with
computers (and not only in that case!) :-)
> And this means that the contents of the caches will be copied between
> CPUs. If a frame can fit into your cache, then running each filter on
> a different CPU can actually slow you down. A DVD frame is 691200
> bytes, which can fit into the cache. I think using slices can help to
> reduce the cache burden, so the filter chain can finish a complete
> slice before the next slice is decoded.
I doubt you can fit 691200 bytes in the (L2) cache on today's processors.
My Athlon (Thorton) has 128KiB L1 and 256KiB L2, so (AMD doesn't
duplicate cache content between the levels) just 384KiB. A Barton has
128+512=640KiB, still less than your 675KiB. But a Duron has just
128+64=192KiB, and older CPUs are much more limited (and then there is
the code/data distinction).
Is optimizing mplayer to use 3% CPU instead of 6% CPU when playing
a DVD on a monster CPU a reasonable goal?
Maybe it is better to assume the frame will overflow the cache
and go to RAM, so one CPU writes and another then reads.
Going from DVD video to HDTV video will then not cause a dramatic
slowdown, and HDTV will certainly not play slower than with the current
single-thread approach (RAM is always involved either way, but we have
2 CPUs working).
*If* all the processing could be done on slices, then you would
be right. But only a few algorithms can be sliced (a simple unblur
filter has edge issues) and, wait, the multithreaded approach could
process different slices in parallel, right?
> Yes, an HDTV frame is too big for most processor caches, so in that
> case, it will probably help.
And don't forget all the extra features which can be more easily
implemented with a threaded design. What about this?
mplayer -input name=DVB:source=/dev/dvb/adapter0/dvr0
Receive two channels from DVB, filter them, composite them in a PIP
style with half transparency, and output to CRT, TV and an MPEG file at
the same time.
It may look much too complicated, but we already have almost all the
pieces: input drivers, demuxers, decoders, filters, output drivers,
rescalers... We only miss compositing (not a great task) and a good
infrastructure to put the pieces together (that would include
negotiation of YUV2/RGB and automatic placement of rescalers or
converters) and handle the synchronization.
Ambitious, but not out of reach.
Current mplayer can already display 5 channels from DVB at the same
time on my machine, but you have to run 5 different copies of the
player, position the windows manually, and you cannot switch which
audio you listen to without closing and relaunching everything.
Roberto Ragusa mail at robertoragusa.it