[MPlayer-dev-eng] Adding threaded functionality to mplayer NODAEMON

D Richard Felker III dalias at aerifal.cx
Fri Sep 10 16:59:21 CEST 2004


On Fri, Sep 10, 2004 at 12:04:26PM +0200, Roberto Ragusa wrote:
> On Thu, 9 Sep 2004 15:58:52 -0500 (CDT)
> Zoltan Hidvegi <mplayer at hzoli.2y.net> wrote:
> 
> > The kernel
> > allocates relatively large time slices for such jobs, e.g. when you
> > run 10 bzip2 jobs, each one may run for half a second alone before the
> > kernel switches to the other one.
> 
> According to vmstat there were 10 cs/s, so a timeslice of 100ms, which is
> still large, but not very far from a player requirement (10ms?).

huh? i think you got this backwards. with 100ms timeslices, video
smoothness will be DESTROYED.
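
a rough check of the numbers above, as a sketch only. the 10 cs/s figure is the vmstat reading quoted above; the 25fps frame rate is an assumption borrowed from later in this thread. it treats the timeslice as the worst-case time another process can hold the cpu, which is a simplification.

```python
# Assumed inputs: 10 context switches/s (vmstat reading quoted above)
# and 25 fps playback (frame rate used later in the thread).
def timeslice_ms(context_switches_per_s):
    """Average time between context switches, in milliseconds."""
    return 1000.0 / context_switches_per_s

def frame_period_ms(fps):
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps

slice_ms = timeslice_ms(10)
period_ms = frame_period_ms(25)
# How many frame deadlines pass while one other process holds the CPU.
frames_per_slice = slice_ms / period_ms
print(slice_ms, period_ms, frames_per_slice)
```

under these assumptions one timeslice spans more than two frame periods, so a player waiting behind a single cpu-bound process would miss frame deadlines.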

> > But if the
> > threads have to communicate, like in the case of mplayer, you will
> > have to handle possibly hundreds of context switches per second, and
> > on a single CPU machine, that usually requires kernel overhead.
> > Lightweight threads can eliminate the kernel overhead, and a few
> > hundred context switches per second is still not too high, so you may
> > still be right that with very careful design, you can have a good
> > multi-threaded media player.
> 
> Suppose a frame is displayed after being passed through 10 different
> threads (audio/video effects...), we can estimate one context switch
> per frame per thread, so at 25fps it is 250 cs/s.
> (A clever assignment of priorities and a good kernel will avoid
> switching from thread to thread continuously)
> Assuming a 1000-cycle penalty for one context switch, we are "wasting"
> 250000 cycles per second, which is about 0.01% at 2GHz.

ok, someone needs to get a clue here. 2ghz is not where this matters.
300mhz is where it matters. we are not idiot windows-video-player
writers who think it's ok to require someone to buy a new computer
every year or even every five years. anyway ruining cache
locality (which you WILL do if you pass video thru multiple
threads!!) will kill performance even on 2ghz.
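
the quoted estimate can be redone for both clock speeds. this is a sketch that uses the figures from the quoted text (10 threads, 25fps, 1000 cycles per switch) and counts only the direct switch cost; it does not model the cache effects mentioned above, which are the larger concern.

```python
# Figures taken from the quoted estimate: 10 threads, 25 fps,
# 1000 cycles per context switch. Only the direct switch cost is
# counted; cache refill after each switch is NOT modelled here.
def switches_per_second(threads, fps):
    return threads * fps

def overhead_percent(switches_per_s, cycles_per_switch, clock_hz):
    wasted_cycles_per_s = switches_per_s * cycles_per_switch
    return 100.0 * wasted_cycles_per_s / clock_hz

cs = switches_per_second(10, 25)
at_2ghz = overhead_percent(cs, 1000, 2_000_000_000)
at_300mhz = overhead_percent(cs, 1000, 300_000_000)
print(cs, at_2ghz, at_300mhz)
```

the direct cost is a small fraction of a percent at either clock speed, which is why the argument turns on cache behaviour rather than on the switch penalty itself.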

> My idea is that, yes, there is a cost, but it is smaller than what
> we think. 250cs/s or even 1000cs/s are easily manageable. My
> timer interrupt ticks at 1000Hz even when the system is idle.
> At 1000Hz, there are still one million cycles between ticks.

1000hz timer hurts performance bad. it'll decrease overall speed by
about 5% on my system (500mhz) compared to 100hz timer.

> > But you really have to know what you are
> > doing, and debugging multi-threaded code can be extremely difficult.
> 
> Well, this is the "it's difficult" argument, and I agree.
> I never said it is easy. I only say it is possible and worth trying.

no it's not.

> > And if you have multiple processors, then you can have interaction
> > between real threads without kernel overhead, but then you have to be
> > careful accessing the same memory by two threads, because that will
> > create cache coherency traffic.  You also have to be careful about
> > being consistent which thread operates on which memory areas, because
> > if you switch between threads, you will trash the caches.
> 
> You have to use message passing.
> decoder: decode a frame writing in buffer 18, send_message(thread_postprocess,
> buffer[18]); from now on buffer 18 is property of the postprocess.
> I think v4l uses this approach with video buffers.

umm, you have no idea how complicated proper buffer management is. you
can do crap like this if you want it to be slow like other players...
but this is mplayer, not slowass-newbie-crap-player-#34289342. if you
want to learn, read the long discussions on -g2-dev about video
processing.
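
for reference, the hand-off described in the quoted text looks roughly like the sketch below. it is deliberately naive: the names (run, free_q, ready_q) and the buffer count are hypothetical, it is not mplayer code, and it does not address any of the buffer management problems raised above.

```python
import queue
import threading

def run(n_frames, num_buffers=4, buf_size=16):
    """Naive ownership hand-off between a decoder and a postprocess thread.
    All names and sizes here are hypothetical, for illustration only."""
    buffers = [bytearray(buf_size) for _ in range(num_buffers)]
    free_q = queue.Queue()    # indices of buffers nobody owns
    ready_q = queue.Queue()   # messages: (frame number, buffer index)
    for i in range(num_buffers):
        free_q.put(i)
    processed = []

    def decoder():
        for frame_no in range(n_frames):
            idx = free_q.get()                      # take ownership
            buffers[idx][:] = bytes([frame_no % 256]) * buf_size
            ready_q.put((frame_no, idx))            # ownership moves on
        ready_q.put(None)                           # end-of-stream marker

    def postprocess():
        while True:
            msg = ready_q.get()
            if msg is None:
                break
            frame_no, idx = msg
            processed.append((frame_no, buffers[idx][0]))
            free_q.put(idx)                         # give the buffer back

    threads = [threading.Thread(target=decoder),
               threading.Thread(target=postprocess)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

result = run(10)
print(result)
```

each buffer index is held by exactly one thread at a time, so no lock on the buffer contents is needed, but every frame still costs two queue operations and a thread wake-up.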

> > And this means that the contents of the caches will be copied between
> > CPUs.  If a frame can fit into your cache, then running each filter on
> > a different CPU can actually slow you down.  A DVD frame is 691200
> > bytes, which can fit into the cache.  I think using slices can help to
> > reduce the cache burden, so the filter chain can finish a complete
> > slice before the next slice is decoded.
> 
> I doubt you can fit 691200 bytes in the (L2) cache on today's processors.

this number is incorrect. a dvd frame (720x480, yv12 at 12 bits per
pixel) is 518400 bytes.
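
the two frame sizes in this thread can be reproduced as follows. this is a sketch under assumptions: a 720x480 frame, 12 bits per pixel for planar 4:2:0 and 16 bits per pixel for packed 4:2:2 (which happens to give the 691200 figure). the cache sizes are the ones given in Roberto's message, with L1 and L2 added together as he does, ignoring the code/data split.

```python
# Assumptions: 720x480 frame; 12 bpp = planar 4:2:0 (YV12),
# 16 bpp = packed 4:2:2. Cache totals are the L1+L2 sums from
# Roberto's message, treated as one pool (a simplification).
def frame_bytes(width, height, bits_per_pixel):
    return width * height * bits_per_pixel // 8

KIB = 1024
yv12_frame = frame_bytes(720, 480, 12)
packed_422_frame = frame_bytes(720, 480, 16)

caches = {
    "Thorton 128+256 KiB": (128 + 256) * KIB,
    "Barton 128+512 KiB": (128 + 512) * KIB,
    "Duron 128+64 KiB": (128 + 64) * KIB,
}
# For each cache: does the 12 bpp frame fit, does the 16 bpp frame fit?
fits = {name: (yv12_frame <= size, packed_422_frame <= size)
        for name, size in caches.items()}
print(yv12_frame, packed_422_frame, fits)
```

with the smaller figure a whole frame fits in the Barton total but not in the other two, so whether a frame "fits in cache" depends on both the pixel format and the cpu.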

> My Athlon (Thorton) has 128KiB L1 and 256KiB L2, so (AMD doesn't
> duplicate cache content) just 384KiB. A Barton has 128+512=640, still
> less than your 675. But a Duron has just 128+64 and older CPUs are
> much more limited (then there is the code/data distinction).
> Is optimizing mplayer to use 3% CPU instead of 6% CPU when playing
> a DVD on a monster CPU a reasonable goal?

actually mplayer will not even play a 640x480 snow file on a 4ghz cpu
right now. why don't you keep up with current events instead of
blabbering on and on about threads? and... maybe you want
postprocessing, plus realtime inverse telecine, plus ....

making a slow program just so you can be an idiot lazyass java-ite
thread coder is what's not a reasonable goal.

> *If* all the processing could be done on slices, then you would
> be right. But only a few algorithms can be sliced (a simple unblur
> filter has edge issues) and, wait, the multithreaded approach could
> process different slices in parallel, right?

almost everything can be sliced. it just has to be able to read some
extra boundary pixels, which is a little tricky but not impossible to
handle.
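
a minimal sketch of the boundary-pixel idea, using a 3-tap vertical blur as a stand-in filter. the function names and the edge-clamping rule are assumptions for illustration, not mplayer code. each slice is processed from a window holding the slice plus up to one extra row above and below, and the result matches the whole-frame filter.

```python
def blur_row(above, row, below):
    """3-tap vertical box blur of one row (integer average)."""
    return [(a + b + c) // 3 for a, b, c in zip(above, row, below)]

def blur_full(frame):
    """Reference: filter the whole frame at once, clamping at the edges."""
    h = len(frame)
    return [blur_row(frame[max(y - 1, 0)], frame[y], frame[min(y + 1, h - 1)])
            for y in range(h)]

def blur_sliced(frame, slice_height):
    """Filter slice by slice; each slice sees only itself plus one
    boundary row on each side (fewer at the top/bottom of the frame)."""
    h = len(frame)
    out = []
    for top in range(0, h, slice_height):
        bottom = min(top + slice_height, h)
        lo = max(top - 1, 0)           # extra row above, if it exists
        hi = min(bottom + 1, h)        # extra row below, if it exists
        window = frame[lo:hi]
        for y in range(top, bottom):
            wy = y - lo                # row position inside the window
            above = window[max(wy - 1, 0)]
            below = window[min(wy + 1, len(window) - 1)]
            out.append(blur_row(above, window[wy], below))
    return out

# Small synthetic frame: 10 rows of 8 pixels.
frame = [[(x * 7 + y * 13) % 256 for x in range(8)] for y in range(10)]
same = blur_sliced(frame, 4) == blur_full(frame)
print(same)
```

a filter with a taller kernel would need correspondingly more boundary rows per slice, which is the "extra boundary pixels" cost.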

rich



