[MPlayer-dev-eng] Adding threaded functionality to mplayer NODAEMON

D Richard Felker III dalias at aerifal.cx
Sat Sep 11 03:15:47 CEST 2004


On Fri, Sep 10, 2004 at 10:34:27PM +0200, Roberto Ragusa wrote:
> On Fri, 10 Sep 2004 10:59:21 -0400
> D Richard Felker III <dalias at aerifal.cx> wrote:
> 
> It looks like you turned this thread into a flame, so I wasn't going
> to reply. Then I decided to reply on technical merit, because my
> purpose was only to discuss technical ideas and involve people
> in thinking about issues.

ok i'll write a more technical reply too.

> > On Fri, Sep 10, 2004 at 12:04:26PM +0200, Roberto Ragusa wrote:
> 
> > > According to vmstat there were 10 cs/s, so a timeslice of 100ms, which is
> > > still large, but not very far from a player requirement (10ms?).
> > 
> > huh? i think you got this backwards. with 100ms timeslices, video
> > smoothness will be DESTROYED.
> 
> But I didn't say 100, I said 10! :-)

look at what i quoted. you said 100ms. maybe you wrote the numbers
wrong though.

> 100ms is not far from 10ms (only a factor of 10)

100ms is several frames' duration. 10ms is less than one frame. HUGE
difference.
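
to put numbers on that, here is a rough sketch (python, purely
illustrative). the 25fps and 29.97fps rates are assumed typical values,
not something measured in this thread, and `frames_per_timeslice` is
just a made-up helper name:

```python
def frames_per_timeslice(timeslice_ms, fps):
    """How many frame durations fit into one scheduler timeslice."""
    frame_ms = 1000.0 / fps  # duration of a single frame in milliseconds
    return timeslice_ms / frame_ms


# assumed frame rates: 25 fps (PAL) and 30000/1001 fps (NTSC)
for fps in (25.0, 30000 / 1001):
    for timeslice_ms in (100, 10, 1):
        n = frames_per_timeslice(timeslice_ms, fps)
        print(f"{timeslice_ms:4d} ms timeslice at {fps:6.3f} fps = {n:.3f} frames")
```

anything above 1.0 frames means a thread can sit idle past a frame's
display time before it gets scheduled again.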

> Anyway, I moved to 1ms (1000Hz) later to not be too optimistic.
> 
> > > Assuming a 1000 cycles penalty for one context switch, we are "wasting"
> > > 250000 cycles, which is 0.01% at 2GHz.
> > 
> > ok, someone needs to get a clue here. 2ghz is not where this matters.
> > 300mhz is where it matters.
> 
> Remove 2GHz, use 300MHz. Result is 0.1%. That is 300MHz against 299.7MHz.
> 
> > we are not idiot windows-video-player
> > writers who think it's ok to require someone to buy a new computer
> > every year or even every five years. anyway ruining the cache
> > coherency (which you WILL do if you pass video thru multiple
> > threads!!) will kill performance even on 2ghz.
> 
> So you're concerned about cache coherency on 300MHz hardware,
> but it's unlikely that the cache on 300MHz hardware can hold
> an entire frame (+ additional data structures + code).

if it's a 300 mhz p2 (not celeron) or a xeon (did xeons that slow even
exist? i'm not sure), then it probably has plenty of cache. even if
not, you'll be playing a smaller movie (typically 640x288 or 512x384)
since a box that slow can never play dvd resolution.

> > > My idea is that, yes, there is a cost, but it is smaller than what
> > > we think. 250cs/s or even 1000cs/s are easily manageable. My
> > > timer interrupt ticks at 1000Hz even when the system is idle.
> > > At 1000Hz, there are still one million cycles between ticks.
> > 
> > 1000hz timer hurts performance bad. it'll decrease overall speed by
> > about 5% on my system (500mhz) compared to 100hz timer.
> 
> I will not comment on your data point, I assume you're right.
> Just want to clarify that where I wrote "my timer interrupt", I
> mean "the timer interrupt in a standard Linux kernel".

so do i... i don't think your data scales right. newer cpus have
probably taken a lot of steps to make context switches less expensive.
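
for reference, the overhead arithmetic quoted earlier can be reproduced
with a few lines of python. this is only a sketch: the 1000-cycle
penalty is the assumption from the quoted text, `switch_overhead` is a
made-up name, and the model counts only the direct cost of the switch,
not the cache refills afterwards (which is the part being argued about):

```python
def switch_overhead(switches_per_sec, cycles_per_switch, cpu_hz):
    """Fraction of cpu cycles spent on the direct cost of context switches."""
    return switches_per_sec * cycles_per_switch / cpu_hz


CYCLES_PER_SWITCH = 1000  # assumed penalty, taken from the quoted text

for cpu_hz in (2e9, 500e6, 300e6):
    for switches_per_sec in (250, 1000):
        frac = switch_overhead(switches_per_sec, CYCLES_PER_SWITCH, cpu_hz)
        print(f"{cpu_hz / 1e6:6.0f} MHz, {switches_per_sec:4d} cs/s: "
              f"{frac * 100:.4f}% of cycles")
```

a measured slowdown much larger than what this model predicts would
suggest the fixed per-switch penalty is not the dominant cost.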

> > > > And this means that the contents of the caches will be copied between
> > > > CPUs.  If a frame can fit into your cache, then running each filter on
> > > > a different CPU can actually slow you down.  A DVD frame is 691200
> > > > bytes, which can fit into the cache.  I think using slices can help to
> > > > reduce the cache burden, so the filter chain can finish a complete
> > > > slice before the next slice is decoded.
> > > 
> > > I doubt you can fit 691200 bytes in the (L2) cache on today's processors.
> > 
> > this number is incorrect. a dvd frame is 518400 bytes.
> 
> I didn't calculate the number, just used what the other person said.
> Anyway 518400 is only correct for NTSC (720*480*1.5), PAL should be
> 622080 (720*576*1.5), right?

yep.
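
the frame sizes above are easy to check. a small python sketch,
assuming 12 bits per pixel for the "*1.5" case; reading the 691200
figure as 16 bits per pixel (720*480*2) is my guess at where that
number came from. the cache sizes are the ones people quoted in this
thread, and `frame_bytes` is a made-up helper name:

```python
def frame_bytes(width, height, bits_per_pixel):
    """Size in bytes of one uncompressed frame."""
    return width * height * bits_per_pixel // 8


KIB = 1024

frames = {
    "ntsc dvd 720x480, 12 bpp": frame_bytes(720, 480, 12),
    "pal dvd 720x576, 12 bpp": frame_bytes(720, 576, 12),
    "720x480, 16 bpp (assumed source of 691200)": frame_bytes(720, 480, 16),
    "640x288, 12 bpp": frame_bytes(640, 288, 12),
    "512x384, 12 bpp": frame_bytes(512, 384, 12),
}

# cache sizes as quoted in the thread (totals, not measured here)
caches = {
    "athlon thorton 128+256 KiB": 384 * KIB,
    "athlon barton 128+512 KiB": 640 * KIB,
    "k6-3 board cache 1 MiB": 1024 * KIB,
}

for fname, size in frames.items():
    fits = [cname for cname, csize in caches.items() if size <= csize]
    print(f"{fname}: {size} bytes, fits in: {', '.join(fits) or 'none'}")
```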

> > > My Athlon (Thorton) has 128KiB L1 and 256KiB L2, so (AMD doesn't
> > > duplicate cache content) just 384KiB. A Barton has 128+512=640, still
> > > less than your 675. But a Duron has just 128+64 and older CPUs are
> > > much more limited (then there is the code/data distinction).
> > > Is optimizing mplayer to use 3% CPU instead of 6% CPU when playing
> > > a DVD on a monster CPU a reasonable goal?
> > 
> > actually mplayer will not even play a 640x480 snow file on a 4ghz cpu
> > right now. why don't you keep up with current events instead of
> > blabbering on and on about threads? and... maybe you want
> > postprocessing, plus realtime inverse telecine, plus ....
> 
> Maybe I didn't explain it well:
> assuming that
> 1) fitting a frame into cache helps speed a lot

yes, but even if a whole frame doesn't fit, cache coherency still
helps. for example, when decoding a frame, assuming motion isn't too
chaotic, motion vectors for several subsequent macroblocks will come
from the same general area of the picture, and it will help for this
area to be in the cache. also, cache helps a lot with slice
processing.
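
here is roughly what slice processing with boundary pixels looks like,
as a toy python sketch. `blur_rows` and `blur_sliced` are made-up names,
not mplayer functions, and the filter is a trivial vertical 3-tap
average. each slice reads one extra row above and below itself, so the
sliced result matches filtering the whole frame at once:

```python
def blur_rows(frame, y0, y1):
    """Vertical 3-tap average for rows y0..y1-1.

    Reads one row of context above and below the requested range,
    clamping at the top and bottom edges of the picture.
    """
    height = len(frame)
    out = []
    for y in range(y0, y1):
        above = frame[max(y - 1, 0)]
        below = frame[min(y + 1, height - 1)]
        out.append([(a + b + c) // 3 for a, b, c in zip(above, frame[y], below)])
    return out


def blur_sliced(frame, slice_height):
    """Filter the frame one slice at a time instead of all at once."""
    out = []
    for y0 in range(0, len(frame), slice_height):
        y1 = min(y0 + slice_height, len(frame))
        out.extend(blur_rows(frame, y0, y1))
    return out


# small synthetic frame: 10 rows of 8 pixels
frame = [[(x * 7 + y * 13) % 256 for x in range(8)] for y in range(10)]
print(blur_sliced(frame, 4) == blur_rows(frame, 0, len(frame)))
```

the simplification here is that the whole input frame already exists.
in a real decoder the row below a slice isn't available until the next
slice is decoded, so the filter has to lag behind by its boundary
height, which is the tricky part.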

if you want to compare, try playing a really big movie that won't fit
in your cache, first with cache, then with cache disabled in the bios
setup... :)

> 2) slow CPUs have too small a cache

not true. my k6-3 has 64k l1, 256k l2 (at cpu speed), and 1meg l3 (on
motherboard, bus speed).

> 3) fast CPUs have a (perhaps) big enough cache
> doesn't a strategy assuming big caches only help fast CPUs?

re-read what i just said about snow. snow already requires an insanely
fast cpu....in fact at present it can't be decoded realtime on _any_
cpu. this will change once we get some more optimization but it's
still going to be intensive.

basically, if decoding movies is only taking 3-6% cpu time, it's time
to design a new codec that's a lot more space-efficient, because the
old speed restrictions don't apply anymore.

> > > *If* all the processing could be done on slices, then you would
> > > be right. But only a few algorithms can be sliced (a simple unblur
> > > filter has edge issues) and, wait, the multithreaded approach could
> > > process different slices in parallel, right?
> > 
> > almost everything can be sliced. it just has to be able to read some
> > extra boundary pixels, which is a little tricky but not impossible to
> > handle.
> 
> ok, you're confuting my first point, anyway the second still stands.
> 
> I'm just trying to promote thinking on some issues which mplayer
> definitely has, according to the content of the mailing list:
> 1) a huge main loop, hard to understand, hard to modify because
> side effects could happen

agree. threading makes this worse, not better.

> 2) simple things like OSD during paused video are problematic

changing main won't help this one bit. it's a fundamental problem that
can't be overcome without making the player slower. once you've
displayed a frame, it's very possible that it no longer exists
anywhere readable, so you can't change osd on it without getting a new
frame.

one solution (however i consider this a hack) is to insert an extra
filter that "keeps" a copy of the last frame when you hit the pause
key, then decode one more frame before actually pausing. but that
sucks because you can't pause on the frame you actually wanted.

> 3) playing a 25fps video at 50fps deinterlaced is not possible

this is a fundamental problem in the video filter layer and the main
loop structure. again it has nothing to do with being non-threaded.
mplayer g2 can do full-rate deinterlacing just fine, because it runs
the filter chain in the correct order (pulling from the end).
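
to illustrate what "pulling from the end" buys you, a minimal python
sketch using generators. this is NOT the g2 api; `decode`,
`bob_deinterlace` and `play` are hypothetical names and the "fields"
are just strings. the point is only that when the output side pulls,
a filter is free to hand back two frames for each one it reads:

```python
def decode(n_frames, fps=25.0):
    """Stand-in decoder: yields interlaced frames with a top and bottom field."""
    for i in range(n_frames):
        yield {"pts": i / fps, "top": f"T{i}", "bottom": f"B{i}"}


def bob_deinterlace(frames, fps=25.0):
    """Full-rate deinterlacer: emits one output frame per input field."""
    half_period = 0.5 / fps
    for f in frames:
        yield {"pts": f["pts"], "image": f["top"]}
        yield {"pts": f["pts"] + half_period, "image": f["bottom"]}


def play(chain):
    """Stand-in video output: pulls frames from the end of the chain."""
    return [(f["pts"], f["image"]) for f in chain]


for pts, image in play(bob_deinterlace(decode(3))):
    print(f"{pts:.2f}s {image}")
```

nothing in this needs a second thread; the output just asks for the
next frame and the chain does whatever work is needed to produce it.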

> 4) playing real time streams without overflows/underflows is not possible

not sure what this means exactly.

> 5) playing a 1fps animation affects reaction times of keypresses

yes, this sucks. but it's easy to write a correct main loop without
this problem.
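
a rough sketch of such a loop in python, assuming a unix system where
`select` works on the input file descriptor. `run_player` and its
callbacks are hypothetical, not mplayer code. the loop sleeps until
either the next frame is due or input arrives, so a key is handled
right away even if frames are a second apart:

```python
import os
import select
import time


def run_player(key_fd, frame_period, n_frames, on_key, on_frame):
    """Show n_frames frames, one every frame_period seconds, while
    staying responsive to input on key_fd (a unix file descriptor)."""
    watch = [key_fd]
    next_frame = time.monotonic()
    shown = 0
    while shown < n_frames:
        # wait for input, but no longer than until the next frame is due
        timeout = max(0.0, next_frame - time.monotonic())
        readable, _, _ = select.select(watch, [], [], timeout)
        if readable:
            data = os.read(key_fd, 1)
            if data:
                on_key(data)
            else:
                watch = []  # end of input: stop watching the descriptor
        if time.monotonic() >= next_frame:
            on_frame(shown)
            shown += 1
            next_frame += frame_period
```

usage would be something like `run_player(0, 1.0, 10, handle_key,
show_frame)` with stdin in raw mode; the two callbacks are whatever
the player does for a keypress and a frame.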

> Please tell me if I'm wrong about these points (I'd like to hear that
> I'm misinformed on 3) and 4) ).

you're probably correct, but it's most definitely not a fundamental
problem of a single-task state machine implementation. it's just that
mplayer's main sucks.

> The answer on this issues has often been "it's difficult to fix
> with the current architecture".

yes. again, the current architecture is BAD. that's why arpi started
mplayer g2. but notice mplayer g2 isn't threaded either, because
threads aren't the answer. we all know how to make a good architecture
with a proper main loop, but we're just too lazy to code anything.

> And then there is the fact that mplayer is inadequate for SMP,
> while SMP starts to spread around.

a good design can use threads, but does not depend on it. most of the
work i was doing on designing a new video layer couldn't care less
whether the calling app is using threads or not...

> But if someone tries to think clearly about where the improvements
> can be made and gets flames back, progress is difficult.

i'm just tired of pro-thread propaganda...

rich




