[MPlayer-dev-eng] Adding threaded functionality to mplayer NODAEMON

Fri Sep 10 22:34:27 CEST 2004

On Fri, 10 Sep 2004 10:59:21 -0400
D Richard Felker III <dalias at aerifal.cx> wrote:

It looks like you turned this thread into a flame, so I wasn't going
to reply. Then I decided to reply on the technical merits, because my
purpose was only to discuss technical ideas and get people
thinking about the issues.

> On Fri, Sep 10, 2004 at 12:04:26PM +0200, Roberto Ragusa wrote:

> > According to vmstat there were 10 cs/s, so a timeslice of 100ms, which is
> > still large, but not very far from a player requirement (10ms?).
> 
> huh? i think you got this backwards. with 100ms timeslices, video
> smoothness will be DESTROYED.

But I didn't say a player needs 100ms, I said 10ms! :-)
100ms is not far from 10ms (only a factor of 10).
Anyway, I later moved to 1ms (1000Hz) so as not to be too optimistic.

> > Assuming a 1000 cycles penalty for one context switch, we are "wasting"
> > 250000 cycles, which is 0.01% at 2GHz.
> 
> ok, someone needs to get a clue here. 2ghz is not where this matters.
> 300mhz is where it matters.

Replace 2GHz with 300MHz. The result is about 0.1%, that is, 300MHz
against 299.7MHz.
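
The arithmetic above can be checked with a few lines. This is only a
rough sketch: the 1000-cycle penalty per switch is the assumption made
in this mail, the function name is made up for illustration, and the
model ignores any cache refill cost after a switch, which is exactly
the point under dispute.

```python
# Back-of-the-envelope cost of context switches as a fraction of CPU time.
# cycles_per_switch = 1000 is the assumption from the mail, not a measurement.

def switch_overhead(switches_per_sec, cycles_per_switch, clock_hz):
    """Return the fraction of all CPU cycles spent on context switches."""
    return switches_per_sec * cycles_per_switch / clock_hz

for clock_hz in (2_000_000_000, 300_000_000):
    frac = switch_overhead(250, 1000, clock_hz)
    print(f"{clock_hz / 1e6:.0f} MHz: {frac * 100:.4f}% overhead")
```

With these inputs, 250 cs/s comes out at roughly 0.01% at 2GHz and a
little under 0.1% at 300MHz. Raising the rate to 1000 cs/s at 300MHz
would give about 0.33% under the same assumption.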

> we are not idiot windows-video-player
> writers who think it's ok to require someone to buy a new computer
> every year or even every five years. anyway ruining the cache
> coherency (which you WILL do if you pass video thru multiple
> threads!!) will kill performance even on 2ghz.

So you're concerned about cache coherency on 300MHz hardware,
but it is unlikely that the cache on 300MHz hardware can hold
an entire frame (+ additional data structures + code).

> > My idea is that, yes, there is a cost, but it is smaller than what
> > we think. 250cs/s or even 1000cs/s are easily manageable. My
> > timer interrupt ticks at 1000Hz even when the system is idle.
> > At 1000Hz, there are still one million cycles between ticks.
> 
> 1000hz timer hurts performance bad. it'll decrease overall speed by
> about 5% on my system (500mhz) compared to 100hz timer.

I will not comment on your data point; I assume you're right.
I just want to clarify that where I wrote "my timer interrupt", I
meant "the timer interrupt in a standard Linux kernel".

> > > And this means that the contents of the caches will be copied between
> > > CPUs.  If a frame can fit into you cache, then running each filter on
> > > a different CPU can actually slow you down.  A DVD frame is 691200
> > > bytes, which can fit into the cache.  I think using slices can help to
> > > reduce the cache burden, so the filter chain can finish a complete
> > > slice before the next slice is decoded.
> > 
> > I doubt you can fit 691200 bytes in the (L2) cache on today processors.
> 
> this number is incorrect. a dvd frame is 518400 bytes.

I didn't calculate the number, just used what the other person said.
Anyway 518400 is only correct for NTSC (720*480*1.5), PAL should be
622080 (720*576*1.5), right?
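
A quick check of the frame sizes mentioned in this thread. The 1.5
bytes per pixel figure corresponds to planar 4:2:0 video (12 bits per
pixel). That 691200 equals 720*480*2, i.e. a 16 bits per pixel format,
is my own observation and was not stated by anyone in the thread. The
function name is made up for illustration.

```python
# Uncompressed frame size in bytes for a given resolution and pixel depth.
# Bits per pixel is used instead of 1.5 bytes to keep the math in integers.

def frame_bytes(width, height, bits_per_pixel):
    """Return the size in bytes of one uncompressed frame."""
    return width * height * bits_per_pixel // 8

for name, (w, h) in {"NTSC": (720, 480), "PAL": (720, 576)}.items():
    print(name, "12 bpp:", frame_bytes(w, h, 12), "bytes,",
          "16 bpp:", frame_bytes(w, h, 16), "bytes")
```

At 12 bits per pixel this gives 518400 bytes for NTSC and 622080 bytes
for PAL, matching the numbers above.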

> > My Athlon (Thorton) has 128KiB L1 and 256KiB L2, so (AMD doesn't
> > duplicate cache content) just 384KiB. A Barton has 128+512=640, still
> > less than your 675. But a Duron has just 128+64 and older CPUs are
> > much more limited (then there is the code/data distinction).
> > Is optimizing mplayer to use 3% CPU instead of 6% CPU when playing
> > a DVD on a monster CPU a reasonable goal?
> 
> actually mplayer will not even play a 640x480 snow file on a 4ghz cpu
> right now. why don't you keep up with current events instead of
> blabbering on and on about threads? and... maybe you want
> postprocessing, plus realtime inverse telecine, plus ....

Maybe I didn't explain it well.
Assuming that
1) fitting a frame into the cache helps speed a lot
2) slow CPUs have too small a cache
3) fast CPUs have a (perhaps) big enough cache
doesn't a strategy that assumes big caches only help fast CPUs?
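
To make the comparison concrete, here is a small sketch that checks
which of the caches quoted above could hold a whole 4:2:0 frame. The
cache sizes are the ones given in this thread (L1 + L2 added together,
since AMD caches are exclusive). The check is deliberately optimistic:
it ignores code, other data and the code/data split of L1. The helper
name is made up for illustration.

```python
KIB = 1024

# L1 + L2 sizes in KiB as quoted in the mail (exclusive caches, so they add).
caches_kib = {"Duron": 128 + 64, "Thorton": 128 + 256, "Barton": 128 + 512}

# 4:2:0 frame sizes in bytes at 1.5 bytes per pixel.
frames = {"NTSC": 720 * 480 * 3 // 2, "PAL": 720 * 576 * 3 // 2}

def frame_fits(cache_kib, frame_size):
    """True if a frame of frame_size bytes fits in cache_kib KiB of cache."""
    return frame_size <= cache_kib * KIB

for cpu, kib in caches_kib.items():
    fits = [name for name, size in frames.items() if frame_fits(kib, size)]
    print(f"{cpu:8s} {kib:4d} KiB: fits {fits if fits else 'no full frame'}")
```

With these numbers only the Barton can hold a full frame, and with
little room left for anything else.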

> making a slow program just so you can be an idiot lazyass java-ite
> thread coder is what's not a reasonable goal.

Please, why do you use insults?
I have always said that the coding would be more difficult, not easier,
and my point was that maybe the difficulty could reward us with *more*
speed and more features.

FYI, I have never used Java; only asm, DSP asm and C code. I don't
consider myself an expert, but I did code for performance in the past
(google for RC5 cores on ppc, if you want).

> > *If* all the processing could be done on slices, then you would
> > be right. But only a few algorithms can be sliced (a simple unblur
> > filter has edge issues) and, wait, the multithreaded approach could
> > process different slices in parallel, right?
> 
> almost everything can be sliced. it just has to be able to read some
> extra boundary pixels, which is a little tricky but not impossible to
> handle.

OK, you're refuting my first point, but the second still stands.
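
Both points can be illustrated together with a toy sketch: a filter
that works on slices by reading extra boundary rows, with the slices
handed to a thread pool. This is not mplayer code. The 3-tap vertical
blur, the function names and the one-row halo are all made up for
illustration, and in CPython the threads only show the structure, since
the interpreter lock prevents a real speedup for pure Python code.

```python
from concurrent.futures import ThreadPoolExecutor

def blur_rows(rows):
    """Vertical 3-tap box blur over a list of rows; edge rows are clamped."""
    n = len(rows)
    out = []
    for y in range(n):
        above = rows[max(y - 1, 0)]
        below = rows[min(y + 1, n - 1)]
        out.append([(a + b + c) // 3 for a, b, c in zip(above, rows[y], below)])
    return out

def blur_slice(frame, start, stop, halo=1):
    """Blur rows [start, stop), reading up to `halo` extra rows on each side."""
    lo = max(start - halo, 0)
    hi = min(stop + halo, len(frame))
    blurred = blur_rows(frame[lo:hi])
    # Drop the halo rows: they were only needed as input for their neighbours.
    return blurred[start - lo:start - lo + (stop - start)]

def blur_sliced(frame, slice_height, workers=2):
    """Blur the whole frame slice by slice, one task per slice."""
    bounds = [(s, min(s + slice_height, len(frame)))
              for s in range(0, len(frame), slice_height)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: blur_slice(frame, *b), bounds)
        return [row for part in parts for row in part]

# Small synthetic frame: 10 rows of 8 pixels.
frame = [[(x * 7 + y * 13) % 256 for x in range(8)] for y in range(10)]
print(blur_sliced(frame, 3) == blur_rows(frame))
```

The sliced result matches the whole-frame result because each slice
sees the same neighbouring rows it would see in the full frame. A
filter with a larger vertical support would need a larger halo.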

I'm just trying to promote thinking about some issues which mplayer
definitely has, according to the content of the mailing list:
1) a huge main loop, hard to understand and hard to modify because
side effects could happen
2) simple things like OSD during paused video are problematic
3) playing a 25fps video at 50fps deinterlaced is not possible
4) playing real time streams without overflows/underflows is not possible
5) playing a 1fps animation affects reaction times of keypresses

Please tell me if I'm wrong about these points (I'd like to hear that
I'm misinformed on 3) and 4) ).

The answer to these issues has often been "it's difficult to fix
with the current architecture".
And then there is the fact that mplayer is inadequate for SMP,
while SMP is starting to spread.

It's not my intention to discredit mplayer or its authors; mplayer
is still the best player around and I'm happy about that.

But if someone tries to think clearly about where improvements
can be made and gets flames back, progress is difficult.

Respectfully,
-- 
   Roberto Ragusa    mail at robertoragusa.it