[MPlayer-dev-eng] Adding threaded functionality to mplayer NODAEMON

Attila Kinali attila at kinali.ch
Mon Sep 20 15:19:44 CEST 2004


On Mon, Sep 13, 2004 at 03:43:19PM +0200, Roberto Ragusa wrote:
 
> The summary I can make is that L2 cache thrashing is (according to you
> developers, who have direct experience) really important for performance.

Yes, but even the L1 cache, small as it is, is important.
Some time ago, Rich and I discussed a patch that changed the processing
of 8x8 pixel blocks to handle more of them at once, to make better use
of the parallelism of today's CPUs. Though it should have made that
specific part a few percent faster, it made the _whole_ player a few
percent slower. Our guess at the time was that by processing more data
at once, the working set got kicked out of L1 and had to be reloaded
from L2, significantly decreasing performance.
And yes, the change was reverted.
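
For illustration only (the function names are made up, this is not the
actual patch), the shape of the change was roughly the following, in C:

    #include <stdint.h>

    #define BLOCK 8

    /* one 8x8 block at a time: 64 bytes of pixels, so the code and
     * any tables stay resident in L1 across iterations */
    static void process_block(uint8_t b[BLOCK * BLOCK])
    {
        for (int i = 0; i < BLOCK * BLOCK; i++)
            b[i] = (uint8_t)(b[i] + 1);   /* stand-in for real work */
    }

    /* four blocks interleaved: more independent operations per
     * iteration (nice for superscalar CPUs), but a 4x larger working
     * set that can push code and tables out of L1 */
    static void process_block4(uint8_t *b0, uint8_t *b1,
                               uint8_t *b2, uint8_t *b3)
    {
        for (int i = 0; i < BLOCK * BLOCK; i++) {
            b0[i]++; b1[i]++; b2[i]++; b3[i]++;
        }
    }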

> I pointed out that a frame is almost the entire size of the cache (if not bigger),
> so the filtering stage will not have hot caches after the decoding stage, but
> apparently slice by slice processing in mplayer is more common than I thought.

MPlayer uses slices and direct rendering as often as possible. And as
long as you don't use any fancy filters (which is the case for 99.9% of
the users), data can be passed directly from the decoder to video
memory.
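
Roughly like this (names made up, the real libvo interface differs in
details): the decoder hands over each finished slice immediately, so
the pixels are written once, while they are still hot:

    #include <stdint.h>
    #include <string.h>

    typedef struct vo {
        uint8_t *vram;    /* mapped video memory */
        int      stride;  /* bytes per line in vram */
    } vo_t;

    /* called by the decoder as soon as a horizontal band of the
     * frame is done; copies it straight into video memory */
    static void draw_slice(vo_t *vo, const uint8_t *src,
                           int src_stride, int y, int w, int h)
    {
        for (int i = 0; i < h; i++)
            memcpy(vo->vram + (y + i) * vo->stride,
                   src + i * src_stride, w);
    }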

 
> The buffer fill management is easy, but I don't know how it can be implemented
> in mplayer (as everything depends on audio, I messed with the quantity of
> audio samples, and it worked).

That is actually how it should be done, just not by inserting and
dropping samples, but by doing proper resampling.
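
A minimal sketch of the idea in C (linear interpolation only; a real
resampler would use a proper filter): a ratio slightly off 1.0
stretches or shrinks the audio smoothly, instead of the clicks you get
from dropped or duplicated samples.

    #include <stddef.h>

    /* write at most out_max samples; ratio < 1.0 stretches the
     * audio (produces more samples), ratio > 1.0 shrinks it */
    static size_t resample(const float *in, size_t in_len,
                           float *out, size_t out_max, double ratio)
    {
        size_t n = 0;
        for (double pos = 0.0;
             pos + 1.0 < (double)in_len && n < out_max;
             pos += ratio, n++) {
            size_t i = (size_t)pos;
            double frac = pos - (double)i;
            out[n] = (float)((1.0 - frac) * in[i] + frac * in[i + 1]);
        }
        return n;   /* number of output samples written */
    }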

 
> > Nope, although schedulers are similar to state machines, they have one
> > big draw back in our case: they have no idea about the dataflow within
> > the "tasks" they run. While we exactly know what's going to happen and
> > thus can optimize on this.
> What you are saying is that instead of decoding frame 100,101,102 and then
> filtering frame 100,101,102 (which is reasonable from a scheduler point of
> view), you decode and filter 100, then 101 and then 102, hoping for cache
> benefits.

Not hoping, using them :)
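
In loop form (stub names, just to make the difference explicit):

    /* stubs standing in for the real stages */
    static void decode(int frame) { (void)frame; /* ... */ }
    static void filter(int frame) { (void)frame; /* ... */ }

    /* scheduler-style: by the time filter() sees a frame, its
     * pixels have long been evicted from the cache */
    static void batched(void)
    {
        for (int f = 100; f <= 102; f++) decode(f);
        for (int f = 100; f <= 102; f++) filter(f);
    }

    /* mplayer-style: filter() reads the pixels decode() just left
     * in the cache */
    static void interleaved(void)
    {
        for (int f = 100; f <= 102; f++) {
            decode(f);
            filter(f);
        }
    }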

 
> Ok, but in the "I have to wait 2ms and then output an audio frame" scenario,
> wouldn't having some video decoding interrupted by the audio player and then
> resumed on trashed caches be better than waiting 2 ms doing nothing? The video
> decoder will resume with cache misses but part of the work would be already done.
> (here, we are not considering the fact that the system can run another process
> in those 2ms).

Yes, we actually lose the advantage of doing work in the time we
sleep. But that is not so bad as long as we only get one or two delayed
frames and keep the audio buffer filled. Normal people don't notice if
one or two frames are shown 10ms too late, as long as the following
ones are on time again.
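
The usual audio-master sync loop, sketched in C (names made up; the
real thing keeps more state):

    #include <unistd.h>

    extern double audio_clock(void);    /* seconds of audio played */
    extern double frame_pts;            /* when the frame is due */
    extern void   show_frame(void);

    static void sync_video(void)
    {
        double delay = frame_pts - audio_clock();
        if (delay > 0.002)              /* the "wait 2ms" case */
            usleep((unsigned)(delay * 1e6));
        show_frame();                   /* even if slightly late */
    }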

> > And dont forget that RAM is slow compared to L2 cache.
> 
> Maybe words like speed and slow are ambiguous, because they can refer to
> latency or bandwidth.
> I'd say that memory is very (latency)slow but not too (bandwidth)slow.

Yes and no. As Rich already said, RAM is slow on both counts. And we
have yet another problem besides the limited clock frequency of RAM
access: the maximum burst size. You cannot transfer a few MBs in one
block; you have to split it into (iirc) 32-word (bus width) sized
chunks. But then again, we are mostly limited by latency anyway.

> Reading 262144 bytes in random order from a hot 512KiB cache is a lot
> faster than doing that directly from memory, but maybe reading 262144
> bytes in sequence order from memory is not excessively slow compared
> to a read from cache.

It is. Cached architectures don't read whole blocks at once, they read
cache lines at once. Thus if we access a whole block, we transfer one
cache line at a time: read it, fetch the next cache line, and so on.
The only case where really large chunks are read directly from RAM is
when using some sort of DMA, and even then we are limited by the
maximum burst size.
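
You can see both effects with a crude microbenchmark (sizes and timing
method are arbitrary): chasing a random pointer cycle is bound by RAM
latency, the sequential pass is bound by bandwidth and helped by the
hardware prefetch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)   /* 4M pointers, far beyond any L2 */

    int main(void)
    {
        size_t *next = malloc(N * sizeof(*next));
        if (!next)
            return 1;

        /* Sattolo's algorithm: one big random cycle, so every load
         * depends on the previous one (rand() is crude, but good
         * enough for a sketch) */
        for (size_t i = 0; i < N; i++)
            next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        clock_t t0 = clock();
        size_t p = 0;
        for (size_t i = 0; i < N; i++)
            p = next[p];          /* latency-bound dependent loads */
        clock_t t1 = clock();

        size_t sum = 0;
        for (size_t i = 0; i < N; i++)
            sum += next[i];       /* sequential, prefetch-friendly */
        clock_t t2 = clock();

        printf("random %ld, sequential %ld ticks (p=%zu sum=%zu)\n",
               (long)(t1 - t0), (long)(t2 - t1), p, sum);
        free(next);
        return 0;
    }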

> Current memories/chipset/processors are optimized for bandwidth nowadays
> (for the simple reason that killing latency is too hard so they put
> some caches and hope the working set is small enough to fit in).
> I'm referring to wide buses and all the prefetch and look-ahead tricks
> the hardware usually does.

And they don't really help in our case unless you design your software
carefully to make use of those caches.
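
Explicit prefetching does exist, but it only pays off if you know your
access pattern; a sketch with gcc's __builtin_prefetch (the distance
of 512 bytes is a guess and would need tuning per CPU):

    #include <stdint.h>
    #include <stddef.h>

    static void brighten(uint8_t *pix, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            /* once per cache line, ask for data 512 bytes ahead so
             * the load overlaps the work on the current pixels */
            if ((i & 63) == 0 && i + 512 < n)
                __builtin_prefetch(pix + i + 512, 1);  /* 1 = write */
            pix[i] = (uint8_t)(pix[i] + 16);  /* stand-in filter */
        }
    }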

> As a DVD frame is 622kB, at 25 fps we have 15MB/s which is not a
> significant part of the available RAM bandwidth.
> Estimating 15MB/s of writes from the decoder, 15MB/s of reads and 15MB/s
> of writes of one filter and 15 MB/s of reads to go to the video card
> we have "only" 60MB/s of highly sequential traffic. Today the RAM peak
> performance is measured in GB/s, right?

Yeah... and tomorrow it will be in TB/s... but we live today and have
to design software carefully enough that it still works on the
computers of yesterday. As I said before, it's not hard to write
software for the case where you have plenty of resources.

> L2 cache is important for mpeg quantization tables and similar things,
> sure, but for raw streaming data?

We never pass around raw data; data is always processed at some point.
And even then, PCs are not streaming machines. They are reasonably good
at plain number crunching (i.e. calculating Pi and stupid stuff like
that), but bad at most real-world applications. The only thing that
makes PCs usable is that they are cheap, much cheaper than any
specialized hardware can be.

> Didn't hardware designers come up with instructions to read/write memory
> directly bypassing the cache, explicit prefetching...? It was said
> that it's better to keep code and tables in the cache than thrash
> everything with pixels from a frame that will not fit entirely in the
> cache anyway.

See the anecdote above.
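
(Those instructions do exist, for what it's worth: on x86 the
non-temporal moves like MOVNTQ, plus the PREFETCH family. A sketch
with SSE2 intrinsics, illustrative only and assuming a 16-byte aligned
destination; iirc MPlayer's fast memcpy does the same thing in
hand-written asm:)

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stddef.h>

    /* copy a line of pixels past the cache, so the destination does
     * not evict code and tables */
    static void copy_line_nt(uint8_t *dst, const uint8_t *src,
                             size_t n)
    {
        size_t i;
        for (i = 0; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
            _mm_stream_si128((__m128i *)(dst + i), v);
        }
        for (; i < n; i++)   /* tail bytes */
            dst[i] = src[i];
        _mm_sfence();        /* order the streaming stores */
    }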

> A similar issue is debated on the kernel level; why should we try to
> cache gigabytes from VOB files during playback and discard all the
> things which could be useful in the future (libraries, config files,
> tmp files, ...)? See madvise(MADV_DONTNEED).

Well... this somewhat requires an AI with a crystal ball, or an
additional flag that is not conforming to any standard.
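
(Well, posix_fadvise(POSIX_FADV_DONTNEED) did make it into POSIX in
2001, but whether a kernel actually honors the hint is another story.
A sketch:)

    #define _XOPEN_SOURCE 600   /* for posix_fadvise */
    #include <fcntl.h>

    /* after streaming a chunk of the VOB, hint that its pages will
     * not be reused; it is only a hint, errors can be ignored */
    static void drop_streamed_chunk(int fd, off_t off, off_t len)
    {
        posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
    }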

 
 			Attila Kinali



