[NUT-devel] CVS: main/DOCS/tech mpcf.txt,1.117,1.118

Fri Mar 3 01:36:25 CET 2006

On Fri, Mar 03, 2006 at 01:12:55AM +0100, Michael Niedermayer wrote:
> Hi
> 
> On Thu, Mar 02, 2006 at 06:11:17PM -0500, Rich Felker wrote:
> [...]
> > > > my proposed header compression, which has negligible complexity, would reduce
> > > > the overhead by ~1% and was rejected based on nonexistent kernel and demuxer
> > > > architectures
> > > 
> > > Scratch kernel; the kernel architecture for it already exists. It's in
> > > POSIX and called posix_madvise. There is no demuxer to do zerocopy
> > > demuxing, but in the case where decoded frames fit in L2 cache easily,
> > > but the compressed frame is very large (i.e. high quality, high
> > > bitrate files -- the very ones where performance is a problem)
> > > zerocopy will make a significant improvement to performance.
> > > Sacrificing this to remove 1% codec overhead in crappy codecs is not a
> > > good tradeoff IMO. It would be easier to just make "MN custom MPEG4"
> > > codec that doesn't have the wasted bytes to begin with...
> > 
> > One other thing with this that I forgot to mention: it would be
> > possible to support zerocopy for non-"header-compressed" files even if
> > header compression were supported. My reason for not wanting to have
> > this option was that it forces any demuxer with zerocopy support to
> > also have a duplicate demuxing system for the other case. If this can
> > be shown not to be a problem (i.e. a trivial way to support both
> > without significant additional code or slowdown) I'm not entirely
> > opposed to the idea.
> 
> here are a few random problems you will have with this zero copy demuxing
> all solvable sure but it's a lot of work for very questionable gain

IMO the gain is not very questionable. Cutting out 25-50k of data
that's moving through the cache per frame could make a significant
difference to performance. And for rawvideo it could be even more
extreme. (Naturally some filters will require alignment/aligned
stride and thus copying, but direct playback should not.)

> * some bitstream readers in lavc have strict alignment requirements, frames
>   cannot be aligned with zerocopy

With a nice component system expressing alignment requirements, stride
requirements, etc. for all frames and not treating decoded frames
differently, this would be handled automatically. In any case,
high-efficiency codecs have no word alignment (sometimes not even byte
alignment?) so I doubt this is an issue for the ones that matter.

> * the vlc decoding of all mpeg and h26x codecs in lavc needs a bunch of
>   zero bytes at the end to guarantee error detection before segfaulting

:(

> * several (not few) codecs write into the bitstream buffer either to fix
>   big-little endian stuff or in at least one case reverse some lame
>   obfuscation of a few bytes

This is probably a bad approach, for many reasons.

> * having the bitstream initially not in the L2 cache (i think that would
>   be the case if you read by dma/busmastering) will mean that accesses to
>   the uncompressed frame and bitstream will be interleaved, today's ram
>   is optimized for sequential access, thus making the already slowest part
>   even slower

You can use prefetch instructions if needed.
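
A minimal sketch of what that could look like, assuming GCC or Clang
(__builtin_prefetch is a compiler builtin, not standard C). The function
name, the 64-byte step and the PREFETCH_DIST value are made up for
illustration; a real bitstream reader would tune them per CPU, and
whether the hint helps at all has to be measured.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DIST 256 /* assumed look-ahead in bytes, not a measured value */

/* Walk a buffer sequentially while hinting the cache about upcoming data.
 * Summing the bytes stands in for real bitstream parsing. */
unsigned long sum_with_prefetch(const unsigned char *buf, size_t len)
{
    unsigned long sum = 0;
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        /* Once per 64 bytes, request the data PREFETCH_DIST bytes ahead.
         * The bounds check keeps the pointer inside the buffer.
         * Arguments: address, 0 = read access, 0 = low temporal locality. */
        if ((i & 63) == 0 && len - i > PREFETCH_DIST)
            __builtin_prefetch(buf + i + PREFETCH_DIST, 0, 0);
        sum += buf[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without
it; only the timing can change.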

> * and yeah the whole buffer management with zerocopy will be a nightmare
>   especially for a generic codec-muxer architecture where codec and muxer
>   could run with a delay or on different threads

There is no buffer management on a 64-bit system. You just mmap the
whole file. For 32-bit you'll have to lock things and update the map
when you hit the address space limit.
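
Roughly, for the 64-bit case, something like the sketch below. It uses
only POSIX calls (open, fstat, mmap, posix_madvise, munmap); the struct
and function names are hypothetical and error handling is kept to the
bare minimum. The 32-bit remapping case is not covered.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical type: a read-only view of an entire file. */
struct file_map {
    const unsigned char *base;
    size_t len;
};

/* Map the whole file read-only. Returns 0 on success, -1 on failure. */
int map_whole_file(const char *path, struct file_map *m)
{
    struct stat st;
    void *p;
    int fd;

    assert(path != NULL && m != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;
    /* Advisory only: tell the kernel we expect sequential access. */
    (void)posix_madvise(p, (size_t)st.st_size, POSIX_MADV_SEQUENTIAL);
    m->base = p;
    m->len = (size_t)st.st_size;
    return 0;
}

/* "Demux" a frame without copying: return a pointer into the mapping,
 * or NULL if the requested range lies outside the file. */
const unsigned char *frame_ptr(const struct file_map *m, size_t off, size_t size)
{
    if (off > m->len || size > m->len - off)
        return NULL;
    return m->base + off;
}

void unmap_file(struct file_map *m)
{
    munmap((void *)m->base, m->len);
    m->base = NULL;
    m->len = 0;
}
```

The pointer from frame_ptr is what would be handed to the decoder. It
stays valid until unmap_file, which is why no separate buffer pool is
needed in this model.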

> basically my opinion on this is that it's like the video filter architecture
> very strict idealistic goals which may or may not be all achievable at the
> same time but which almost certainly will never be implemented as the code
> is too complex and too many things depend on too many

IMO it's easy to implement (easier than an efficient onecopy system)
-- it's just a single mmap. The strange (mis)behavior by various
codecs is problematic, but it could possibly be solved too.

BTW even if the source is not mmapped, readonly memory, there are
still optimizations to be made by the demuxer exporting the same
memory that was passed into it, the same as MPI_EXPORT stuff in
mplayer. Whether we'll actually do any of this in the near future is
of course doubtful, but in the long term it should be possible,
especially as codecs get more and more performance-intensive and we
need to work harder and harder to squeeze out maximum performance. I
don't want to preclude this with NUT.

However, again, like I said, if you believe that it would be possible
to support both ways without a performance/complexity penalty over the
zerocopy-only implementation, I'm willing to reconsider your 'header
compression' idea.
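
To make the question concrete, here is a rough sketch of a demuxer
entry point that serves both cases. Everything in it is hypothetical
(struct packet, get_packet, release_packet are not part of any existing
NUT or lavf API), and it assumes header compression means a fixed
prefix is stripped from each stored frame and has to be put back before
decoding.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical packet descriptor; not part of any real NUT API. */
struct packet {
    const unsigned char *data; /* bytes the decoder should read */
    size_t size;
    unsigned char *owned;      /* non-NULL if data was copied and must be freed */
};

/* stored/stored_len: frame bytes as they sit in the (mapped) file.
 * prefix/prefix_len: bytes elided from every frame by header compression;
 * prefix_len == 0 means the stream is not header-compressed.
 * Returns 0 on success, -1 on allocation failure. */
int get_packet(struct packet *pkt,
               const unsigned char *stored, size_t stored_len,
               const unsigned char *prefix, size_t prefix_len)
{
    assert(pkt != NULL);
    if (prefix_len == 0) {
        /* Zerocopy path: export the file memory directly. */
        pkt->data = stored;
        pkt->size = stored_len;
        pkt->owned = NULL;
        return 0;
    }
    /* Copy path: the frame is not contiguous in the file, so it has to
     * be reassembled in a scratch buffer. */
    pkt->owned = malloc(prefix_len + stored_len);
    if (pkt->owned == NULL)
        return -1;
    memcpy(pkt->owned, prefix, prefix_len);
    memcpy(pkt->owned + prefix_len, stored, stored_len);
    pkt->data = pkt->owned;
    pkt->size = prefix_len + stored_len;
    return 0;
}

void release_packet(struct packet *pkt)
{
    free(pkt->owned); /* free(NULL) is a no-op on the zerocopy path */
    pkt->owned = NULL;
    pkt->data = NULL;
    pkt->size = 0;
}
```

The branch itself is small, but the copy path brings back the buffer
ownership and lifetime handling that the zerocopy-only design avoids,
and that is the cost that would need to be shown to be negligible.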

Rich