[NUT-devel] CVS: main/DOCS/tech mpcf.txt,1.117,1.118

Michael Niedermayer michaelni at gmx.at
Fri Mar 3 15:09:15 CET 2006


Hi

On Thu, Mar 02, 2006 at 07:36:25PM -0500, Rich Felker wrote:
> On Fri, Mar 03, 2006 at 01:12:55AM +0100, Michael Niedermayer wrote:
> > Hi
> > 
> > On Thu, Mar 02, 2006 at 06:11:17PM -0500, Rich Felker wrote:
> > [...]
> > > > > my proposed header compression, which has negligible complexity, would
> > > > > reduce the overhead by ~1% and was rejected based on nonexistent kernel
> > > > > and demuxer architectures
> > > > 
> > > > Scratch kernel; the kernel architecture for it already exists. It's in
> > > > POSIX and called posix_madvise. There is no demuxer to do zerocopy
> > > > demuxing, but in the case where decoded frames fit in L2 cache easily,
> > > > but the compressed frame is very large (i.e. high quality, high
> > > > bitrate files -- the very ones where performance is a problem)
> > > > zerocopy will make a significant improvement to performance.
> > > > Sacrificing this to remove 1% codec overhead in crappy codecs is not a
> > > > good tradeoff IMO. It would be easier to just make "MN custom MPEG4"
> > > > codec that doesn't have the wasted bytes to begin with...
> > > 
> > > One other thing with this that I forgot to mention: it would be
> > > possible to support zerocopy for non-"header-compressed" files even if
> > > header compression were supported. My reason for not wanting to have
> > > this option was that it forces any demuxer with zerocopy support to
> > > also have a duplicate demuxing system for the other case. If this can
> > > be shown not to be a problem (i.e. a trivial way to support both
> > > without significant additional code or slowdown) I'm not entirely
> > > opposed to the idea.
> > 
> > here are a few random problems you will have with this zero-copy demuxing;
> > all solvable, sure, but it's a lot of work for very questionable gain
> 
> IMO the gain is not very questionable. Cutting out 25-50k of data

Rich, your opinion on how much gain something has is about as
correlated with reality as (sign(gain + TINY_VAL*random()) * HUGE_VAL)
:)

So I surely agree that there will be a gain in some cases, maybe most
cases, but I don't agree at all about its magnitude. IMHO it's <1%,
which is not enough to justify the huge rewrite-the-world crusade for
me, not to mention the significantly higher complexity of the
resulting architecture.


> that's moving through the cache per frame could make a significant
> difference to performance. And for rawvideo it could be even more
> extreme. (Naturally some filters will require alignment/aligned
> stride and thus copying, but direct playback should not.)

I am still in favor of fread() into the hw video buffer for rawvideo ...
not to mention that rawvideo is an irrelevant and rare case where a
few percent of speed won't matter. If I seriously needed fast rawvideo
playback I'd write a small special-purpose player for it, not rewrite
a generic multimedia architecture to be able to handle it better.
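
A rough sketch of what I mean; the vo_buffer pointer and the frame size
are made up here, a real player would get the pointer from its video
output (mmap'd framebuffer, XvImage, ...):

    #include <stdio.h>
    #include <stdlib.h>

    /* sketch: one fread() per frame, straight into the video output
     * buffer, so the player never owns an intermediate copy of the
     * frame.  vo_buffer stands in for whatever pointer the VO exposes;
     * the 640x480 YUYV frame size is just an example. */
    static int read_raw_frame(FILE *f, unsigned char *vo_buffer, size_t n)
    {
        return fread(vo_buffer, 1, n, f) == n ? 0 : -1;
    }

    int main(int argc, char **argv)
    {
        const size_t frame_size = 640 * 480 * 2;       /* YUYV 4:2:2 */
        unsigned char *vo_buffer = malloc(frame_size); /* stand-in for the real VO buffer */
        FILE *f = argc > 1 ? fopen(argv[1], "rb") : NULL;

        if (f && vo_buffer)
            while (read_raw_frame(f, vo_buffer, frame_size) == 0)
                ;                                      /* display would go here */

        free(vo_buffer);
        if (f)
            fclose(f);
        return 0;
    }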


> 
> > * some bitstream readers in lavc have strict alignment requirements, frames
> >   cannot be aligned with zerocopy
> 
> With a nice component system expressing alignment requirements, stride
> requirements, etc. for all frames and not treating decoded frames
> differently, this would be handled automatically. In any case,
> high-efficiency codecs have no word alignment (sometimes not even byte
> alignment?) so I doubt this is an issue for the ones that matter.

Current lavc will segfault with almost all codecs on some CPUs if you feed
unaligned buffers into it. This can be fixed in lavc relatively easily for
most of them, but it nicely shows how many people do such weird things.
IMHO the whole zerocopy thing is idiotic; it's like the "a single-threaded
player is always superior" rule. There's no question that fewer copies,
fewer threads and less synchronization between threads is better, but it's
not like that can be changed in isolation; other things depend on it, and
the 1% you gain here might cause a 50% loss somewhere else.
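
For reference, this is roughly what a conventional copying demuxer does
to keep such decoders happy; the 16-byte alignment and 16 bytes of
padding are illustrative, not the exact values any particular lavc
version requires:

    #define _POSIX_C_SOURCE 200112L
    #include <stdlib.h>
    #include <string.h>

    /* sketch: copy a packet into an aligned, zero-padded buffer before
     * handing it to the decoder -- exactly the copy that zerocopy wants
     * to avoid. */
    static unsigned char *make_decoder_buffer(const unsigned char *pkt,
                                              size_t size)
    {
        unsigned char *buf;

        if (posix_memalign((void **)&buf, 16, size + 16))
            return NULL;

        memcpy(buf, pkt, size);
        memset(buf + size, 0, 16); /* padding for overreading bitstream readers */
        return buf;
    }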


[...]

> 
> > * several (not few) codecs write into the bitstream buffer either to fix
> >   big-little endian stuff or in at least one case reverse some lame
> >   obfuscation of a few bytes
> 
> This is probably a bad approach, for many reasons..

I fully agree, but it's still the way it's done currently ...


> 
> > * having the bitstream initially not in the L2 cache (i think that would
> >   be the case if you read by dma/busmastering) will mean that accesses to
> >   the uncompressed frame and bitstream will be interleaved; today's ram
> >   is optimized for sequential access, thus making the already slowest part
> >   even slower
> 
> You can use prefetch instructions if needed.

Won't help, and won't work (I tried this when playing with memcpy). One
thing which would work is to do a dummy read pass over the bitstream
buffer to force it into the cache; the difference to copying it into
another spot would then be quite negligible, as the code is limited by
memory speed and the writes wouldn't cost anything. The only thing you
lose is a little cache thrashing, and whether that has any significance
in practice is doubtful IMO.
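
A rough sketch of such a dummy read pass; the 64-byte cache-line size
is an assumption and of course CPU dependent:

    #include <stddef.h>

    /* sketch: touch one byte per cache line so the whole bitstream
     * buffer is pulled into the cache before the decoder starts
     * interleaving accesses to it with accesses to the uncompressed
     * frame. */
    static void prefault_buffer(const unsigned char *buf, size_t size)
    {
        volatile unsigned char sink = 0;
        size_t i;

        for (i = 0; i < size; i += 64)
            sink ^= buf[i];      /* the read forces the line into cache */
        (void)sink;
    }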


> 
> > * and yeah the whole buffer management with zerocopy will be a nightmare
> >   especially for a generic codec-muxer architecture where codec and muxer
> >   could run with a delay or on different threads
> 
> There is no buffer management on a 64bit system. You just mmap the
> whole file. For 32bit you'll have to lock things and update the map
> when you hit the address space limit.

You can't just update the map when you hit the end; some packets might
still be in various buffers/queues, maybe a buffer in a muxer, maybe
a decoder, ...
Then there are non-interleaved files and seeking, in which cases
a pure mmap variant on 32-bit seems problematic.
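
For reference, the 64-bit mmap approach under discussion would look
roughly like this; the hard parts (windowed remapping on 32 bit,
keeping mappings alive while packets sit in codec/muxer queues) are
exactly what is left out here:

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* sketch: map the whole file and hand out pointers into the
     * mapping as packets instead of copying packet data. */
    static unsigned char *map_whole_file(const char *path, size_t *size_out)
    {
        struct stat st;
        unsigned char *base;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return NULL;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return NULL;
        }

        base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);               /* the mapping stays valid after close() */
        if (base == MAP_FAILED)
            return NULL;

        /* hint the kernel that reads will be mostly sequential */
        posix_madvise(base, st.st_size, POSIX_MADV_SEQUENTIAL);

        *size_out = st.st_size;
        return base;             /* demuxed packets point into this */
    }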

But don't hesitate to implement it; after it exists, works, has been
benchmarked and is faster, I will happily demonstrate how header
compression can be done without any speed loss.
I mean, if we are already rewriting the whole demuxer architecture and
fixing 10 different "issues" in lavc, what's the big problem with
passing 2 bitstream buffers instead of one into the decoder? The first
would be just the startcode and/or header, so only the header parsing
would need to use a slower bitstream reader ...
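
A rough sketch of what such a two-buffer call could carry; the struct
and function names are invented for illustration, this is not an
existing lavc API:

    #include <stdlib.h>
    #include <string.h>

    /* sketch: hand the decoder the (reconstructed) startcode/header and
     * the untouched payload as two separate buffers; only the small
     * header part needs the slower, alignment-tolerant bitstream
     * reader, the payload can stay zerocopy in the file mapping. */
    struct split_packet {
        const unsigned char *header;   /* startcode and/or expanded header */
        size_t header_size;
        const unsigned char *payload;  /* e.g. a pointer into the mmap'd file */
        size_t payload_size;
    };

    /* fallback for decoders that only take one contiguous buffer:
     * concatenate the two pieces (which of course reintroduces the copy) */
    static unsigned char *flatten_packet(const struct split_packet *pkt,
                                         size_t *size)
    {
        unsigned char *buf = malloc(pkt->header_size + pkt->payload_size);

        if (!buf)
            return NULL;
        memcpy(buf, pkt->header, pkt->header_size);
        memcpy(buf + pkt->header_size, pkt->payload, pkt->payload_size);
        *size = pkt->header_size + pkt->payload_size;
        return buf;
    }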

[...]

-- 
Michael



