[NUT-devel] CVS: main/DOCS/tech mpcf.txt,1.117,1.118

Rich Felker dalias at aerifal.cx
Fri Mar 3 16:28:50 CET 2006


On Fri, Mar 03, 2006 at 03:09:15PM +0100, Michael Niedermayer wrote:
> > > here are a few random problems you will have with this zero copy demuxing
> > > all solvable, sure, but it's a lot of work for very questionable gain
> > 
> > IMO the gain is not very questionable. Cutting out 25-50k of data
> 
> rich your opinion on how much gain something has is about as much
> correlated with reality as (sign(gain + TINY_VAL*random()) * HUGE_VAL)
> :)
> 
> so i surely agree that there will be a gain in some cases, maybe most
> cases but i don't agree at all about its magnitude, IMHO it's <1%
> which is not enough for the huge rewrite-the-world crusade for me

Critics say the same thing about my libc replacement, and then when I
actually test, memory usage by typical apps drops by 75-90% and
performance increases one-thousand-fold for some simple C functions.
The glibc-lovers of course still won't shut up after the testing.
They'll claim that 100k per process is "small" and does not matter
even when it's 50-75% of the memory used, and that everyone should be
using a 2GHz machine bought with ten years' wages if they want to be
able to write in their own language.

> not to mention the significantly higher complexity of the resulting
> architecture

In some ways the architecture is simpler than what we have now, which
is full of hacks. In any case, a new architecture is NECESSARY for
h264, since right now we're killing performance by blitting a frame
5 or more frames after it was decoded, long after it has been evicted
from the cache... :(

> > that's moving through the cache per frame could make a significant
> > difference to performance. And for rawvideo it could be even more
> > extreme. (Naturally some filters will require alignment/aligned
> > stride and thus copying, but direct playback should not.)
> 
> i am still in favor of fread() into the hw video buffer for rawvideo ...

I hope you mean read(). fread() will inherently be very slow, since it
normally copies everything through stdio's internal buffer first.
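
To make the difference concrete, here is a minimal sketch (a
hypothetical helper, not MPlayer code) of reading a raw frame straight
into the destination buffer with read(); fread() would normally pull
the data into stdio's buffer first and then copy it again into yours:

#include <unistd.h>
#include <errno.h>
#include <stddef.h>

/* Read exactly frame_size bytes from fd into dest (e.g. a video
 * buffer), with no intermediate buffering. */
static int read_frame(int fd, unsigned char *dest, size_t frame_size)
{
    size_t done = 0;
    while (done < frame_size) {
        ssize_t n = read(fd, dest + done, frame_size - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal, retry */
            return -1;           /* real I/O error */
        }
        if (n == 0)
            return -1;           /* unexpected end of file */
        done += (size_t)n;
    }
    return 0;
}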

> not to mention that rawvideo is an irrelevant and rare case where a
> few percent speed won't matter, if i seriously need fast rawvideo
> playback i'd write a small special-purpose player for it, not rewrite
> a generic multimedia architecture to be able to handle it better

A good generic architecture will already support this as a consequence
of other things it needs to support for performance.

> current lavc will segfault with almost all codecs on some cpus if you feed
> unaligned buffers into it, this can be fixed in lavc for most relatively easily
> but it nicely shows how many people do such weird things, IMHO the whole
> zerocopy thing is idiotic, it's like the "single-threaded player is always
> superior" rule, there's no question that fewer copies, fewer threads and
> less synchronization between threads is better, but it's not like that
> could be changed in isolation, other things depend on it and the 1%
> you gain here might cause a 50% loss somewhere else

Perhaps you'd like to demonstrate that mplayer is only 1% faster than
the competition? Last I checked it was more like 10-200%, depending on
which other player you're comparing to. Naturally this is not a result
of being non-threaded by itself. It involves many factors, which
include a reduction in the number of wasteful copies, lack of need for
thread synchronization, and many other things I have no idea about.
But you know as well as anyone else in ffmpeg development that many
small things add up to a huge performance advantage over the
competition.

> > > * having the bitstream initially not in the L2 cache (i think that would
> > >   be the case if you read by dma/busmastering) will mean that accesses to
> > >   the uncompressed frame and bitstream will be interleaved, today's ram
> > >   is optimized for sequential access, thus making the already slowest part
> > >   even slower
> > 
> > You can use prefetch instructions if needed.
> 
> won't help, and won't work (i tried this when playing with memcpy), one thing
> which would work is to do a dummy read pass over the bitstream buffer to
> force it into the cache, the difference to copying it into another spot
> then would be quite negligible, the code is limited by the mem speed, the
> writes wouldn't cost anything, the only thing you lose is a little cache
> thrashing, whether that has any significance in practice is doubtful IMO

IMO it can easily be tested. Just write 20-40k of random crap to some
unused memory buffer while decoding a video that barely fits in cache
and watch the change in performance.
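
A minimal sketch of that experiment (the names and the 32k size are
just assumptions for illustration): once per decoded frame, scribble
pseudo-random bytes over a buffer the decoder never otherwise touches,
then compare frame times with and without the call. If the slowdown is
negligible, the cache cost of an extra per-packet copy really is in
the noise; if not, this puts a number on it.

#include <stdint.h>
#include <stddef.h>

#define TRASH_SIZE (32 * 1024)   /* roughly one frame's worth of bitstream */

static uint8_t trash_buf[TRASH_SIZE];

/* Call once per decoded frame; nothing ever reads trash_buf, so the
 * only effect is evicting ~32k of the working set from the cache. */
static void trash_cache(uint32_t seed)
{
    size_t i;
    for (i = 0; i < TRASH_SIZE; i++) {
        seed = seed * 1664525u + 1013904223u;  /* cheap LCG "random crap" */
        trash_buf[i] = (uint8_t)seed;
    }
}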

> > > * and yeah the whole buffer management with zerocopy will be a nightmare
> > >   especially for a generic codec-muxer architecture where codec and muxer
> > >   could run with a delay or on different threads
> > 
> > There is no buffer management on a 64bit system. You just mmap the
> > whole file. For 32bit you'll have to lock things and update the map
> > when you hit the address space limit.
> 
> you can't just update the map when you hit the end, some packets might
> still be in various buffers/queues, maybe a buffer in a muxer, maybe
> a decoder, ...

All you need is a pointer keeping track of the earliest point in the
stream still needed. You don't unmap the whole map, just the part
before this point. It's a basic variable-size circular buffer
implementation (which I was planning to support in a very generic way
in a next-gen player), except with munmap/mmap instead of realloc.
IIRC it's even possible to share this between threads/processes
without any additional locking mechanisms.
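
A minimal sketch of that scheme (hypothetical names, error handling
omitted): keep a page-aligned window of the file mapped, and whenever
the "earliest byte still needed" pointer advances, munmap() the whole
pages behind it, exactly the way a circular buffer recycles its tail.
On 64bit you can map the whole file up front and only ever run the
release step; on 32bit you additionally map a further chunk of the
file whenever the window reaches the end of the current mapping.

#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

struct stream_map {
    uint8_t *base;       /* start of the currently mapped window   */
    uint64_t base_off;   /* file offset that 'base' corresponds to */
    size_t   map_len;    /* length of the current mapping          */
    int      fd;
};

/* Everything before file offset 'needed' may be released.  Assumes
 * base_off <= needed <= base_off + map_len. */
static void release_before(struct stream_map *m, uint64_t needed)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t drop = (size_t)((needed - m->base_off) / page * page);

    if (drop == 0)
        return;
    munmap(m->base, drop);   /* hand the "tail" of the buffer back */
    m->base     += drop;
    m->base_off += drop;
    m->map_len  -= drop;
}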

> then there are non-interleaved files and seeking, in which cases
> a pure mmap variant on 32bit seems problematic

No, the non-interleaved case is easy. You simply treat it the same as
-audiofile, i.e. open the file twice and handle the audio and video
parts separately. This is needed anyway to make -cache work. The
special-casing for non-interleaved AVI in MPlayer is a stupid hack.
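
A trivial sketch of what that means in practice (hypothetical helper):
the same file is simply opened twice, so the audio and video demuxers
each keep their own read position and their own cache/mmap window, and
non-interleaved files stop being a special case.

#include <fcntl.h>

static int open_streams(const char *path, int *video_fd, int *audio_fd)
{
    *video_fd = open(path, O_RDONLY);  /* video demuxer seeks/reads here */
    *audio_fd = open(path, O_RDONLY);  /* audio demuxer seeks/reads here */
    return (*video_fd >= 0 && *audio_fd >= 0) ? 0 : -1;
}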

> but don't hesitate to implement it, after it exists, works, has been
> benchmarked and is faster i will happily demonstrate how header compression
> can be done without any speed loss
> i mean if we are already rewriting the whole demuxer architecture, fixing
> 10 different "issues" in lavc, what's the big problem with passing 2
> bitstream buffers instead of one into the decoder? the first would
> be just the startcode and/or header, so only the header parsing would
> need to use a slower bitstream reader ...

The intent was not to modify the codecs with special hacks to support
this, but to find a way to make sure 'onecopy on top of zerocopy' is
as fast as ordinary demuxing with a single copy. The problem is that
if a demuxer implements zerocopy, its input buffer might not actually
be the filesystem cache buffer, depending on the player's
implementation; the data may already have been copied once. That
costs nothing extra as long as the demuxer guarantees it will always
output the same buffer it was given, but it becomes a problem if the
demuxer might copy the data into yet a third buffer.

Maybe it's acceptable just to have the additional overhead in the
demuxer for treating the two cases separately.
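
For what it's worth, the two cases could look something like this
(hypothetical types, not an existing player API): a packet either
points straight into the shared input window (zero-copy) or owns a
private copy, and a flag records which, so the rest of the pipeline
knows whether the one allowed copy has already been spent.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct demux_packet {
    uint8_t *data;
    size_t   len;
    int      owns_data;  /* 0: slice of the input window, 1: private copy */
};

/* Zero-copy path: hand out a slice of the input buffer as-is. */
static void packet_from_input(struct demux_packet *pkt, uint8_t *in, size_t len)
{
    pkt->data = in;
    pkt->len = len;
    pkt->owns_data = 0;
}

/* Fallback path: the extra "third buffer" copy discussed above. */
static int packet_copy(struct demux_packet *pkt, const uint8_t *in, size_t len)
{
    pkt->data = malloc(len);
    if (!pkt->data)
        return -1;
    memcpy(pkt->data, in, len);
    pkt->len = len;
    pkt->owns_data = 1;
    return 0;
}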

Anyway, I don't want to argue and flame over this. If you really want
the header compression, please either come up with a good solution or
just say that you insist on deferring that question until someone
implements zerocopy. If the latter is the case, then go ahead and do
it. It's ok with me. Just please don't bash and flame the whole
zerocopy concept, which is not the subject at hand.

Rich



