
On Wed, Mar 01, 2006 at 03:19:40PM +0100, Michael Niedermayer CVS wrote:
CVS change done by Michael Niedermayer CVS
Update of /cvsroot/mplayer/main/DOCS/tech In directory mail:/var2/tmp/cvs-serv10365
Modified Files: mpcf.txt Log Message: add forward_ptr to syncpoint (+0.006% overhead) give syncpoint and frame header their own checksums (worst-case overhead increase <0.006%) fix file structure so that extensibility is restored move index_ptr to the file end so that index packets aren't a special case with their reserved_bytes position -> all packets follow the same structure now
remove "optional" word from info packets, they are no more optional than index packets
split index packets note, this is entirely optional and a muxer which has difficulty with it can always output a single index packet
remove the "index MUST be at the file end if anywhere" rule; it's not needed anymore as index_ptr will always be at the end
info frames must be keyframes
last info frame is the most correct
comments, flames?
I'm not strongly against anything here, but it would've been better if you had shown the patch (and resolved conflicts) before committing... :/
index:
-    index_startcode    f(64)
-    forward_ptr        v
     max_pts            v
+    syncpoint_start    v
Is syncpoint_start necessary? It seems redundant to me. BTW, it is non-obvious from this spec that the index is split by syncpoints and how to use it; this should be elaborated better (by an entry for syncpoint_start, perhaps...)
+file:
+    file_id_string
+    while(bytes_left > 8){
I'm a bit weirded out by this. In a very extreme and silly example, in a truncated NUT file you could lose the last frame because it (and the frame header) was smaller than 8 bytes...
@@ -474,12 +508,15 @@
     1    is_key            if set, frame is keyframe
     2    end_of_relevance  if set, stream has no relevance on presentation. (EOR)
+    4    has_checksum      if set then the frame header contains a checksum
Like I said, this should be a NUT flag...
EOR frames MUST be zero-length and must be set keyframe. All streams SHOULD end with EOR, where the pts of the EOR indicates the end presentation time of the final frame. An EOR set stream is unset by the first content frames. EOR can only be unset in streams with zero decode_delay.
+ has_checksum must be set if the frame is larger than 2*max_distance or its
I still feel this should be a separate variable; the only reason you gave so far against it is that poor demuxers won't be able to decide... And IMO that's a very poor argument...
@@ -612,11 +645,8 @@
     that EOR. EOR is unset by the first keyframe after it.
index_ptr
-    Length in bytes of the entire index, from the first byte of the
-    startcode until the last byte of the checksum.
-    Note: A demuxer can use this to find the index when it is written at
-    EOF, as index_ptr will always be 12 bytes before the end of file if
-    there is an index at all.
+    absolute location in the file of the first byte of the startcode of the
+    first index packet, or 0 if there is no index
This would be a silly argument, but it does limit the filesize to 64 bits... Doesn't matter, it's not very different from limiting the index to 64 bits...
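The EOF-side lookup being discussed can be sketched in C. This is illustrative only: the 12-byte trailer layout (8-byte big-endian index_ptr plus 4-byte checksum) follows the old-layout note quoted above, the "absolute offset, 0 if no index" semantics follow the new diff text, and the function name is made up.

```c
#include <stdio.h>
#include <stdint.h>

/* Read index_ptr from a fixed-size trailer at EOF and return the
 * absolute offset of the first index packet, or -1 on error or if
 * there is no index (index_ptr == 0). */
static int64_t find_index_start(FILE *f)
{
    uint8_t buf[12];
    uint64_t index_ptr = 0;
    int i;

    if (fseek(f, -12, SEEK_END) || fread(buf, 1, 12, f) != 12)
        return -1;
    for (i = 0; i < 8; i++)                 /* big-endian fixed field */
        index_ptr = (index_ptr << 8) | buf[i];
    return index_ptr ? (int64_t)index_ptr : -1;
}
```

Note this is exactly the "special case right before the end" Oded objects to: the demuxer must trust a fixed offset from EOF rather than parsing forward from a startcode.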
Info tags:
----------
@@ -630,6 +660,8 @@
     file. Positive chapter_id's are real chapters and MUST NOT overlap.
     Negative chapter_id indicate a sub region of file and not a real
     chapter. chapter_id MUST be unique to the region it represents.
+    chapter_id n MUST not be used unless there are at least n chapters in the
+    file
Could you explain this, I'm confused... - ods15

Hi

On Thu, Mar 02, 2006 at 10:38:54AM +0200, Oded Shimon wrote:
On Wed, Mar 01, 2006 at 03:19:40PM +0100, Michael Niedermayer CVS wrote:
CVS change done by Michael Niedermayer CVS
Update of /cvsroot/mplayer/main/DOCS/tech In directory mail:/var2/tmp/cvs-serv10365
Modified Files: mpcf.txt Log Message: add forward_ptr to syncpoint (+0.006% overhead) give syncpoint and frame header their own checksums (worst-case overhead increase <0.006%) fix file structure so that extensibility is restored move index_ptr to the file end so that index packets aren't a special case with their reserved_bytes position -> all packets follow the same structure now
remove "optional" word from info packets, they are no more optional than index packets
split index packets note, this is entirely optional and a muxer which has difficulty with it can always output a single index packet
remove the "index MUST be at the file end if anywhere" rule; it's not needed anymore as index_ptr will always be at the end
info frames must be keyframes
last info frame is the most correct
comments, flames?
I'm not strongly against anything here, but it would've been better if you had shown the patch (and resolved conflicts) before committing... :/
I did post a patch and adapted it based on comments...
index:
-    index_startcode    f(64)
-    forward_ptr        v
     max_pts            v
+    syncpoint_start    v
Is syncpoint_start necessary? It seems redundant to me. BTW, it is non-obvious from this spec that the index is split by syncpoints and how to use it; this should be elaborated better (by an entry for syncpoint_start, perhaps...)
fixed
+file:
+    file_id_string
+    while(bytes_left > 8){
I'm a bit weirded out by this. In a very extreme and silly example, in a truncated NUT file you could lose the last frame because it (and the frame header) was smaller than 8 bytes...
cut randomly -> you very likely lose something, as it has been cut in the middle
cut after a frame -> why don't you just add 8 zero bytes then too, if you already parse things?
anyway, if you and Rich want, we could add the thing back in the index but before the reserved bytes, and store an index packet with 0 syncpoints and reserved_bytes always none at the end. Or do you have an alternative idea?
@@ -474,12 +508,15 @@
     1    is_key            if set, frame is keyframe
     2    end_of_relevance  if set, stream has no relevance on presentation. (EOR)
+    4    has_checksum      if set then the frame header contains a checksum
Like I said, this should be a NUT flag...
and I disagree
EOR frames MUST be zero-length and must be set keyframe. All streams SHOULD end with EOR, where the pts of the EOR indicates the end presentation time of the final frame. An EOR set stream is unset by the first content frames. EOR can only be unset in streams with zero decode_delay.
+ has_checksum must be set if the frame is larger than 2*max_distance or its
I still feel this should be a separate variable; the only reason you gave so far against it is that poor demuxers won't be able to decide... And IMO that's a very poor argument...
Every additional field adds complexity; what do we gain here with that field? Personally I would fix max_distance and not store it at all, as I fear people will set it to random values
@@ -612,11 +645,8 @@
     that EOR. EOR is unset by the first keyframe after it.
index_ptr
-    Length in bytes of the entire index, from the first byte of the
-    startcode until the last byte of the checksum.
-    Note: A demuxer can use this to find the index when it is written at
-    EOF, as index_ptr will always be 12 bytes before the end of file if
-    there is an index at all.
+    absolute location in the file of the first byte of the startcode of the
+    first index packet, or 0 if there is no index
This would be a silly argument, but it does limit the filesize to 64 bits... Doesn't matter, it's not very different from limiting the index to 64 bits...
If it's >64 bits, just store the 64 least significant bits, and look at the closest such point from the end -> identical to a relative ptr. Note, the alternative of putting a "v"-style index_ptr in a zero-syncpoint index at the end would solve this, though IMO the complexity is not worth it
Info tags:
----------
@@ -630,6 +660,8 @@
     file. Positive chapter_id's are real chapters and MUST NOT overlap.
     Negative chapter_id indicate a sub region of file and not a real
     chapter. chapter_id MUST be unique to the region it represents.
+    chapter_id n MUST not be used unless there are at least n chapters in the
+    file
Could you explain this, I'm confused...
Yes, we don't want to use arbitrary-precision math for chapter ids. Without this rule, someone could simply come up with the idea of using strings as chapter ids, which won't be fun for most demuxers [...] -- Michael
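The rule being defended can be read as a cheap validity check: a demuxer that tracks how many chapters it has seen never needs more than a machine integer for an id. A sketch of that check, with illustrative names (not from the spec):

```c
#include <stdint.h>

/* With "chapter_id n MUST NOT be used unless there are at least n
 * chapters", any id whose magnitude exceeds the chapter count can be
 * rejected outright, so ids always fit a fixed-size integer.
 * Negative ids mark sub-regions, per the quoted spec text. */
static int chapter_id_valid(int64_t chapter_id, int64_t chapter_count)
{
    if (chapter_id < 0)
        chapter_id = -chapter_id;
    return chapter_id <= chapter_count;
}
```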

On Thu, Mar 02, 2006 at 11:55:41AM +0100, Michael Niedermayer wrote:
On Thu, Mar 02, 2006 at 10:38:54AM +0200, Oded Shimon wrote:
On Wed, Mar 01, 2006 at 03:19:40PM +0100, Michael Niedermayer CVS wrote:
comments, flames?
I'm not strongly against anything here, but it would've been better if you had shown the patch (and resolved conflicts) before committing... :/
I did post a patch and adapted it based on comments...
Yes, but there was obviously still some controversy over some parts of the patch when you committed...
+file:
+    file_id_string
+    while(bytes_left > 8){
I'm a bit weirded out by this. In a very extreme and silly example, in a truncated NUT file you could lose the last frame because it (and the frame header) was smaller than 8 bytes...
cut randomly -> you very likely lose something, as it has been cut in the middle
cut after a frame -> why don't you just add 8 zero bytes then too, if you already parse things?
anyway, if you and Rich want, we could add the thing back in the index but before the reserved bytes, and store an index packet with 0 syncpoints and reserved_bytes always none at the end. Or do you have an alternative idea?
No real ideas, and I agree that my example was very silly. I just don't like special-casing demuxing right before the end. The demuxer should always base demuxing on the next startcode/framecode...
@@ -474,12 +508,15 @@
     1    is_key            if set, frame is keyframe
     2    end_of_relevance  if set, stream has no relevance on presentation. (EOR)
+    4    has_checksum      if set then the frame header contains a checksum
Like I said, this should be a NUT flag...
and I disagree
You lost me on this one... Is it because of coded_stream_flags? It might be a good idea to modify that so it can accept NUT flags as well, but has_checksum is very obviously a NUT flag, not a stream flag...
EOR frames MUST be zero-length and must be set keyframe. All streams SHOULD end with EOR, where the pts of the EOR indicates the end presentation time of the final frame. An EOR set stream is unset by the first content frames. EOR can only be unset in streams with zero decode_delay.
+ has_checksum must be set if the frame is larger than 2*max_distance or its
I still feel this should be a separate variable; the only reason you gave so far against it is that poor demuxers won't be able to decide... And IMO that's a very poor argument...
Every additional field adds complexity; what do we gain here with that field? Personally I would fix max_distance and not store it at all, as I fear people will set it to random values
?
if (frame_size > 2*nut->max_distance)
if (frame_size > sc->max_size)
I fail to see the added complexity... As for what we gain - more freedom for the muxer, and the ability to limit e.g. audio to a very small size without increasing overhead, and video to slightly higher to avoid additional overhead... - ods15
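The two one-line checks being compared can be put side by side. The struct and field names here just mirror the pseudo-code in the mail and are not from any real demuxer:

```c
#include <stdint.h>

struct nut_ctx    { uint64_t max_distance; };  /* global, as today */
struct stream_ctx { uint64_t max_size; };      /* proposed per-stream field */

/* Current rule: a checksum is required past a single global threshold. */
static int needs_checksum_global(const struct nut_ctx *nut, uint64_t frame_size)
{
    return frame_size > 2 * nut->max_distance;
}

/* Proposed rule: each stream carries its own limit, so e.g. audio can
 * use a tiny max_size while video uses a larger one. */
static int needs_checksum_per_stream(const struct stream_ctx *sc, uint64_t frame_size)
{
    return frame_size > sc->max_size;
}
```

Either way it is one comparison per frame; the dispute is about the extra stored field, not the demuxer-side code.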

Hi

On Thu, Mar 02, 2006 at 01:25:38PM +0200, Oded Shimon wrote: [...]
EOR frames MUST be zero-length and must be set keyframe. All streams SHOULD end with EOR, where the pts of the EOR indicates the end presentation time of the final frame. An EOR set stream is unset by the first content frames. EOR can only be unset in streams with zero decode_delay.
+ has_checksum must be set if the frame is larger than 2*max_distance or its
I still feel this should be a separate variable; the only reason you gave so far against it is that poor demuxers won't be able to decide... And IMO that's a very poor argument...
Every additional field adds complexity; what do we gain here with that field? Personally I would fix max_distance and not store it at all, as I fear people will set it to random values
?
if (frame_size > 2*nut->max_distance)
if (frame_size > sc->max_size)
I fail to see the added complexity...
As for what we gain - more freedom for the muxer, and the ability to limit e.g. audio to a very small size without increasing overhead, and video to slightly higher to avoid additional overhead...
Agreed, but only if we store all this max_size/distance stuff in u(16) so they cannot be arbitrarily large [...] -- Michael

On Thu, Mar 02, 2006 at 04:29:50PM +0100, Michael Niedermayer wrote:
Hi
On Thu, Mar 02, 2006 at 01:25:38PM +0200, Oded Shimon wrote: [...]
EOR frames MUST be zero-length and must be set keyframe. All streams SHOULD end with EOR, where the pts of the EOR indicates the end presentation time of the final frame. An EOR set stream is unset by the first content frames. EOR can only be unset in streams with zero decode_delay.
+ has_checksum must be set if the frame is larger than 2*max_distance or its
I still feel this should be a separate variable; the only reason you gave so far against it is that poor demuxers won't be able to decide... And IMO that's a very poor argument...
Every additional field adds complexity; what do we gain here with that field? Personally I would fix max_distance and not store it at all, as I fear people will set it to random values
?
if (frame_size > 2*nut->max_distance)
if (frame_size > sc->max_size)
I fail to see the added complexity...
As for what we gain - more freedom for the muxer, and the ability to limit e.g. audio to a very small size without increasing overhead, and video to slightly higher to avoid additional overhead...
Agreed, but only if we store all this max_size/distance stuff in u(16) so they cannot be arbitrarily large
NO! They really need to be arbitrarily large! If you have a file where each frame is at least 5 megs, max_distance is useless unless it's well over 5 megs!

If people set max_distance too high for their content, that just means their file will be very vulnerable to damage. This is a tradeoff that belongs in the hands of the user. If I am storing a private file where I insist on no damage and plan to throw away the file entirely and restore from backup if even 1 bit is damaged, there is no sense in me using a small max_distance.

You're welcome to make nutlint complain if max_distance is larger than 32k, but it would probably be good to avoid doing this if max_distance is still smaller than 2*average_framesize or so.

Rich

Hi

On Thu, Mar 02, 2006 at 12:31:16PM -0500, Rich Felker wrote:
On Thu, Mar 02, 2006 at 04:29:50PM +0100, Michael Niedermayer wrote:
Hi
On Thu, Mar 02, 2006 at 01:25:38PM +0200, Oded Shimon wrote: [...]
EOR frames MUST be zero-length and must be set keyframe. All streams SHOULD end with EOR, where the pts of the EOR indicates the end presentation time of the final frame. An EOR set stream is unset by the first content frames. EOR can only be unset in streams with zero decode_delay.
+ has_checksum must be set if the frame is larger than 2*max_distance or its
I still feel this should be a separate variable; the only reason you gave so far against it is that poor demuxers won't be able to decide... And IMO that's a very poor argument...
Every additional field adds complexity; what do we gain here with that field? Personally I would fix max_distance and not store it at all, as I fear people will set it to random values
?
if (frame_size > 2*nut->max_distance)
if (frame_size > sc->max_size)
I fail to see the added complexity...
As for what we gain - more freedom for the muxer, and the ability to limit e.g. audio to a very small size without increasing overhead, and video to slightly higher to avoid additional overhead...
Agreed, but only if we store all this max_size/distance stuff in u(16) so they cannot be arbitrarily large
NO! They really need to be arbitrarily large! If you have a file where each frame is at least 5 megs, max_distance is useless unless it's well over 5 megs!
Are you drunk? If all frames are 5 megs then max_distance==5mb will give you exactly the same file as any smaller max_distance. Now if max_distance is set to 10mb you reduce the overhead by 1 syncpoint every 10mb; if it's set to infinite you gain 2 syncpoints per 10mb and lose the ability to seek. 1 syncpoint needs approximately 8 (startcode) + 1 (forw_ptr) + 3 (back_ptr) + 4 (pts) + 4 (crc) = 20 bytes. Are you seriously insisting on allowing the user to set max_distance higher than 5mb?! That's a maximum gain of 40 bytes per 10mb, or 0.0003815%.

Now I hope that the average windows kid will be smarter than Rich and not come to the same conclusion that larger average frames somehow need larger max_distance, or otherwise we can forget the error-recovering capabilities of NUT in practice.

Sorry, but I won't continue these discussions; expect me to fork if this continues. Proposals should be judged on their effects (like overhead, complexity, error robustness, amount of memory or computations or delay, ...) but not on philosophical ones, nonexisting codecs, nonexisting demuxer architectures, nonexisting kernels and so on.

My proposed header compression, which has negligible complexity, would reduce the overhead by ~1% and was rejected based on nonexistent kernel and demuxer architectures.

So either 0.0003815% is significant enough that my header compression will go into the spec, or arbitrary-sized max_distance leaves it. Choose, but stop changing the rules for each thing depending upon whether you like it or not [...] -- Michael
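Michael's arithmetic can be redone as a sketch, using his 20-byte per-syncpoint estimate from the mail (the breakdown itself is his; only the function wrapper is illustrative):

```c
/* Worst-case saving from dropping 2 syncpoints per 10 MB, using the
 * per-syncpoint byte estimate quoted above:
 * 8 (startcode) + 1 (forw_ptr) + 3 (back_ptr) + 4 (pts) + 4 (crc). */
static double max_gain_percent(void)
{
    const int syncpoint_bytes = 8 + 1 + 3 + 4 + 4;  /* = 20 */
    const double span = 10.0 * 1024 * 1024;         /* 10 MB */
    return 100.0 * 2 * syncpoint_bytes / span;      /* ~0.0003815% */
}
```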

On Thu, Mar 02, 2006 at 08:00:42PM +0100, Michael Niedermayer wrote:
As for what we gain - more freedom for the muxer, and the ability to limit e.g. audio to a very small size without increasing overhead, and video to slightly higher to avoid additional overhead...
Agreed, but only if we store all this max_size/distance stuff in u(16) so they cannot be arbitrarily large
NO! They really need to be arbitrarily large! If you have a file where each frame is at least 5 megs, max_distance is useless unless it's well over 5 megs!
Are you drunk? If all frames are 5 megs then max_distance==5mb will give you exactly the same file as any smaller max_distance. Now if max_distance is
It's not about size or overhead, but usefulness. Distance between syncpoints being > max_distance should be a special case, not the general case. The demuxer implementation in principle has to do more to check for validity in this case. Maybe all that can be eliminated and it's not such a big deal, but I'm still against putting particular physical units in NUT. Today 64k is large. 10-15 years from now it may be trivial.
Now I hope that the average windows kid will be smarter than Rich and not come to the same conclusion that larger average frames somehow need larger max_distance, or otherwise we can forget the error-recovering capabilities of NUT in practice
:)
Sorry, but I won't continue these discussions; expect me to fork if this continues. Proposals should be judged on their effects (like overhead, complexity, error robustness, amount of memory or computations or delay, ...) but not on philosophical ones, nonexisting codecs, nonexisting demuxer architectures, nonexisting kernels and so on
Michael, please have some patience. When you spring a bunch of new things on us all of a sudden, there will be resistance, just like when our roles were reversed with the per-stream back_ptr/pts stuff. Forking and ignoring the concerns of the other people involved is possible, but it does much less to improve the overall design.

At that time I was adamantly against calling a vote (even though I probably could have 'won' the vote, as you said) or doing other things to polarize the situation. This has always been about making NUT as good as possible, not about people's personal egos, and where my ideas have turned out to be bad I've abandoned them. I hope we can reasonably discuss the remaining issues you've raised and reach a consensus on what the best design is, rather than flaming and forking and such.
My proposed header compression, which has negligible complexity, would reduce the overhead by ~1% and was rejected based on nonexistent kernel and demuxer architectures
Scratch kernel; the kernel architecture for it already exists. It's in POSIX and called posix_madvise. There is no demuxer to do zerocopy demuxing, but in the case where decoded frames fit in L2 cache easily, but the compressed frame is very large (i.e. high quality, high bitrate files -- the very ones where performance is a problem) zerocopy will make a significant improvement to performance. Sacrificing this to remove 1% codec overhead in crappy codecs is not a good tradeoff IMO. It would be easier to just make "MN custom MPEG4" codec that doesn't have the wasted bytes to begin with...
So either 0.0003815% is significant enough that my header compression will go into the spec, or arbitrary-sized max_distance leaves it. Choose, but stop changing the rules for each thing depending upon whether you like it or not
My objection to upper bound on max_distance had nothing to do with size. I'm sorry I wasn't clear. Rich

On Thu, Mar 02, 2006 at 05:16:23PM -0500, Rich Felker wrote:
Are you drunk? If all frames are 5 megs then max_distance==5mb will give you exactly the same file as any smaller max_distance. Now if max_distance is
It's not about size or overhead, but usefulness. Distance between syncpoints being > max_distance should be a special case, not the general case. The demuxer implementation in principle has to do more to check for validity in this case. Maybe all that can be eliminated and it's not such a big deal, but I'm still against putting particular
To clarify: if it turns out that there's no demuxer issues with max_distance being significantly smaller than all frames, I'm sorry for making an issue out of this. If that's the case maybe we can make a rule that max_distance cannot be larger than some fixed upper limit, but IMO we should still leave it vlc-coded in case there's a need to support larger max_distance in NUT 2.0 or something (way in the future).
My proposed header compression, which has negligible complexity, would reduce the overhead by ~1% and was rejected based on nonexistent kernel and demuxer architectures
Scratch kernel; the kernel architecture for it already exists. It's in POSIX and called posix_madvise. There is no demuxer to do zerocopy demuxing, but in the case where decoded frames fit in L2 cache easily, but the compressed frame is very large (i.e. high quality, high bitrate files -- the very ones where performance is a problem) zerocopy will make a significant improvement to performance. Sacrificing this to remove 1% codec overhead in crappy codecs is not a good tradeoff IMO. It would be easier to just make "MN custom MPEG4" codec that doesn't have the wasted bytes to begin with...
One other thing with this that I forgot to mention: it would be possible to support zerocopy for non-"header-compressed" files even if header compression were supported. My reason for not wanting to have this option was that it forces any demuxer with zerocopy support to also have a duplicate demuxing system for the other case. If this can be shown not to be a problem (i.e. a trivial way to support both without significant additional code or slowdown) I'm not entirely opposed to the idea. Rich

Hi

On Thu, Mar 02, 2006 at 06:11:17PM -0500, Rich Felker wrote: [...]
My proposed header compression, which has negligible complexity, would reduce the overhead by ~1% and was rejected based on nonexistent kernel and demuxer architectures
Scratch kernel; the kernel architecture for it already exists. It's in POSIX and called posix_madvise. There is no demuxer to do zerocopy demuxing, but in the case where decoded frames fit in L2 cache easily, but the compressed frame is very large (i.e. high quality, high bitrate files -- the very ones where performance is a problem) zerocopy will make a significant improvement to performance. Sacrificing this to remove 1% codec overhead in crappy codecs is not a good tradeoff IMO. It would be easier to just make "MN custom MPEG4" codec that doesn't have the wasted bytes to begin with...
One other thing with this that I forgot to mention: it would be possible to support zerocopy for non-"header-compressed" files even if header compression were supported. My reason for not wanting to have this option was that it forces any demuxer with zerocopy support to also have a duplicate demuxing system for the other case. If this can be shown not to be a problem (i.e. a trivial way to support both without significant additional code or slowdown) I'm not entirely opposed to the idea.
Here are a few random problems you will have with this zerocopy demuxing, all solvable sure, but it's a lot of work for very questionable gain:

* some bitstream readers in lavc have strict alignment requirements; frames cannot be aligned with zerocopy
* the vlc decoding of all mpeg and h26x codecs in lavc needs a bunch of zero bytes at the end to guarantee error detection before segfaulting
* several (not few) codecs write into the bitstream buffer, either to fix big/little endian stuff or, in at least one case, to reverse some lame obfuscation of a few bytes
* having the bitstream initially not in the L2 cache (I think that would be the case if you read by dma/busmastering) will mean that accesses to the uncompressed frame and bitstream will be interleaved; today's ram is optimized for sequential access, making the already slowest part even slower
* and yeah, the whole buffer management with zerocopy will be a nightmare, especially for a generic codec-muxer architecture where codec and muxer could run with a delay or on different threads

Basically my opinion on this is that it's like the video filter architecture: very strict idealistic goals which may or may not be all achievable at the same time, but which almost certainly will never be implemented as the code is too complex and too many things depend on too many [...] -- Michael

On Fri, Mar 03, 2006 at 01:12:55AM +0100, Michael Niedermayer wrote:
Hi
On Thu, Mar 02, 2006 at 06:11:17PM -0500, Rich Felker wrote: [...]
My proposed header compression, which has negligible complexity, would reduce the overhead by ~1% and was rejected based on nonexistent kernel and demuxer architectures
Scratch kernel; the kernel architecture for it already exists. It's in POSIX and called posix_madvise. There is no demuxer to do zerocopy demuxing, but in the case where decoded frames fit in L2 cache easily, but the compressed frame is very large (i.e. high quality, high bitrate files -- the very ones where performance is a problem) zerocopy will make a significant improvement to performance. Sacrificing this to remove 1% codec overhead in crappy codecs is not a good tradeoff IMO. It would be easier to just make "MN custom MPEG4" codec that doesn't have the wasted bytes to begin with...
One other thing with this that I forgot to mention: it would be possible to support zerocopy for non-"header-compressed" files even if header compression were supported. My reason for not wanting to have this option was that it forces any demuxer with zerocopy support to also have a duplicate demuxing system for the other case. If this can be shown not to be a problem (i.e. a trivial way to support both without significant additional code or slowdown) I'm not entirely opposed to the idea.
Here are a few random problems you will have with this zerocopy demuxing, all solvable sure, but it's a lot of work for very questionable gain
IMO the gain is not very questionable. Cutting out 25-50k of data that's moving through the cache per frame could make a significant difference to performance. And for rawvideo it could be even more extreme. (Naturally some filters will require alignment/aligned stride and thus copying, but direct playback should not.)
* some bitstream readers in lavc have strict alignment requirements; frames cannot be aligned with zerocopy
With a nice component system expressing alignment requirements, stride requirements, etc. for all frames and not treating decoded frames differently, this would be handled automatically. In any case, high-efficiency codecs have no word alignment (sometimes not even byte alignment?) so I doubt this is an issue for the ones that matter.
* the vlc decoding of all mpeg and h26x codecs in lavc needs a bunch of zero bytes at the end to guarantee error detection before segfaulting
:(
* several (not few) codecs write into the bitstream buffer, either to fix big/little endian stuff or, in at least one case, to reverse some lame obfuscation of a few bytes
This is probably a bad approach, for many reasons..
* having the bitstream initially not in the L2 cache (I think that would be the case if you read by dma/busmastering) will mean that accesses to the uncompressed frame and bitstream will be interleaved; today's ram is optimized for sequential access, making the already slowest part even slower
You can use prefetch instructions if needed.
* and yeah, the whole buffer management with zerocopy will be a nightmare, especially for a generic codec-muxer architecture where codec and muxer could run with a delay or on different threads
There is no buffer management on a 64bit system. You just mmap the whole file. For 32bit you'll have to lock things and update the map when you hit the address space limit.
Basically my opinion on this is that it's like the video filter architecture: very strict idealistic goals which may or may not be all achievable at the same time, but which almost certainly will never be implemented as the code is too complex and too many things depend on too many
IMO it's easy to implement (easier than an efficient one-copy system) -- it's just a single mmap. The strange (mis)behavior by various codecs is problematic, but it could possibly be solved too. BTW even if the source is not mmapped, readonly memory, there are still optimizations to be made by the demuxer exporting the same memory that was passed into it, the same as the MPI_EXPORT stuff in mplayer.

Whether we'll actually do any of this in the near future is of course doubtful, but in the long term it should be possible, especially as codecs get more and more performance-intensive and we need to work harder and harder to squeeze out maximum performance. I don't want to preclude this with NUT. However, again, like I said, if you believe that it would be possible to support both ways without a performance/complexity penalty over the zerocopy-only implementation, I'm willing to reconsider your 'header compression' idea.

Rich

Hi

On Thu, Mar 02, 2006 at 07:36:25PM -0500, Rich Felker wrote:
On Fri, Mar 03, 2006 at 01:12:55AM +0100, Michael Niedermayer wrote:
Hi
On Thu, Mar 02, 2006 at 06:11:17PM -0500, Rich Felker wrote: [...]
My proposed header compression, which has negligible complexity, would reduce the overhead by ~1% and was rejected based on nonexistent kernel and demuxer architectures
Scratch kernel; the kernel architecture for it already exists. It's in POSIX and called posix_madvise. There is no demuxer to do zerocopy demuxing, but in the case where decoded frames fit in L2 cache easily, but the compressed frame is very large (i.e. high quality, high bitrate files -- the very ones where performance is a problem) zerocopy will make a significant improvement to performance. Sacrificing this to remove 1% codec overhead in crappy codecs is not a good tradeoff IMO. It would be easier to just make "MN custom MPEG4" codec that doesn't have the wasted bytes to begin with...
One other thing with this that I forgot to mention: it would be possible to support zerocopy for non-"header-compressed" files even if header compression were supported. My reason for not wanting to have this option was that it forces any demuxer with zerocopy support to also have a duplicate demuxing system for the other case. If this can be shown not to be a problem (i.e. a trivial way to support both without significant additional code or slowdown) I'm not entirely opposed to the idea.
Here are a few random problems you will have with this zerocopy demuxing, all solvable sure, but it's a lot of work for very questionable gain
IMO the gain is not very questionable. Cutting out 25-50k of data
Rich, your opinion on how much gain something has is about as correlated with reality as (sign(gain + TINY_VAL*random()) * HUGE_VAL) :) so I surely agree that there will be a gain in some cases, maybe most cases, but I don't agree at all about its magnitude; IMHO it's <1%, which is not enough for the huge rewrite-the-world crusade for me, not to mention the significantly higher complexity of the resulting architecture
that's moving through the cache per frame could make a significant difference to performance. And for rawvideo it could be even more extreme. (Naturally some filters will require alignment/aligned stride and thus copying, but direct playback should not.)
I am still in favor of fread() into the hw video buffer for rawvideo ... not to mention that rawvideo is an irrelevant and rare case where a few percent speed won't matter; if I seriously needed fast rawvideo playback I'd write a small special-purpose player for it, not rewrite a generic multimedia architecture to be able to handle it better
* some bitstream readers in lavc have strict alignment requirements, frames cannot be aligned with zerocopy
With a nice component system expressing alignment requirements, stride requirements, etc. for all frames and not treating decoded frames differently, this would be handled automatically. In any case, high-efficiency codecs have no word alignment (sometimes not even byte alignment?) so I doubt this is an issue for the ones that matter.
current lavc will segfault with almost all codecs on some cpus if you feed unaligned buffers into it; this can be fixed in lavc relatively easily for most codecs, but it nicely shows how many people do such weird things. IMHO the whole zerocopy thing is idiotic, like the "singlethreaded player is always superior" rule: there's no question that fewer copies, fewer threads and less synchronization between threads is better, but it's not like that can be changed in isolation; other things depend on it, and the 1% you gain here might cause a 50% loss somewhere else [...]
* several (not few) codecs write into the bitstream buffer either to fix big-little endian stuff or in at least one case reverse some lame obfuscation of a few bytes
This is probably a bad approach, for many reasons..
i fully agree but its still the way its done currently ...
* having the bitstream initially not in the L2 cache (I think that would be the case if you read by dma/busmastering) will mean that accesses to the uncompressed frame and bitstream will be interleaved; today's ram is optimized for sequential access, thus making the already slowest part even slower
You can use prefetch instructions if needed.
won't help, and won't work (I tried this when playing with memcpy). one thing which would work is to do a dummy read pass over the bitstream buffer to force it into the cache; the difference from copying it into another spot would then be quite negligible. the code is limited by the mem speed, the writes wouldn't cost anything; the only thing you lose is a little cache thrashing, and whether that has any significance in practice is doubtful IMO
* and yeah the whole buffer management with zerocopy will be a nightmare especially for a generic codec-muxer architecture where codec and muxer could run with a delay or on different threads
There is no buffer management on a 64bit system. You just mmap the whole file. For 32bit you'll have to lock things and update the map when you hit the address space limit.
you can't just update the map when you hit the end; some packets might still be in various buffers/queues, maybe a buffer in a muxer, maybe a decoder, ... then there are non-interleaved files and seeking, in which cases a pure mmap variant on 32bit seems problematic. but don't hesitate to implement it; after it exists, works, has been benchmarked and is faster, I will happily demonstrate how header compression can be done without any speed loss. I mean, if we are already rewriting the whole demuxer architecture and fixing 10 different "issues" in lavc, what's the big problem with passing 2 bitstream buffers instead of one into the decoder? the first would be just the startcode and/or header, so only the header parsing would need to use a slower bitstream reader ... [...] -- Michael

On Fri, Mar 03, 2006 at 03:09:15PM +0100, Michael Niedermayer wrote:
here are a few random problems you will have with this zero copy demuxing; all solvable, sure, but it's a lot of work for very questionable gain
IMO the gain is not very questionable. Cutting out 25-50k of data
Rich, your opinion on how much gain something has is about as correlated with reality as (sign(gain + TINY_VAL*random()) * HUGE_VAL) :)
so I surely agree that there will be a gain in some cases, maybe most cases, but I don't agree at all about its magnitude; IMHO it's <1%, which is not enough for the huge rewrite-the-world crusade for me
Critics say the same thing about my libc replacement, and then when I actually test, memory usage by typical apps drops by 75-90% and performance increases a thousandfold for some simple C functions. The glibc-lovers of course still won't shut up after the testing. They'll claim that 100k per process is "small" and does not matter even when it's 50-75% of the memory used, and that everyone should be using a 2GHz machine bought with ten years' wages if they want to be able to write in their own language.
not to mention the significantly higher complexity of the resulting architecture
In some ways the architecture is simpler than what we have now, which is full of hacks. In any case, a new architecture is NECESSARY for h264, since right now we're destroying performance by blitting a frame up to 5 or more frames after it was decoded, totally destroying the cache... :(
that's moving through the cache per frame could make a significant difference to performance. And for rawvideo it could be even more extreme. (Naturally some filters will require alignment/aligned stride and thus copying, but direct playback should not.)
I am still in favor of fread() into the hw video buffer for rawvideo ...
I hope you mean read(). fread() will inherently be very slow.
not to mention that rawvideo is an irrelevant and rare case where a few percent speed won't matter; if I seriously needed fast rawvideo playback I'd write a small special-purpose player for it, not rewrite a generic multimedia architecture to be able to handle it better
A good generic architecture will already support this as a consequence of other things it needs to support for performance.
current lavc will segfault with almost all codecs on some cpus if you feed unaligned buffers into it; this can be fixed in lavc relatively easily for most codecs, but it nicely shows how many people do such weird things. IMHO the whole zerocopy thing is idiotic, like the "singlethreaded player is always superior" rule: there's no question that fewer copies, fewer threads and less synchronization between threads is better, but it's not like that can be changed in isolation; other things depend on it, and the 1% you gain here might cause a 50% loss somewhere else
Perhaps you'd like to demonstrate that mplayer is only 1% faster than the competition? Last I checked it was more like 10-200%, depending on which other player you're comparing to. Naturally this is not a result of being non-threaded by itself. It involves many factors, which include a reduction in the number of wasteful copies, lack of need for thread synchronization, and many other things I have no idea about. But you know as well as anyone else in ffmpeg development that many small things add up to huge performance advantage over the competition.
* having the bitstream initially not in the L2 cache (I think that would be the case if you read by dma/busmastering) will mean that accesses to the uncompressed frame and bitstream will be interleaved; today's ram is optimized for sequential access, thus making the already slowest part even slower
You can use prefetch instructions if needed.
won't help, and won't work (I tried this when playing with memcpy). one thing which would work is to do a dummy read pass over the bitstream buffer to force it into the cache; the difference from copying it into another spot would then be quite negligible. the code is limited by the mem speed, the writes wouldn't cost anything; the only thing you lose is a little cache thrashing, and whether that has any significance in practice is doubtful IMO
IMO it can easily be tested. Just write 20-40k of random crap to some unused memory buffer while decoding a video that barely fits in cache and watch the change in performance.
* and yeah the whole buffer management with zerocopy will be a nightmare especially for a generic codec-muxer architecture where codec and muxer could run with a delay or on different threads
There is no buffer management on a 64bit system. You just mmap the whole file. For 32bit you'll have to lock things and update the map when you hit the address space limit.
you can't just update the map when you hit the end; some packets might still be in various buffers/queues, maybe a buffer in a muxer, maybe a decoder, ...
All you need is a pointer keeping track of the earliest point in the stream still needed. You don't unmap the whole map, just the part before this point. It's a basic variable-size circular buffer implementation (which I was planning to support in a very generic way in a next-gen player), except with munmap/mmap instead of realloc. IIRC it's even possible to share this between threads/processes without any additional locking mechanisms.
then there are non-interleaved files and seeking, in which cases a pure mmap variant on 32bit seems problematic
No, non-interleaved case is easy. You simply treat it the same as -audiofile, i.e. open the file twice and treat the audio and video parts separately. This is needed anyway to make -cache work. The special-casing for non-interleaved AVI in MPlayer is a stupid hack.
but don't hesitate to implement it; after it exists, works, has been benchmarked and is faster, I will happily demonstrate how header compression can be done without any speed loss. I mean, if we are already rewriting the whole demuxer architecture and fixing 10 different "issues" in lavc, what's the big problem with passing 2 bitstream buffers instead of one into the decoder? the first would be just the startcode and/or header, so only the header parsing would need to use a slower bitstream reader ...
The intent was not to modify the codecs with hacks to support this specially, but to find a way to make sure 'onecopy on top of zerocopy' is as fast as ordinary demuxing with a single copy. The problem is that if a demuxer implements zerocopy, its input buffer might not actually be the filesystem cache buffer, depending on the player's implementation. It may have already been copied once. This does not incur a performance penalty if the demuxer guarantees that it will always output the same buffer that was given to it, but it is a problem if the demuxer might copy it into another third buffer. Maybe it's acceptable just to have the additional overhead in the demuxer for treating the two cases separately. Anyway, I don't want to argue and flame over this. If you really want the header compression, please either come up with a good solution or just say that you insist on deferring that question until someone implements zerocopy. If the latter is the case, then go ahead and do it. It's ok with me. Just please don't bash and flame the whole zerocopy concept which is not the subject at hand. Rich

Hi On Fri, Mar 03, 2006 at 10:28:50AM -0500, Rich Felker wrote: [...]
current lavc will segfault with almost all codecs on some cpus if you feed unaligned buffers into it; this can be fixed in lavc relatively easily for most codecs, but it nicely shows how many people do such weird things. IMHO the whole zerocopy thing is idiotic, like the "singlethreaded player is always superior" rule: there's no question that fewer copies, fewer threads and less synchronization between threads is better, but it's not like that can be changed in isolation; other things depend on it, and the 1% you gain here might cause a 50% loss somewhere else
Perhaps you'd like to demonstrate that mplayer is only 1% faster than the competition? Last I checked it was more like 10-200%, depending on which other player you're comparing to. Naturally this is not a result of being non-threaded by itself. It involves many factors, which include a reduction in the number of wasteful copies, lack of need for thread synchronization, and many other things I have no idea about. But you know as well as anyone else in ffmpeg development that many small things add up to huge performance advantage over the competition.
mplayer might beat xine, ffplay and others by 1-200% in raw performance, but guess which player I used yesterday to watch the 4th episode of AIR? ffplay. why? because mplayer's video output was not fluid/smooth and no messing with its options helped; yes, xine was almost as good as ffplay, mplayer was far behind. and yes, the file was not local on my hd (no space) but on another computer, and I viewed it over shfs, which isn't the most bugfree thing, so you can argue that's the reason, but still, in practice the 1-200% were not what mattered ... [...]
then there are non-interleaved files and seeking, in which cases a pure mmap variant on 32bit seems problematic
No, non-interleaved case is easy. You simply treat it the same as -audiofile, i.e. open the file twice and treat the audio and video parts separately. This is needed anyway to make -cache work. The special-casing for non-interleaved AVI in MPlayer is a stupid hack.
there might be more than 2 streams in a non-interleaved file ... and there's some wasted memory in case of double opens (the index, for example), and there's the issue of seeking independent of interleaving: how do you do a binary search with that single-buffer thing? [...] -- Michael

Hi On Thu, Mar 02, 2006 at 05:16:23PM -0500, Rich Felker wrote: [...]
sorry, but I won't continue these discussions; expect me to fork if this continues. proposals should be judged on their effects (like overhead, complexity, error robustness, amount of memory or computation or delay, ...), not on philosophical grounds, nonexistent codecs, nonexistent demuxer architectures, nonexistent kernels and so on
Michael, please have some patience. When you spring a bunch of new things on us all of a sudden, there will be resistance, just like when our roles were reversed with the per-stream back_ptr/pts stuff. Forking and ignoring the concerns of the other people involved is possible, but it does much less to improve the overall design. At that time I was adamantly against calling a vote (even though I probably could have 'won' the vote, as you said) or doing other things to polarize the situation. This has always been about making NUT as good as possible, not about people's personal egos, and where my ideas have turned out to be bad I've abandoned them. I hope we can reasonably discuss the remaining issues you've raised and reach a consensus on what the best design is, rather than flaming and forking and such.
you ask someone else to stop flaming? ;)
my proposed header compression, which has negligible complexity, would reduce the overhead by ~1% and was rejected based on nonexistent kernel and demuxer architectures
Scratch kernel; the kernel architecture for it already exists. It's in POSIX and called posix_madvise. There is no demuxer to do zerocopy demuxing, but in the case where decoded frames fit in L2 cache easily, but the compressed frame is very large (i.e. high quality, high bitrate files -- the very ones where performance is a problem) zerocopy will make a significant improvement to performance. Sacrificing this to remove 1% codec overhead in crappy codecs is not a good tradeoff IMO. It would be easier to just make "MN custom MPEG4" codec that doesn't have the wasted bytes to begin with...
this reminds me of my little mpeg1 experiment: replacing the entropy coder with an ac coder gave ~10% lower bitrate ...
so either 0.0003815% is significant enough that my header compression will go into the spec, or arbitrary-sized max_distance leaves it. choose, but stop changing the rules for each thing depending upon whether you like it or not
My objection to upper bound on max_distance had nothing to do with size. I'm sorry I wasn't clear.
and what is the terrible thing which will/could happen if there were an upper limit? I mean from the user or developer perspective (complexity, speed, overhead, ...), not the idealist/philosopher perspective (it's wrong, bad design, this is like <infamous container> does it, it will make <extremely rare use case> slightly slower, ...) [...] -- Michael

On Fri, Mar 03, 2006 at 12:30:22AM +0100, Michael Niedermayer wrote:
On Thu, Mar 02, 2006 at 05:16:23PM -0500, Rich Felker wrote:
My objection to upper bound on max_distance had nothing to do with size. I'm sorry I wasn't clear.
and what is the terrible thing which will/could happen if there were an upper limit? I mean from the user or developer perspective (complexity, speed, overhead, ...)
I could just as well ask - what does the user/developer GAIN from there being an upper bound? yes, if there is no upper bound you can make a broken file, but you can ALWAYS make a broken/inefficient file if you are being deliberately ignorant...
not the idealist/philosopher perspective (it's wrong, bad design, this is like <infamous container> does it, it will make <extremely rare use case> slightly slower, ...)
To this same sentiment, I ask: what is gained from your new index system? You never replied to my rant. Yes, you did make it optional, which makes it somewhat nicer, but I am still annoyed by the index_ptr and the additional demuxer complexity for this index system (reallocing the syncpoint cache for each index chunk, reading several chunks...), mostly because I see absolutely no gain. There's a pretty damn good chance that if the index is borked, the entire index is borked, not just a small piece of it. And the file still plays perfectly fine without the index... ("it will make <extremely rare use case> slightly slower" ...) - ods15

Hi On Fri, Mar 03, 2006 at 01:11:50PM +0200, Oded Shimon wrote:
On Fri, Mar 03, 2006 at 12:30:22AM +0100, Michael Niedermayer wrote:
On Thu, Mar 02, 2006 at 05:16:23PM -0500, Rich Felker wrote:
My objection to upper bound on max_distance had nothing to do with size. I'm sorry I wasn't clear.
and what is the terrible thing which will/could happen if there were an upper limit? I mean from the user or developer perspective (complexity, speed, overhead, ...)
I could just as well ask - what does the user/developer GAIN from there being an upper bound?
yes, you can and should ask; it makes it much harder to generate broken files
yes, if there is no upper bound you can make a broken file, but you can ALWAYS make a broken/inefficient file if you are being deliberately ignorant...
master, we have 2 issues, let's call them A and B. without fixing A the user can make a broken file; without fixing B the user can make a broken file too. now fixing A is a bad idea, as the user can still ALWAYS make a broken/inefficient file if he is deliberately ignorant (by using B). now fixing B is a bad idea, as the user can still ALWAYS make a broken/inefficient file if he is deliberately ignorant (by using A). so what do we do? yes, we fix neither. I like your arguments, they are amusing
not the idealist/philosopher perspective (it's wrong, bad design, this is like <infamous container> does it, it will make <extremely rare use case> slightly slower, ...)
To this same sentiment, I ask, what is gained from your new index system? You never replied to my rant.. Yes you did make it optional which makes it
error robustness and easier muxing for muxers which are not based on libnut
somewhat nicer, I am still annoyed by the index_ptr and the additional demuxer complexity for this index system (reallocing the syncpoint cache for each index chunk, reading several chunks...), mostly because I see
you need that for dynamic index building too, if there's no index at all; and I am not against adding a syncpoint_count field which would provide you with the number of syncpoints in the whole index
absolutely no gain. There's a pretty damn good chance that if the index is borked, the entire index is borked, not just a small piece of it. And the file still plays perfectly fine without the index... ("it will make <extremely rare use case> slightly slower" ...)
it's not slightly, and it's not rare IMHO [...] -- Michael

Hi On Fri, Mar 03, 2006 at 03:30:50PM +0100, Michael Niedermayer wrote: [...]
not the idealist/philosopher perspective (it's wrong, bad design, this is like <infamous container> does it, it will make <extremely rare use case> slightly slower, ...)
To this same sentiment, I ask, what is gained from your new index system? You never replied to my rant.. Yes you did make it optional which makes it
error robustness and easier muxing for muxers which are not based on libnut
somewhat nicer, I am still annoyed by the index_ptr and the additional demuxer complexity for this index system (reallocing the syncpoint cache for each index chunk, reading several chunks...), mostly because I see
you need that for dynamic index building too, if there's no index at all; and I am not against adding a syncpoint_count field which would provide you with the number of syncpoints in the whole index
absolutely no gain. There's a pretty damn good chance that if the index is borked, the entire index is borked, not just a small piece of it. And the file still plays perfectly fine without the index... ("it will make <extremely rare use case> slightly slower" ...)
it's not slightly, and it's not rare IMHO
but if you and Rich prefer a single monolithic index, then well, let's return to that; it's not the most critical issue ... [...] -- Michael

On Fri, Mar 03, 2006 at 03:44:18PM +0100, Michael Niedermayer wrote:
absolutely no gain. There's a pretty damn good chance that if the index is borked, the entire index is borked, not just a small piece of it. And the file still plays perfectly fine without the index... ("it will make <extremely rare use case> slightly slower" ...)
it's not slightly, and it's not rare IMHO
but if you and Rich prefer a single monolithic index, then well, let's return to that; it's not the most critical issue ...
I prefer a monolithic index, but more importantly I want it to be writable in one pass without buffering. However, in order to write an index you need to have buffered all the data that goes into the index while writing the file, and if it was possible to buffer all that then it should be possible to reuse that memory (if nothing else) to store the whole nut-format index. Also, you could do a dummy-run first, writing the index to /dev/null to get its size, then once you know the size do the actual unbuffered write. This is similar to my recommended replacement for GNU libc's nonstandard asprintf: call snprintf once to get length, malloc, then snprintf again to store the string. Anyway, if there are good reasons that the broken-up index is better than monolithic, I'm happy to consider it. However, I believe the possibility of one index segment being broken and the rest being ok is extremely rare, like Oded said, and it's also (slightly) complicated for a demuxer to decide what to use the partial index for in this case anyway. Rich

Hi On Fri, Mar 03, 2006 at 10:36:25AM -0500, Rich Felker wrote:
On Fri, Mar 03, 2006 at 03:44:18PM +0100, Michael Niedermayer wrote:
absolutely no gain. There's a pretty damn good chance that if the index is borked, the entire index is borked, not just a small piece of it. And the file still plays perfectly fine without the index... ("it will make <extremely rare use case> slightly slower" ...)
it's not slightly, and it's not rare IMHO
but if you and Rich prefer a single monolithic index, then well, let's return to that; it's not the most critical issue ...
I prefer a monolithic index, but more importantly I want it to be writable in one pass without buffering. However, in order to write an index you need to have buffered all the data that goes into the index while writing the file, and if it was possible to buffer all that then it should be possible to reuse that memory (if nothing else) to store the whole nut-format index.
Also, you could do a dummy-run first, writing the index to /dev/null to get its size, then once you know the size do the actual unbuffered write. This is similar to my recommended replacement for GNU libc's nonstandard asprintf: call snprintf once to get length, malloc, then snprintf again to store the string.
Anyway, if there are good reasons that the broken-up index is better than monolithic, I'm happy to consider it. However, I believe the possibility of one index segment being broken and the rest being ok is extremely rare, like Oded said, and it's also (slightly) complicated for a demuxer to decide what to use the partial index for in this case anyway.
ok, fine, let's go back to the monolithic index; I'll change it back after the flags cleanup unless someone else does it before... [...] -- Michael

On Thu, Mar 02, 2006 at 11:55:41AM +0100, Michael Niedermayer wrote:
@@ -474,12 +508,15 @@ 1 is_key if set, frame is keyframe 2 end_of_relevance if set, stream has no relevance on presentation. (EOR) + 4 has_checksum if set then the frame header contains a checksum
Like I said, this should be a NUT flag...
and I disagree
Putting it here is blatantly wrong, since it applies to the NUT data structure, not the semantics of the frame. Also, I'm still against this: the system integrating the checksum with the syncpoint was more robust against errors and less complicated. Rich
participants (3): Michael Niedermayer, Oded Shimon, Rich Felker