[MPlayer-dev-eng] More on timestamps in NUT

Mon May 3 07:01:31 CEST 2004

Earlier I made a rather hasty proposal about timestamp handling in
NUT. I've been thinking about stuff more and working out some
theoretical overhead calculations, and now I have something more
substantial.

First, an explanation of the problem:

Michael noted on the cvslog list (in a thread originating from some
changes he made to mpcf.txt, the in-progress NUT spec) some
unfortunate consequences of the current timestamp coding system in NUT.
In the working spec, the muxer and demuxer keep track of the three
most recent timestamp deltas for each stream, and then subsequent
frames can be coded to reuse these deltas, rather than storing the lsb
timestamp or full timestamp explicitly. This is particularly useful
for video streams with B frames (which will have a timestamp pattern
of +N+1, -N, +1 (N times), where N is the number of B frames), and for
vorbis audio, where there are 3 possible frame durations (128, 576,
and 1024). It also allows most packet headers to be encoded in a
single byte for very-low-bitrate streams (where overhead matters a
lot).

The problem arises with error resilience. I had been supposing that
after an error, we could resync to the next valid packet using some
nice tricks I worked out (which are explained in another thread). But
Michael pointed out that such damage can mess up the timestamp delta
prediction entirely, leading to completely bogus timestamps. Thus, it
seems that in the presence of timestamp delta predictors, error
resilience can only recover at the next lsb-coded or fully-coded
timestamp.

It is important to note that the problem is not delta-coded timestamps
in general, but delta-PREDICTED timestamps. If timestamps are coded
with fixed deltas, broken packets could lead to slight A/V desync
until the next lsb-coded or fully-coded timestamp, but the error will
not grow. On the other hand, delta-predicted timestamps were put in
the spec for a reason: they have the advantage that the muxer writing
the file does not need to have any advance knowledge about the codec.
They are, however, no more efficient than having a fixed table of
possible deltas in the global header; in fact, a fixed table of deltas
CAN be more efficient.
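
A minimal sketch of the difference, in Python. It assumes the
predictors behave like a move-to-front list of the most recent deltas
(an interpretation of the working spec, not its exact wording); the
function names and the sample delta sequence are made up for
illustration.

```python
# Toy model: one lost packet with predicted vs. fixed timestamp deltas.
# ASSUMPTION: the predictors form a move-to-front list of recent deltas.

def encode_predicted(deltas, table):
    """Code each delta as its index in a move-to-front list."""
    table = list(table)
    codes = []
    for d in deltas:
        i = table.index(d)
        codes.append(i)
        table.insert(0, table.pop(i))  # most recently used goes first
    return codes

def decode_predicted(codes, table, lost=None):
    """Decode indices; the packet at position `lost` never arrives."""
    table = list(table)
    out = []
    for n, i in enumerate(codes):
        if n == lost:
            continue  # decoder state is NOT updated for the lost packet
        d = table.pop(i)
        table.insert(0, d)
        out.append(d)
    return out

def decode_fixed(codes, table, lost=None):
    """Decode indices into a fixed table; a loss affects only itself."""
    return [table[i] for n, i in enumerate(codes) if n != lost]

TABLE = [128, 576, 1024]
true_deltas = [1024, 128, 576, 1024, 128, 576]

pred_codes = encode_predicted(true_deltas, TABLE)
fixed_codes = [TABLE.index(d) for d in true_deltas]

after_loss_pred = decode_predicted(pred_codes, TABLE, lost=0)
after_loss_fixed = decode_fixed(fixed_codes, TABLE, lost=0)
print("predicted:", after_loss_pred)
print("fixed:    ", after_loss_fixed)
```

In this example every delta decoded after the lost packet comes out
wrong under prediction, because the decoder's list no longer matches
the muxer's. With the fixed table the surviving packets decode
correctly and only the lost frame's delta is missing.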

Recall the "optimal" framecode table for single-stream low-bitrate
vorbis audio:

171 codes with{
    keyframe=1
    stream_id=0
    pts= 00,01,10 (3 delta predictors)
    size_mul=230
    data_size_coded=0
    data_size_lsb= 170..226
}
66 codes with{
    keyframe=1
    stream_id=0
    pts= 00,01,10 (3 delta predictors)
    size_mul=52
    data_size_coded=0
    data_size_lsb= 30..51
}
16 codes with{
    keyframe=1
    stream_id=0
    pts= 00,01,10,11 (3 delta predictors, plus vlc)
    size_mul=4
    data_size_coded=1
    data_size_lsb= 0..3
}

Note that we waste lots of framecodes, because the delta predictors
switch around in order according to which was most/least recently
used, and thus for each size we have to have a framecode for delta0,
delta1, and delta2. If we instead had fixed deltas...

120 codes with{
    keyframe=1
    stream_id=0
    pts=+1024
    size_mul=261
    data_size_coded=0
    data_size_lsb= 141..260
}
80 codes with{
    keyframe=1
    stream_id=0
    pts=+576
    size_mul=141
    data_size_coded=0
    data_size_lsb= 61..140
}
50 codes with{
    keyframe=1
    stream_id=0
    pts=+128
    size_mul=71
    data_size_coded=0
    data_size_lsb= 21..70
}
4 codes with{
    keyframe=1
    stream_id=0
    pts=vlc
    size_mul=4
    data_size_coded=1
    data_size_lsb= 0..3
}

...and every frame is basically guaranteed a 1-byte header.
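
The code counts in the two tables above can be checked with a quick
tally. This is only an arithmetic sketch in Python: the tuples are
transcribed from the tables, and the limit of 256 assumes a framecode
is a single byte.

```python
# Each entry: (pts_choices, first_size_lsb, last_size_lsb, size_coded)
predicted = [
    (3, 170, 226, False),   # size_mul=230
    (3, 30, 51, False),     # size_mul=52
    (4, 0, 3, True),        # size_mul=4, 3 predictors plus vlc
]
fixed = [
    (1, 141, 260, False),   # pts=+1024
    (1, 61, 140, False),    # pts=+576
    (1, 21, 70, False),     # pts=+128
    (1, 0, 3, True),        # pts=vlc
]

def total_codes(table):
    """Number of framecodes the table consumes."""
    return sum(pts * (hi - lo + 1) for pts, lo, hi, _ in table)

def direct_sizes(table):
    """Number of size values reachable without a coded size field."""
    return sum(hi - lo + 1 for _, lo, hi, coded in table if not coded)

for name, table in (("predicted", predicted), ("fixed", fixed)):
    print(name, total_codes(table), "codes,",
          direct_sizes(table), "directly coded sizes")
```

Both tables fit in one byte (253 and 254 codes), but the predicted
table spends three codes on every size, so it covers only 79 sizes
directly where the fixed table covers 250 delta/size combinations.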

We have seen that fixed timestamp deltas provide more efficient use of
the framecode table and do not interfere with error resilience.
Further, they will allow efficient storage of future audio codecs that
might need more than three possible deltas, and likewise future video
codecs that might have a more complicated frame reordering. The only
disadvantage of fixed deltas is that the muxer must be preprogrammed
with the list of possible deltas. However, even with a naive framecode
table that does not use any timestamp deltas, the overhead of NUT is
considerably lower than that of most or all other containers. For special
applications where ultra-low overhead is needed (such as streaming
vorbis audio at 8-24 kbit/sec), it is not unreasonable to expect the
user to load a custom framecode table into the muxer to optimize the
overhead.

THEREFORE, I propose that we remove timestamp delta prediction from
NUT, and put in its place fixed timestamp deltas in the framecode
table. (As a less radical proposal, we could choose to support both
and strongly recommend that delta prediction NOT be used.)

On to the next topic...

The current draft of the spec calls for each stream to fully code the
timestamp in its next frame after a type-2 startcode. Depending on the
time base units in use, this results in an overhead of at least 8+3*N
bytes per startcode, where N is the number of streams (for a 1-2 hour
movie, you need at least 3 bytes to store a timestamp, or worse if you
choose/need a bad time base). For error resilience purposes, it may be
desirable to put startcodes fairly frequently, and at low bitrates
this could result in considerable overhead.

If we have lots of streams, it seems redundant to code the full
timestamps for ALL of them. In fact, all the timestamps should be
approximately the same, due to proper interleaving. Let's take
advantage of that redundancy by storing a single timestamp with the
startcode, and calling the startcode+timestamp unit a "sync point".
This timestamp can be in a global time base, specified in the global
header.

Having a global time base (that may, but doesn't necessarily,
correspond to one or more of the stream time bases) has an additional
advantage, in that we can use it for indexing. Our intent has been
that the index would point to startcodes anyway, rather than packets
of a particular stream, so it makes sense for the startcodes to have
their own timestamps. When a demuxer encounters a sync point, it would
convert the associated timestamp into the separate time bases of each
stream, and consider future delta/lsb timestamps to be relative to
that time.
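
A rough sketch of that conversion in Python, assuming time bases are
expressed as rational numbers of seconds per tick and that rounding
is downward. The function name and the example time bases (1/1000
global, 1/25 video, 1/44100 audio) are hypothetical, not taken from
the spec.

```python
from fractions import Fraction

def to_stream_time(sync_ts, global_base, stream_base):
    """Convert a sync point timestamp (in global ticks) to stream ticks.

    Bases are seconds per tick. Floor division rounds down, which is
    the same as truncation for non-negative timestamps.
    """
    return (sync_ts * global_base) // stream_base

GLOBAL = Fraction(1, 1000)    # hypothetical global time base: 1 ms
VIDEO = Fraction(1, 25)       # hypothetical 25 fps video stream
AUDIO = Fraction(1, 44100)    # hypothetical 44.1 kHz audio stream

sync_ts = 61234               # sync point at 61.234 seconds
video_base = to_stream_time(sync_ts, GLOBAL, VIDEO)
audio_base = to_stream_time(sync_ts, GLOBAL, AUDIO)
print(video_base, audio_base)

# A later frame with a relative timestamp is resolved against the base.
next_audio_pts = audio_base + 1024
```

Using exact rationals avoids floating point error in the
multiplication; a C demuxer would presumably do the same thing with
64-bit integer arithmetic.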

Note that under this system, the overhead from startcodes is reduced
from 8+3*N (or more) to something like 11+N. Or, in the case of a
naive muxer that is always coding the lsb timestamps on each frame,
the additional overhead for a sync point is only a constant 11 bytes
(or 10 or 12, depending on the time base)!
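
The two formulas are easy to compare for a few stream counts. The
sketch below is plain arithmetic in Python; it assumes a 3-byte
timestamp and reads the "+N" term as one lsb timestamp byte per
stream, both of which depend on the time base in practice.

```python
def old_overhead(n_streams, ts_bytes=3):
    """Type-2 startcode: 8 bytes plus a full timestamp per stream."""
    return 8 + ts_bytes * n_streams

def sync_point_overhead(n_streams, ts_bytes=3, lsb_bytes=1):
    """Sync point: 8 bytes, one timestamp, one lsb byte per stream."""
    return 8 + ts_bytes + lsb_bytes * n_streams

for n in range(1, 6):
    print(n, old_overhead(n), sync_point_overhead(n))
```

Under these assumptions the sync point costs one byte more for a
single stream (12 vs. 11), and wins from two streams upward, with the
saving growing by 2 bytes for each additional stream.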

THEREFORE, I propose that we replace the type-2 startcodes with a sync
point packet containing its own timestamp, specify that subsequent
relative/lsb timestamps are based on the sync point timestamp, and
remove the requirement that frames following a type-2 startcode fully
code their timestamps.

Finally, I want to get "type-1" startcodes into the spec. We've been
talking about them long enough, but they're still not there.
Basically, "type-1" startcodes, which I'll call recovery points, are
short 3-byte startcodes used to aid in error recovery. The reason they
can be short is that they do not need to reliably avoid collisions with
ordinary data, because unlike sync points they will not be used for
seeking. Instead
they are used after an invalid packet is decoded in order to find the
next valid packet. Demuxers that don't want perfect error resilience
can just resume at the next recovery point or sync point, while
"hardcore" demuxers can use a brute force approach of testing each
byte to see if it is the start of a chain leading up to the next
recovery or sync point.
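
The brute force approach can be sketched on a toy format. Everything
here is hypothetical: the packets are just a 1-byte length followed
by that many payload bytes, and the 3-byte marker value is invented.
Real NUT packets have a different header, but the chain-walking idea
is the same.

```python
MARKER = b"\x00\x00\x01"   # hypothetical 3-byte recovery point

def chain_reaches(buf, pos, target):
    """Follow packet lengths from pos; True if we land exactly on target."""
    while pos < target:
        pos += 1 + buf[pos]        # toy header: 1 length byte + payload
    return pos == target

def resync(buf, start):
    """Return the earliest offset >= start whose packet chain ends at
    the next recovery point, or None if no recovery point follows."""
    target = buf.find(MARKER, start)
    if target < 0:
        return None
    for cand in range(start, target + 1):
        if chain_reaches(buf, cand, target):
            return cand            # cand == target always succeeds
    return target

def packet(payload):
    return bytes([len(payload)]) + payload

stream = packet(b"aaa") + packet(b"bb") + packet(b"cccc") + MARKER

# Pretend the first header byte was damaged and start scanning at offset 1.
print(resync(stream, 1))
```

A simple demuxer would skip straight to `target`; the brute force
version recovers the two intact packets before it as well. A false
chain is possible when damaged bytes happen to add up, so a real
demuxer would want further sanity checks on the recovered packets.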

THEREFORE, I propose that we add optional 3-byte recovery points to
the NUT spec, which muxers can use at their discretion to improve the
error resilience of the file. (Discussion point: perhaps 2 bytes is
enough?)

To summarize, my proposal is:

1. Add to the global header a time base used for sync points and for
   the (optional) index.

2. Replace the basic "type-2" startcode with a "sync point" packet.
   This packet consists of the 8-byte startcode and a vlc timestamp
   field given in the global time base unit.

3. The first frame of each stream after a sync point will use the
   timestamp of the sync point (converted into the stream's time base,
   with rounding done by truncation) as its base for relative
   timestamps.

4. Remove timestamp delta prediction entirely.

5. Allow frame codes to specify a fixed timestamp delta.

6. Add recovery points (type-1 startcodes).

Please feel free to argue against any of these proposals if you think
they're bad. However, I believe they all help to sanitize NUT and
greatly improve the error resilience, and while they may increase
overhead in the 'naive' case where the muxer is not well-configured
for the particular job it's doing, they actually reduce overhead when
the muxer is loaded with a good framecode table.

Rich