!!! DRAFT DRAFT DRAFT !!!

DRAFT USAGE / SEMANTICS / RATIONALE SECTIONS FOR NUT SPEC


Overview of NUT

Unlike many popular containers, a NUT file can largely be viewed as a
byte stream, as opposed to having a global block structure. NUT files
consist of a sequence of packets, which can contain global headers,
file metadata, stream headers for the individual media streams,
optional index data to accelerate seeking, and, of course, the actual
encoded media frames. Aside from frames, all packets begin with a
64-bit startcode, the first byte of which is 0x4E, the ASCII character
'N'. In addition to identifying the type of packet to follow, these
startcodes (combined with CRC) allow for reliable resynchronization
when reading damaged or incomplete files. Packets have a common
structure that enables a process reading the file both to verify
packet contents and to bypass uninteresting packets without having to
be aware of the specific packet type.

In order to facilitate identification and playback of NUT files,
strict rules are imposed on the location and order of packets and
streams. Streams can be of class video, audio, subtitle, or
user-defined data. Additional classes may be added in a later version
of the NUT specification. Streams must be numbered consecutively
beginning from 0. This allows simple and compact reference to streams
in packet types where overhead must be kept to a minimum.

Header Structure

A NUT file must begin with a magic identification string, followed by
the main header and a stream header for each stream, ordered by stream
id. No other packets may intervene between these header packets. For
robustness, a NUT file needs to include backup copies of the headers.
In the absence of valid headers at the beginning of the file,
processes attempting to read a NUT file are recommended to search for
backup headers beginning at each power-of-two byte offset in the file.
Simple stop conditions are provided to ensure that this search
algorithm is bounded logarithmically in file length.

Metadata - Info Packets

The NUT main header and stream headers may be followed by metadata
"info" packets, which contain (mostly textual, but other formats are
possible) information on the file, on particular streams, or on
particular time intervals ("chapters") of the file, such as: title,
author, language, etc. One should not that info packets may occur at
other locations in a file, particulatly in a file that is being
generated/transmitted in real time; however, a process interpreting a
NUT file should not make any attempt to search for info packets except
in their usual location, i.e. following the headers. It is intended
that processes presenting the contents of a NUT file will make
automated responses to information stored in these packets, e.g.
selecting a subtitle language based on the user's preferred list of
languages, or providing a visual list of chapters to the user.
Therefore, the format of info packets and the data they are to contain
has been carefully specified and is aligned with International
Standards for language codes and so forth. For this reason it is also
important that info packets be stored in the correct locations, so
that processes making automated responses to these packets can operate
correctly.

Index

An index packet to facilitate O(1) seek-to-time operations may follow
the headers. If an index packet does exist here, it should be placed
after info packets, rather than before. Since the contents of the
index depend on knowing the complete contents of the file, most
processes generating NUT files are not expected to store an index with
the headers. This option is merely provided for applications where it
makes sense, to allow the index to be read without any seek operations
on the underlying media when it is available.

On the other hand, all NUT files except live streams (which have no
concept of "end of file") must include an index at the end of the
file, followed by a fixed-size 32-bit integer that is an offset
backwards from end-of-file at which the final index packet begins.
This is the only fixed-size field specified by NUT, and makes it
possible to locate an index stored at the end of the file without
resorting to unreliable heuristics.

Streams

A NUT file consists of one or more streams, intended to be presented
simultaneously in synchronization with one another. Use of streams as
independent entities is discouraged, and the nature of NUT's ordering
requirements on frames makes it highly disadvantageous to store
anything except the audio/video/subtitle/etc. components of a single
presentation together in a single NUT file. Nonlinear playback order,
scripting, and such are topics outside the scope of NUT, and should be
handled at a higher protocol layer should they be desired (for
example, using several NUT files with an external script file to
control their playback in combination).

With each stream, a single media encoding format is associated. The
stream headers convey properties of the encoding, such as video frame
dimensions, sample rates, and the compression standard ("codec") used
(if any). Stream headers may also carry with them an opaque, binary
object in a codec-specific format, containing global parameters for
the stream such as codebooks. Both the compression format and whatever
parameters are stored in the stream header (including NUT fields and
the opaque global header object) are constant for the duration of the
stream.

Frames

NUT is built on the model that video, audio, and subtitle streams all
consist of a sequence of "frames", where the specific definition of
frame is left partly to the codec, but should be roughly interpreted
as the smallest unit of data which can be decoded (not necessarily
independently; it may depend on previously-decoded frames) to a
complete presentation unit occupying an interval of time. In
particular, video frames correspond to the usual idea of a frame as a
picture that is displayed beginning at its assigned timestamp until it
is replaced by a subsequent picture with a later timestamp. Subtitle
frames should be thought of as individual subtitles in the case of
simple text-only streams, or as events that alter the presentation in
the case of more advanced subtitle formats. Audio frames are merely
intervals of samples; their length is determined by the compression
format used.

Frames need not be decoded in their presentation order. NUT allows for
arbitrary out-of-order frame systems, from classic MPEG-1-style B
frames to H.264 B pyramid and beyond, using a simple notion of "delay"
and an implicitly-determined "decode timestamp" (dts). Out-of-order
decoding is not limited to video streams; it is available to audio
streams as well, and, given the right conditions, even subtitle
streams, should a subtitle format choose to make use of such a
capability.

Central to NUT is the notion that EVERY frame has a timestamp. This
differs from other major container formats which allow timestamps to
be omitted for some or even most frames. The decision to explicitly
timestamp each frame allows for powerful high-level seeking and
editing in applications without any interaction with the codec level.
This makes it possible to develop applications which are completely
unaware of the codecs used, and allows applications which do need to
perform decoding to be more properly factored.

Keyframes

NUT defines a "key frame" as any frame such that the frame itself and
all subsequent (with regard to presentation time) frames of the stream
can be decoded successfully without reference to prior (with regard to
storage/decoding order) frames in the stream. This definition may
sometimes be bent on a per-codec basis, particularly with audio
formats where there is MDCT window overlap or similar.

The concept of key frames is central to seeking, and key frames will
be the targets of the seek-to-time operation.

Representation of Time

NUT represents all timestamps as exact integer multiples of a rational
number "time base". Files can have multiple time bases in order to
accurately represent the time units of each stream. The set of
available time bases is defined in the main header, while each stream
header indicates which time base the corresponding stream will use.

Effective use of time bases both allows for compact representation of
timestamps, minimizing overhead, and enriches the information
contained in the file. For example, a process interpreting a NUT file
with a video time base of 1/25 second knows it can convert the video
to fixed-framerate 25 fps content or present it faithfully on a PAL
display.

The scope of the media contained in a NUT file is a single contiguous
interval of time. Timestamps need not begin at zero, but they may not
jump backwards. Any large forward jump in timestamps must be
interpreted as a frame with a large presentation interval, not as a
discontinuity in the presentation. Without conditions such as these,
NUT could not guarantee correct seeking in efficient time bounds.

Aside from provisions made for out-of-order decoding, all frames in a
NUT file must be strictly ordered by timestamp. For the purpose of
sorting frames, all timestamps are treated as rational numbers derived
from a coded integer timestamp and the associated time base, and
compared under the standard ordering on the rational numbers.

Frame Coding

Each frame begins with a "framecode", a single byte which indexes a
table in the main header. This table can associate properties such as
stream id, size, relative timestamp, keyframe flag, etc. with the
frame that follows, or allow the values to be explicitly coded
following the framecode byte. By careful construction of the framecode
table in the main header, an average overhead of significantly less
than 2 bytes per frame can be achieved for single-stream files at low
bitrates.

Syncpoints

...