[MPlayer-G2-dev] more on g2 & video filters

Sun Sep 28 02:02:04 CEST 2003

Here's an updated and more thorough g2 video layer design.

For consistency and for the sake of having a name to talk about the
system, I'll call it "video pipeline" (vp). This encompasses filters
as well as decoders, encoders, vo's, and the glue layer between them
all.

First, the structure of connection between the pieces. Nodes of the
video pipeline can (in theory) be connected in many different ways. A
simple implementation for the time being could be entirely linear like
G1's filter chain, but the design should not require this. Thus, we'll
talk about the video pipeline as a collection of nodes and links,
where a link consists of a source, a destination, and a link structure
which ties the two together and assists in managing buffers.

----------------------------------------------------------------------

The first topic, and probably the most important, is buffer
management. G1 did a remarkably good job compared to any other player
at the time, but it still has some big limitations. In particular:

1. There's no clear rule on what a filter is allowed to do with an
   image after calling vf_put_image on it. Can it still read the
   contents? Can it write more? Can it call vf_put_image more than
   once on the same mpi without calling vf_get_image again? In general
   the answer is probably no, but several filters (including ones I
   wrote) do stuff like this, and it's just not clear what's ok and
   what's not.

2. A filter that gives out DR buffers from its get_image has no way of
   knowing when the caller is done with those buffers. In theory,
   put_image should be a good indication (but see (1) above). Even
   worse, if the previous filter/dec_video drops frames, then
   put_image will never be called.

3. A video decoder (or filter) has no way of telling the system how
   long it needs its buffers preserved (for prediction or whatever).
   This works ok with standard IP[B] type codecs, but with more
   complicated prediction models it might totally break.

So here's the new buffer model, based on get_buffer/release_buffer and
reference counts:

When a node of the video pipeline wants a buffer to return as output
from its pull_frame (see next section below), it has three options for
the buffer type: export, indirect, and direct. The first two are
always available, but direct is only available if the destination's
get_buffer function is willing to allocate a buffer with the desired
format and flags (similar to G1). All buffers are associated with the
link structure.

Export -- almost exactly like in G1, with a few improvements. In the
export case, the source filter is considered the owner of the buffer.
It will be notified when the buffer's reference count reaches zero, so
that it can in turn release any buffer it might be re-exporting (for
example, the source buffer of which vf_crop is exporting a cropped
version).

Direct -- destination sets up a buffer structure so that source can
render directly into it. In this case, the destination is considered
the owner of the buffer, and is notified when the buffer's reference
count reaches zero, so that it can in turn release any buffer it might
be using (for example, the full destination buffer, a small part of
which vf_expand is making available to the source).

Indirect -- allocated and managed by the link layer.
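
To make the ownership rule concrete, here is a minimal sketch of the
reference counting, assuming a hypothetical vp_buffer layout (the
fields on_zero, priv and released are made up for illustration; only
vp_lock_buffer and vp_release_buffer are names from this proposal).
It shows an export-type buffer whose owner releases the buffer it was
re-exporting once the count reaches zero:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: the proposal names the buffer types, not the fields. */
enum vp_buf_type { VP_BUF_EXPORT, VP_BUF_INDIRECT, VP_BUF_DIRECT };

typedef struct vp_buffer vp_buffer;
struct vp_buffer {
    enum vp_buf_type type;
    int refcount;
    void (*on_zero)(vp_buffer *buf); /* owner notification, may be NULL */
    void *priv;                      /* owner's private data */
    int released;                    /* demo only: set when count hits zero */
};

/* Take an extra reference. */
void vp_lock_buffer(vp_buffer *buf)
{
    buf->refcount++;
}

/* Drop one reference; notify the owner when the last one is gone. */
void vp_release_buffer(vp_buffer *buf)
{
    assert(buf->refcount > 0);
    if (--buf->refcount == 0) {
        buf->released = 1;
        if (buf->on_zero)
            buf->on_zero(buf);
    }
}

/* Owner callback for a crop-style export: the exported buffer only
 * points into the source buffer, so release the source now. */
void crop_on_zero(vp_buffer *buf)
{
    vp_release_buffer((vp_buffer *)buf->priv);
}
```

For a direct-type buffer the same callback slot would belong to the
destination instead of the source.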

The new video pipeline design also has certain flags analogous to the
old image types and flags in G1:

Readable -- the buffer cannot reside in write-only memory, slow video
memory, or anywhere that makes reading it slow, difficult, or
restricted. This should always be set correctly when requesting a
buffer, even though it generally applies only to direct-type buffers.

Preserve -- source can rely on destination not to clobber the buffer
as long as it is valid. If destination is the owner of the buffer
(direct-type), then it is still of course free to clobber the buffer
after the reference count reaches zero.

Reusable -- source is free to continue writing to buffer even after
passing it on to destination (assuming it maintains a reference count)
and to pass the same buffer to destination multiple times if desired.
Note that as long as the reusable flag is NOT set, destination can
rely on source not to clobber the buffer after source returns (the
analogue of the preserve flag, in the reverse direction).

One should be particularly aware that the preserve flag applies to ALL
buffer types, not just direct and indirect. That means that, unless
source sets the preserve flag on exported buffers, destination is free
to clobber them. (One example where this is useful is for rendering
OSD onto the exported buffer of a filter before copying to video
memory, instead of having to alpha-blend OSD in video memory.)
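
The flag rules above can be written down as two small predicates.
This is only a sketch: the bit values and both function names are
assumptions, since nothing here fixes how the flags are encoded.

```c
#include <assert.h>

/* Hypothetical bit values; only the flag names come from the proposal. */
enum {
    VP_FLAG_READABLE = 1 << 0,
    VP_FLAG_PRESERVE = 1 << 1,
    VP_FLAG_REUSABLE = 1 << 2
};

/* May destination overwrite the buffer contents right now? */
int vp_dest_may_clobber(unsigned flags, int refcount, int dest_is_owner)
{
    assert(refcount >= 0);
    if (!(flags & VP_FLAG_PRESERVE))
        return 1;   /* no promise was made, whatever the buffer type */
    /* preserved: only the owner may touch it, and only once unused */
    return dest_is_owner && refcount == 0;
}

/* May source keep writing after passing the buffer to destination? */
int vp_source_may_rewrite(unsigned flags, int source_holds_ref)
{
    return (flags & VP_FLAG_REUSABLE) && source_holds_ref;
}
```

The first predicate covers the OSD case: an exported buffer without
the preserve flag can be drawn on by the destination.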

Now an overview of how to convert old G1-model filters/codecs to the
new model:

IP[B] codecs -- call vp_get_buffer with readable+preserve flags for I
and P frames, no flags for B frames. Increment reference count for I/P
frames (vp_lock_buffer) before returning, then release them
(vp_release_buffer) when they're no longer needed for prediction. For
standard IP model this just involves keeping one buffer pointer in the
codec's private data area (the previous I/P frame).

Filters and codecs that used the "static" buffer type in G1 -- on the
first frame, call vp_get_buffer with preserve+reusable (and optionally
readable) flags to get a buffer, then establish a lock
(vp_lock_buffer) before returning the image to the caller so that the
reference count does not reach zero. When rendering subsequent frames,
don't call vp_get_buffer again; just increment the reference count
(vp_lock_buffer) before returning so that destination has an extra
reference to release without the count reaching zero.

I-only codecs and filters that use temp buffers -- call vp_get_buffer
with no flags and return the buffer after drawing into it.

This pretty much covers the G1 cases. Of course there are many more
possibilities in G2 which weren't allowed in G1 and thus don't
correspond to any old buffer model.
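
As an illustration of the IP[B] recipe, here is a rough sketch of a
decoder that keeps one reference frame. The link layer is replaced by
a malloc-based stand-in, vp_get_buffer takes only flags here (the
real one would also take the link), and codec_priv, decode_frame and
live_buffers are hypothetical names:

```c
#include <assert.h>
#include <stdlib.h>

enum { VP_FLAG_READABLE = 1, VP_FLAG_PRESERVE = 2 }; /* assumed values */

typedef struct vp_buffer { int refcount; unsigned flags; } vp_buffer;

static int live_buffers; /* demo only: buffers not yet freed */

/* Stand-in for the link layer: returns a buffer with one reference. */
vp_buffer *vp_get_buffer(unsigned flags)
{
    vp_buffer *b = malloc(sizeof *b);
    if (!b)
        return NULL;
    b->refcount = 1;
    b->flags = flags;
    live_buffers++;
    return b;
}

void vp_lock_buffer(vp_buffer *b) { b->refcount++; }

void vp_release_buffer(vp_buffer *b)
{
    assert(b->refcount > 0);
    if (--b->refcount == 0) {
        live_buffers--;
        free(b);
    }
}

typedef struct { vp_buffer *prev_ref; } codec_priv; /* previous I/P frame */

vp_buffer *decode_frame(codec_priv *p, char frame_type)
{
    int is_ref = frame_type == 'I' || frame_type == 'P';
    vp_buffer *out =
        vp_get_buffer(is_ref ? VP_FLAG_READABLE | VP_FLAG_PRESERVE : 0);
    if (!out)
        return NULL;
    /* ... decode into out, predicting from p->prev_ref ... */
    if (is_ref) {
        vp_lock_buffer(out);                  /* codec's own reference */
        if (p->prev_ref)
            vp_release_buffer(p->prev_ref);   /* old reference frame done */
        p->prev_ref = out;
    }
    return out;
}
```

The caller releases each returned buffer as usual; an I/P frame stays
alive until the codec drops its own reference on the next I/P frame.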

----------------------------------------------------------------------

The second topic is flow of execution.

From a final destination (vo/ve), the pipeline is called in reverse
order, using a "pull" model for obtaining frames. The main relevant
function is vp_pull_frame, which takes as its argument a pointer to a
link structure, and calls the source's pull_frame function asking for
a frame for destination.

A filter/codec's pull_frame, in turn, is responsible for obtaining a
buffer (via vp_get_buffer), filling it with the picture, and returning
it to the caller.

The reader would be well advised to study the following example:

Filter chain:

VD --L1--> Filter A --L2--> Filter B --L3--> VO

Let's say filter A is crop, exporting image, and B is scale, direct
rendering into VO's video memory. L1,L2,L3 are the link structures.

Flow of execution:

vp_pull_frame(L3)
  B->pull_frame(L3)
    sbuf=vp_pull_frame(L2)
      A->pull_frame(L2)
        sbuf=vp_pull_frame(L1)
          VD->pull_frame(L1)
            figure out video format, dimensions, etc.
            A->query_format [*1]
              B->query_format
            A->config
              B->config
                VO->query_format [*2]
                VO->config
            dbuf=vp_get_buffer(L1)
              A->get_buffer(L1)
                dr fails, return NULL
              setup and return indirect image
            VD decodes video into dbuf
            return dbuf
        dbuf=vp_get_buffer(L2,export)
        setup export strides/pointers
        dbuf->priv->source=sbuf [*3]
        return dbuf
    dbuf=vp_get_buffer(L3)
      VO->get_buffer(L3)
        setup dr buffer and return it
    scale image from sbuf to dbuf
    vp_release_buffer(sbuf)
      A->release_buffer
        vp_release_buffer(...->priv->source)
    return dbuf

Notes:

[*1] query_format is called to determine which formats the destination
supports natively. If no acceptable native formats are found, config
will be called with whatever format source prefers to use, and
destination will be responsible for converting images after receiving
them from vp_pull_frame.

[*2] Here filter B waits to query which formats the VO supports until
it is configured. Since scale's input and output formats are
independent of one another, there's no need to know during scale's
query_format which formats the VO supports.

[*3] Notice here that filter A does not release the source buffer it
obtained from L1 at this time. Instead it stores it in the private
data area for its exported destination buffer, so that it can release
the source after (and only after) that buffer is no longer in use.
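
A stripped-down sketch of the pull model may help here. It leaves out
format negotiation and the three buffer types, and everything except
vp_pull_frame and vp_release_buffer is a made-up name: each node owns
a single one-"pixel" buffer, the decoder counts upward, and the
filter doubles what it pulls.

```c
#include <assert.h>
#include <stddef.h>

typedef struct vp_link vp_link;
typedef struct vp_node vp_node;

typedef struct { int refcount; int pixel; } vp_buffer; /* toy frame */

struct vp_node {
    vp_buffer *(*pull_frame)(vp_link *out);
    vp_link *in;        /* upstream link, NULL for the decoder */
    vp_buffer storage;  /* stand-in for real buffer allocation */
    int counter;        /* decoder state */
};

struct vp_link { vp_node *src, *dst; };

/* Ask the link's source for the next frame. */
vp_buffer *vp_pull_frame(vp_link *l) { return l->src->pull_frame(l); }

void vp_release_buffer(vp_buffer *b)
{
    assert(b->refcount > 0);
    b->refcount--;
}

/* Hand out the node's only buffer; it must not be in use. */
static vp_buffer *node_get_buffer(vp_node *n)
{
    assert(n->storage.refcount == 0);
    n->storage.refcount = 1;
    return &n->storage;
}

vp_buffer *decoder_pull(vp_link *out)
{
    vp_node *self = out->src;
    vp_buffer *dbuf = node_get_buffer(self);
    dbuf->pixel = ++self->counter;   /* "decode" a frame */
    return dbuf;
}

vp_buffer *double_pull(vp_link *out)
{
    vp_node *self = out->src;
    vp_buffer *sbuf = vp_pull_frame(self->in); /* recurse upstream */
    if (!sbuf)
        return NULL;
    vp_buffer *dbuf = node_get_buffer(self);
    dbuf->pixel = sbuf->pixel * 2;             /* "filter" the frame */
    vp_release_buffer(sbuf);                   /* done with the input */
    return dbuf;
}
```

An exporting filter like crop would differ only in the last step: it
would keep sbuf and release it later, when its own output buffer is
released.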

----------------------------------------------------------------------

The next (and maybe most controversial) topic: automatic format
conversion!

Believe it or not, it is possible with the above design to dynamically
insert a filter between source and destination during source's
pull_frame. It only requires very minor hacks in vp_pull_frame. But
instead I would like to propose doing away with auto-insertion of
scale filter for format conversion, and instead require filters/vo to
accept any image format.

Then, we introduce a new function to the vp api, vp_convert. I'll
explain it with pseudocode:

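vp_buffer *vp_convert(vp_buffer *in) {
	/* ask the link for a buffer; may be a DR buffer from the caller */
	vp_buffer *out = vp_get_buffer(in->link);
	swScaler(in, out);	/* pseudocode: convert via swscale */
	vp_release_buffer(in);
	return out;
}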

Note that each buffer stores which link it's associated with (in->link
here). Of course vp_convert would also have to keep the sws context
somewhere; in->link would be an appropriate place. Also note that this
will direct-render the conversion if the calling filter supports
direct rendering. :)

Now let's see how this affects format negotiation...

G1's model was to have query_format only return true if this filter
and ALL the subsequent filters/VO support the requested format. Since
G1 could really only auto-insert scale at a few places in the chain
(beginning or end...?) this made sense. But a side effect of this
behavior is that conversion tends to get forced as early as possible
in the filter chain.

Consider the example:

RGB codec ----> crop ----> YUV VO

If crop's query_format returns false because VO does not support RGB,
then RGB->YUV conversion will happen before cropping. But this is
stupid and wastes cpu time.

Now suppose that we're using the above model, with no auto-insertion
of filters. The RGB codec sees that crop's query_format returns false
for RGB, but since it can't output anything except RGB, it returns an
RGB image anyway. Now, crop gets the RGB image. And crop is free to
crop the image in RGB space, since it knows how to do that, totally
oblivious to what the VO wants. Then the VO gets an RGB image, and has
to call vp_convert, which will direct-render the converted image into
video memory if possible.

On the other hand, vf_expand might want to be more careful of what
formats its destination filter supports natively (using query_format)
so it doesn't force the destination to convert lots of useless black
bars.
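
To put a number on the crop example, here is a toy model that just
counts how many pixels get converted in each order. It is only a
sketch: buffers are passed by value, vp_convert takes a target format
(unlike the pseudocode above), and crop_filter, vo_accept and
converted_pixels are hypothetical names.

```c
#include <assert.h>

enum vp_format { VP_FMT_RGB, VP_FMT_YUV };

typedef struct { enum vp_format fmt; int w, h; } vp_buffer; /* toy frame */

static long converted_pixels; /* demo only: work done by vp_convert */

/* Toy conversion: only accounts for the pixels it would touch. */
vp_buffer vp_convert(vp_buffer in, enum vp_format to)
{
    converted_pixels += (long)in.w * in.h;
    in.fmt = to;
    return in;
}

/* Crop works in whatever format it is given. */
vp_buffer crop_filter(vp_buffer in, int w, int h)
{
    assert(w <= in.w && h <= in.h);
    in.w = w;
    in.h = h;
    return in;
}

/* A YUV-only VO converts on arrival, and only if it has to. */
vp_buffer vo_accept(vp_buffer in)
{
    if (in.fmt != VP_FMT_YUV)
        in = vp_convert(in, VP_FMT_YUV);
    return in;
}
```

With a 720x576 RGB source cropped to 640x480, converting before the
crop touches 414720 pixels, while letting the VO convert after the
crop touches 307200.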

Finally, one other benefit of converting as late as possible is that
a filter which drops frames might be able to determine it wants to
drop the next frame before calling vp_convert. This could save a lot
of cpu time. But the following plan for frame dropping makes the
situation even better:

----------------------------------------------------------------------

What happens if in addition to vp_pull_frame, we also have
vp_skip_frame, which notifies the source filter that the destination
wants to "run the pipeline" for a frame, but throw away the output?

The idea is that this call could propagate back as far as possible
through the filter chain. It allows us to have the same behavior as
-framedrop in G1, but also much better. If a filter knows it's going
to drop the next frame before even looking at it, it can use
vp_skip_frame instead of vp_pull_frame, and earlier filters can skip
processing the frame altogether. BUT, if there are filters in the
chain which cannot deal with missing frames (for example, inverse
telecine), they're not obligated to propagate the vp_skip_frame call,
and they can implement their skip_frame with the same function as
pull_frame.

If vp_skip_frame propagates all the way back to the decoder, and the
next frame is a B frame (or the file is I-only), then the decoder can
of course skip decoding entirely!
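
Here is one way the propagation could look, as a sketch only. Links
are left out (the calls take the source node directly, unlike the
real vp_pull_frame), and the node layout and the dec_/pass_/stateful_
functions are invented for the example:

```c
#include <assert.h>
#include <stddef.h>

typedef struct vp_node vp_node;
typedef struct { int refcount; } vp_buffer;

struct vp_node {
    vp_buffer *(*pull_frame)(vp_node *self);
    void (*skip_frame)(vp_node *self);
    vp_node *prev;          /* upstream node, NULL for the decoder */
    int decoded, skipped;   /* demo only: decoder statistics */
    vp_buffer buf;
};

vp_buffer *vp_pull_frame(vp_node *src) { return src->pull_frame(src); }
void vp_skip_frame(vp_node *src) { src->skip_frame(src); }

void vp_release_buffer(vp_buffer *b)
{
    assert(b->refcount > 0);
    b->refcount--;
}

/* Decoder: a skipped frame costs no decoding at all. */
vp_buffer *dec_pull(vp_node *self)
{
    self->decoded++;
    self->buf.refcount = 1;
    return &self->buf;
}
void dec_skip(vp_node *self) { self->skipped++; }

/* Stateless filter: simply forwards the skip upstream. */
vp_buffer *pass_pull(vp_node *self) { return vp_pull_frame(self->prev); }
void pass_skip(vp_node *self) { vp_skip_frame(self->prev); }

/* Filter that must see every frame (inverse telecine, say):
 * skipping is a normal pull whose result is thrown away. */
void stateful_skip(vp_node *self)
{
    vp_buffer *b = self->pull_frame(self);
    if (b)
        vp_release_buffer(b);
}
```

The decoder only gets to skip work when every filter between it and
the caller forwards the skip.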

As for filters which voluntarily drop frames (vf_decimate)...
pull_frame is required to return a valid image unless:

1. A fatal error occurred.
2. The end of the movie has been reached.

So, if a filter wants to drop some frames, that's ok, but it can't
just return NULL from pull_frame. Instead it could do something like
the following:

sbuf=vp_pull_frame(prev);
if (skip) {
    vp_release_buffer(sbuf);
    sbuf=vp_pull_frame(prev);
}

Or, if it knows which frame it wants to skip without looking at the
image contents first, it could call vp_skip_frame instead to save some
cpu time!

One more thing to keep in mind: PTS in G2 propagates through the
video pipeline! So, if a filter drops a frame, it has to add the
relative_pts of that frame to the relative_pts of the next non-skipped
frame before returning it! Otherwise you'll ruin A/V sync!
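
A sketch of the PTS bookkeeping, assuming relative_pts lives in the
buffer and is a time delta in seconds (neither is specified here).
The source is a plain array, and the drop field stands in for
whatever test the filter really uses; vp_source and
decimate_pull_frame are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int refcount;
    double relative_pts; /* time since previous frame (assumed: seconds) */
    int drop;            /* stand-in for the filter's drop decision */
} vp_buffer;

typedef struct { vp_buffer *frames; int count, pos; } vp_source;

/* Toy source: hands out array entries, NULL at end of movie. */
vp_buffer *vp_pull_frame(vp_source *s)
{
    if (s->pos >= s->count)
        return NULL;
    vp_buffer *b = &s->frames[s->pos++];
    b->refcount = 1;
    return b;
}

void vp_release_buffer(vp_buffer *b)
{
    assert(b->refcount > 0);
    b->refcount--;
}

/* Drop unwanted frames, carrying their duration over to the next
 * frame that is actually returned so A/V sync is kept. */
vp_buffer *decimate_pull_frame(vp_source *prev)
{
    double carried = 0.0;
    vp_buffer *sbuf = vp_pull_frame(prev);
    while (sbuf && sbuf->drop) {
        carried += sbuf->relative_pts;
        vp_release_buffer(sbuf);
        sbuf = vp_pull_frame(prev);
    }
    if (sbuf)
        sbuf->relative_pts += carried;
    return sbuf; /* NULL only at end of movie */
}
```

Note the loop: unlike the two-line version above, it keeps pulling
until it finds a frame it wants to keep or the source runs out.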

----------------------------------------------------------------------

OK, I think that's about all for now. I hope this serves as a (mostly)
complete G2 video pipeline design document. If there are no
objections, I may start coding this within the next few weeks. Speak
up now if you want anything changed!!

Rich