[FFmpeg-devel] Internal handling of subtitles in ffmpeg

Reimar Döffinger Reimar.Doeffinger
Fri Jan 2 00:08:00 CET 2009

On Thu, Jan 01, 2009 at 10:36:36PM +0100, Michael Niedermayer wrote:
> The advantage is the same that there is for using AVCodecContext instead of
> using a char* of an mpeg4 header to represent the related info.
> it would very well be possible to make our mpeg2 decoder convert width/height
> and so on into a mpeg4 bitstream and export that ...
> It's just that working with int, float, ... is easier than parsing bitstreams
> or strings

But that is exactly the point! Width and height for video are always
simple ints, but if they could be arbitrary formulas, wouldn't you
just be inventing yet another encoding for the formulas?
Would you accept a patch that would allow the width and height fields of
AVCodecContext to contain arbitrary formulas to calculate the width and
height, thus forcing anyone properly supporting the FFmpeg API to
support it even if they will never need it? (I guess you would not do
something as insane for subtitles, but I want to make sure you
understand my issues.)

> Besides if some information from mpeg2 has no place in mpeg4, its a lot easier
> to add the extra field or value to a struct than to find some way to squeeze
> it in a string or bitstream.

What if MPEGn used XML structs with user-defined elements that only very
few people need? Would exporting them as struct fields still be the best
way when it muddles the API, instead of just letting the people who want
the really difficult things bear a bit more pain?

> > What meaning does it have if two text parts (e.g. words) are in a different
> > AVSubtitleRect? What if they are in the same one? That is unclear to me.
> If a subtitle stores both words together in a string both would be in one
> AVSubtitleRect.
> If the subtitle stored both separately each with some position (like left and
> right middle, right aligned with margin ... or some x/y coords) then they
> would be in separate AVSubtitleRects.

I guess that would include a curve formula if the text should follow
some curve (not sure how ASS handles that case, they might just split it
letter by letter).

> As an analogy it might be that a AVSubtitleRect is a paragraph or similar
> block level element in html.

A paragraph has semantics, i.e. it changes how things will be rendered
in most cases. As I understand your description, one AVSubtitleRect with
a whole sentence that needs line breaking would be equally valid as 100
AVSubtitleRect with one letter each.

> > And will you require a width/height for AVSubtitleRect or not?
> if there's a w/h stored easily accessible it should be set in AVSubtitleRect
> by the decoder.
> Knowing the w/h is also probably useful to position AVSubtitleRects so they
> do not overlap.

Will that allow a non-confusing way to explain the semantics? It should
not end up like: well, we have here some values. They might mean
something. Or we might just have guessed them. Or they might be
completely wrong. Figure it out yourself, we will be happy to see every
ffmpeg user render the result differently ;-)

> > Generating those might be a lot of wasted effort for formats that are
> > similar (the same actually applies to X/Y if they are some sin(time) +
> > ... I don't know if any subtitle formats actually do this, but they
> > might specify the position in a way that allows interpolation for frames
> > generated during deinterlacing, would you want AVSubtitleRect to be able
> > to handle that as well?).
> hmm, I think position interpolation should be supported somehow, but I am not
> sure how this should be done best ...

Well, you know, I am trying to convince you to say: hell, let's do the
simple stuff simple and proper and leave the rest to a complicated

> > Ok, I'll try to say it in a different way: I currently feel that by
> > extending AVSubtitleRect that way you will lose simplicity without
> > gaining anything, and that worries me (you know, that "make simple things
> > really simple and hard things possible" thing).
> > I think a good criterion for a good API here to me would be that you'd need
> > maybe 20 lines of code to make ffplay just display all text subtitles on the
> > console during playback, 
> For a single char * with ass inside you need X lines to print them,
> for AVSubtitleRect you need
> for(i=0; i<subtitle.num_rects; i++){
>     The same code used to print ASS somehow to the console.
> }
> thats 2 lines more, seems reasonable to me.

Well, firstly I did change my ideas somewhat, and was advocating to
always support converting to a "trivial" text format which allows only
ordinary text + position.
But, to "refute" your example, that only works if you can force the
subtitle decoder to output all AVSubtitleRects as ASS, otherwise you must
add an if here to filter out the bitmaps.
Then, depending on how you actually do the coordinate stuff, you might
have to call a function to convert the coordinate formulas to actual
numbers. Or will that information maybe already be encoded in the ASS?
Or maybe the offsets in the ASS string must be added to those of the
AVSubtitleRect? Regardless of how you do it, it is one more question any
user of the API must think about (or they will do as I do often enough:
just do it the simple way, and if it's wrong it's the API designer's fault
for over-engineering).
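
A minimal sketch of the filtering concern, assuming hypothetical stand-in
types (sketch_rect, sketch_subtitle) rather than the real FFmpeg
structures, and assuming positions have already been resolved to plain
numbers. It only illustrates where the extra "if" lands in the loop:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration, NOT the real FFmpeg structs. */
enum sketch_rect_type { SKETCH_BITMAP, SKETCH_TEXT };

struct sketch_rect {
    enum sketch_rect_type type;
    int x, y;          /* assumed: position already resolved to numbers */
    const char *text;  /* plain text, only meaningful for SKETCH_TEXT */
};

struct sketch_subtitle {
    unsigned num_rects;
    const struct sketch_rect *rects;
};

/* Print every text rect to `out`, skipping bitmaps.
 * Returns the number of rects actually printed. */
static int print_text_rects(const struct sketch_subtitle *sub, FILE *out)
{
    int printed = 0;

    for (unsigned i = 0; i < sub->num_rects; i++) {
        const struct sketch_rect *r = &sub->rects[i];

        /* The extra "if": a bitmap cannot be shown on a console. */
        if (r->type != SKETCH_TEXT || !r->text)
            continue;

        fprintf(out, "[%d,%d] %s\n", r->x, r->y, r->text);
        printed++;
    }
    return printed;
}
```

If the position were a formula, or if offsets were also embedded in the
text itself, the fprintf line would need a conversion step first, which is
exactly the extra question every API user would have to answer.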

> > and maybe 50 more to display them at somewhat
> > accurate positions (including setup work for ncurses or some such, and
> > those numbers actually feel a bit high to me).
> Actually i suspect that AVSubtitleRect will need fewer lines than a
> single ass string.
> Here is what it may look like:
> for(i=0; i<subtitle.num_rects; i++){
>     av_subtitle_get_position(subtitle.rect[i], movie_width, movie_height, pts, 80, 25, &x, &y);
>     <the code to print a ass fragment at x,y>
> }
> if OTOH you have a single char* then at the least you first need to split it
> into separately positioned parts.

Well, to be honest, for such a simple use case any representation that
contains any ASS is too complex, you can't do any proper "collision
detection" or anything once anything more than AVSubtitleRect + plain
text is involved.
Thing is, I do see a use case for AVSubtitleRect + something that is in
any way easy to render (ASS is not, not even if you just filter out and
ignore the metadata).
I also see a use case for ASS (not for me personally, but there are some
"crazy people" who insist).
I am not so sure I truly see a use case for AVSubtitleRect + text +
graphics, though it may be useful for watching a movie with subtitles in
two languages when one is not on the DVD.
I think I do not see any use case whatsoever for AVSubtitleRect + ASS;
it seems to me like arbitrarily splitting up the ASS string for no good
reason. Maybe it could be extended to at least allow easy merging of two
ASS files, but I'd expect even that to fail for corner cases at least.
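
For the simple case mentioned above (a rect plus plain text), the
"collision detection" could be as small as the following sketch. It
assumes a hypothetical sketch_box type whose position and size are
already known integers; none of these names exist in FFmpeg.

```c
#include <stdbool.h>

/* Hypothetical box with a resolved position and size (not an FFmpeg type). */
struct sketch_box {
    int x, y;  /* top-left corner */
    int w, h;  /* width and height */
};

/* True if the two boxes share any area; boxes that merely touch
 * along an edge do not count as overlapping. */
static bool boxes_overlap(const struct sketch_box *a,
                          const struct sketch_box *b)
{
    return a->x < b->x + b->w && b->x < a->x + a->w &&
           a->y < b->y + b->h && b->y < a->y + a->h;
}

/* If `moving` collides with `fixed`, move it directly below `fixed`. */
static void push_below(const struct sketch_box *fixed,
                       struct sketch_box *moving)
{
    if (boxes_overlap(fixed, moving))
        moving->y = fixed->y + fixed->h;
}
```

This only works because w and h are plain numbers with a fixed meaning;
once the rect content is an ASS fragment that can reposition itself, the
box no longer says where the text will actually end up.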

Reimar Döffinger
