[FFmpeg-devel] Internal handling of subtitles in ffmpeg

Michael Niedermayer michaelni
Fri Jan 2 02:20:43 CET 2009

On Fri, Jan 02, 2009 at 12:08:00AM +0100, Reimar Döffinger wrote:
> On Thu, Jan 01, 2009 at 10:36:36PM +0100, Michael Niedermayer wrote:
> > The advantage is the same that there is for using AVCodecContext instead of
> > using a char* of an mpeg4 header to represent the related info.
> > It would very well be possible to make our mpeg2 decoder convert width/height
> > and so on into an mpeg4 bitstream and export that ...
> > It's just that working with int, float, ... is easier than parsing bitstreams
> > or strings.
> But that is exactly the point! Width and height for video are always
> simple ints, but once they could be arbitrary formulas wouldn't all you
> do just be inventing yet another encoding for the formulas?

I don't understand what you are trying to say.
I was arguing for exporting values through a struct instead of through a
char* that uses a complex encoding.
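
A minimal sketch of the struct-versus-string point, under stated assumptions:
VideoInfo is a hypothetical stand-in for the width/height fields of
AVCodecContext, and the "WxH" string encoding is invented purely for
illustration. With the struct the consumer reads the fields directly; with
the string it has to parse and validate first.

```c
#include <stdio.h>

/* Hypothetical stand-in for the width/height fields of AVCodecContext. */
typedef struct VideoInfo {
    int width;
    int height;
} VideoInfo;

/* Struct export: the consumer just reads the fields. */
static int pixel_count(const VideoInfo *v)
{
    return v->width * v->height;
}

/*
 * String export: the consumer must first parse an (invented) "WxH"
 * encoding and reject malformed input.
 * Returns 0 on success, -1 on error.
 */
static int parse_video_info(const char *s, VideoInfo *out)
{
    int w, h;
    char trailing;

    /* Exactly two conversions must succeed; a third means trailing junk. */
    if (sscanf(s, "%dx%d%c", &w, &h, &trailing) != 2)
        return -1;
    if (w <= 0 || h <= 0)
        return -1;
    out->width  = w;
    out->height = h;
    return 0;
}
```

Every consumer of the string form needs its own copy of the parsing and
error handling, which is the cost being argued against here.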

> Would you accept a patch that would allow the width and height fields of
> AVCodecContext contain arbitrary formulas to calculate the width and
> height, 

No, because there are no videos using such a feature.

> thus forcing anyone properly supporting the FFmpeg API to
> support it even if they will never need it (I guess you would not do
> something as insane for subtitles, but I want to make sure you
> understand my issues).

> > Besides, if some information from mpeg2 has no place in mpeg4, it's a lot easier
> > to add the extra field or value to a struct than to find some way to squeeze
> > it in a string or bitstream.
> What if MPEGn used XML structs with user defined elements that only very
> few people need? Would it still be the best way to export it that way
> when it muddles the API instead of just letting the people who want the
> really difficult things bear a bit more pain?

If there were XML in mpeg, I would see no problem exporting it in a
new and separate field. Users wanting it could get it from there, others
could ignore it. I would not convert mpeg2 XML to mpeg4 XML if these existed.

> > > What meaning does it have if two text parts (e.g. words) are in a different
> > > AVSubtitleRect? What if they are in the same one? That is unclear to me.
> > 
> > If a subtitle stores both words together in a string both would be in one
> > AVSubtitleRect.
> > If the subtitle stored both separately, each with some position (like left and
> > right middle, right aligned with margin ... or some x/y coords), then they
> > would be in separate AVSubtitleRects.
> I guess that would include a curve formula if the text should follow
> some curve (not sure how ASS handles that case, they might just split it
> letter by letter).

> > As an analogy, it might be that an AVSubtitleRect is a paragraph or similar
> > block level element in html.
> A paragraph has semantics, i.e. it changes how things will be rendered
> in most cases. As I understand your description, one AVSubtitleRect with
> a whole sentence that needs line breaking would be equally valid as 100
> AVSubtitleRect with one letter each.

Well, no, there are two cases.
A sentence that needs line breaking, and is broken at the right side of
the display, must be in a single AVSubtitleRect IMHO.
OTOH 100 letters with hardcoded x/y positions can be in separate
AVSubtitleRects, but that is not a line-broken paragraph; it is rather 100
separate letters. Change the font size, display size or anything else and the
100 will turn into a mess. In that sense I don't think the case of 100
separately coded letters is realistic.

> > > And will you require a width/height for AVSubtitleRect or not?
> > 
> > If there is a w/h stored in an easily accessible way, it should be set in
> > AVSubtitleRect by the decoder.
> > Knowing the w/h is also probably useful for positioning AVSubtitleRects so
> > they do not overlap.
> Will that allow a non-confusing way to explain the semantics? It should
> not end up like: well, we have here some values. They might mean
> something. Or we might just have guessed them. Or they might be
> completely wrong. Figure it out yourself, we will be happy to see every
> ffmpeg user render the result differently ;-)

It's not hard to include a flag that indicates whether a value is exact or
guessed, assuming of course that there are non-exact values anywhere at all.
Also, if you want this for any other values in the lav* API, it's VERY easy
to do; it's just that no one seems to have asked for it yet ...
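
As a rough sketch of what such a flag could look like (the struct, the flag
name and the helper below are all hypothetical, not part of any existing
lav* API): the decoder marks the size as guessed, and a renderer that cares
can fall back to its own measurement.

```c
/* Hypothetical flag: set by a decoder that only estimated w/h. */
#define SUB_RECT_FLAG_SIZE_GUESSED 0x0001

/* Hypothetical, simplified rectangle carrying only what this sketch needs. */
typedef struct SubRectSize {
    int w, h;   /* size as reported by the decoder */
    int flags;  /* bitmask of SUB_RECT_FLAG_* */
} SubRectSize;

/*
 * Pick the height to use for layout: trust the decoder when the value is
 * exact, otherwise use the renderer's own measurement.
 */
static int layout_height(const SubRectSize *r, int measured_h)
{
    if (r->flags & SUB_RECT_FLAG_SIZE_GUESSED)
        return measured_h;
    return r->h;
}
```

Callers that do not care about the distinction can ignore the flag and read
w/h as before.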

> > > Generating those might be a lot of wasted effort for formats that are
> > > similar (the same actually applies to X/Y if they are some sin(time) +
> > > ... I don't know if any subtitle formats actually do this, but they
> > > might specify the position in a way that allows interpolation for frames
> > > generated during deinterlacing, would you want AVSubtitleRect to be able
> > > to handle that as well?).
> > 
> > Hmm, I think position interpolation should be supported somehow, but I am not
> > sure how this should best be done ...
> Well, you know, I am trying to convince you to say: hell, let's do the
> simple stuff simple and proper and leave the rest to a complicated
> extension.

I am still waiting for you to explain your simple & proper solution. It seems
that what you suggest has changed somewhat, so I am not entirely sure whether
you still argue in favor of replacing decoder->encoder by bitstream filters, or
what the intermediate format is supposed to be. Originally you suggested ASS,
but it seems you don't suggest this anymore?

> > > Ok, I'll try it to say it in a different way: I currently feel that by
> > > extending AVSubtitleRect that way you will lose simplicity without
> > > gaining anything, and that worries me (you know, that "make simple things
> > > really simple and hard things possible" thing).
> > 
> > > I think a good criterion for a good API here to me would be that you'd need
> > > maybe 20 lines of code to make ffplay just display all text subtitles on the
> > > console during playback, 
> > 
> > For a single char * with ass inside you need X lines to print them,
> > for AVSubtitleRect you need
> > for(i=0; i<subtitle.num_rects; i++){
> >     The same code used to print ASS somehow to the console.
> > }
> > 
> > That's 2 lines more, seems reasonable to me.
> Well, firstly I did change my ideas somewhat, and was advocating to
> always support converting to a "trivial" text format which allows only
> ordinary text + position.

If you look above at what you wrote, this little comparison was about
AVSubtitleRect vs. no AVSubtitleRect.
Given that, with ASS at most 2 extra lines of code are needed in
the case you picked with AVSubtitleRect compared to without it.
Similarly, for "simple" text 2 extra lines would be needed,
and one of these 2 lines is a "}".

Of course if now you compare ASS+AVSubtitleRect vs. ASCII then no doubt
ASCII wins in terms of rendering on a terminal. But really this is then no
longer an argument about AVSubtitleRect but rather one about ASS vs.
ASCII and it was you who suggested ASS as a good choice for a generic
If you want to convince me to drop ASS in AVSubtitleRect and rather use
UTF8, for now, I don't think you would have much difficulty convincing me.
But that would mean no formatting at all for now ...

> But, to "refute" your example, that only works if you can force the
> subtitle decoder to output all AVSubtitleRect as ASS, otherwise you must
> add a if here to filter out the bitmaps.

Your code will also need an if() somewhere to stop
graphics from dvd/dvb subtitle decoders. Thus this 1-line if() would be
needed by both.
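
To make the line counting concrete, here is a sketch of the loop plus the
one-line filter. The Rect and Subtitle types below are simplified,
hypothetical stand-ins (not the real AVSubtitle/AVSubtitleRect layout), and
the text is assumed to be already plain enough to print as-is.

```c
#include <stdio.h>

/* Hypothetical rectangle kinds: bitmap (dvd/dvb style) or text. */
enum RectType { RECT_BITMAP, RECT_TEXT };

/* Hypothetical, simplified stand-in for AVSubtitleRect. */
typedef struct Rect {
    enum RectType type;
    const char   *text;   /* only valid when type == RECT_TEXT */
} Rect;

/* Hypothetical, simplified stand-in for AVSubtitle. */
typedef struct Subtitle {
    int         num_rects;
    const Rect *rects;
} Subtitle;

/* Print every text rectangle on its own line; returns how many were printed. */
static int print_text_rects(const Subtitle *sub, FILE *out)
{
    int printed = 0;

    for (int i = 0; i < sub->num_rects; i++) {
        if (sub->rects[i].type != RECT_TEXT)   /* the 1-line if() */
            continue;
        fprintf(out, "%s\n", sub->rects[i].text);
        printed++;
    }
    return printed;
}
```

The filter is needed whether the text arrives per rectangle or as one
string, so it does not count against either representation.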

> Then, depending on how you actually do the coordinate stuff, you might
> have to call a function to convert the coordinate formulas to actual
> numbers. 

> Or will that information maybe already be encoded in the ASS?

Yes, I think this would be reasonable.

> Or maybe the offsets in the ASS string must be added to those of the
> AVSubtitleRect?

No, definitely not.

> Regardless of how you do it, it is one more question any
> user of the API must think about (or they will do like I do often enough,
> just do it the simple way, if it's wrong it's the API designer's fault
> for over-engineering).
> > > and maybe 50 more to display them at somewhat
> > > accurate positions (including setup work for ncurses or some such, and
> > > those numbers actually feel a bit high to me).
> > 
> > Actually I suspect that AVSubtitleRect will need fewer lines than a
> > single ass string.
> > 
> > Here is what it may look like:
> > for(i=0; i<subtitle.num_rects; i++){
> >     av_subtitle_get_position(subtitle.rect[i], movie_width, movie_height, pts, 80, 25, &x, &y);
> >     <the code to print a ass fragment at x,y>
> > }
> > 
> > If OTOH you have a single char* then at the least you first need to split it
> > into separately positioned parts.
> Well, to be honest, for such a simple use case any representation that
> contains any ASS is too complex, you can't do any proper "collision
> detection" or anything once anything more than AVSubtitleRect + plain
> text is involved.

Collision detection is a mandatory part of ASS rendering, even if you
dislike it. Avoiding AVSubtitleRect will not avoid it nor will it make it
You can of course simply not do it and hope it does not look too wrong.

    Collisions: This determines how subtitles are moved, when
    automatically preventing onscreen collisions.

    If the entry says "Normal" then SSA will attempt to position
    subtitles in the position specified by the "margins". However,
    subtitles can be shifted vertically to prevent onscreen collisions.
    With "normal" collision prevention, the subtitles will "stack up"
    one above the other - but they will always be positioned as close
    the vertical (bottom) margin as possible - filling in "gaps" in
    other subtitles if one large enough is available.

    If the entry says "Reverse" then subtitles will be shifted upwards
    to make room for subsequent overlapping subtitles. This means the
    subtitles can nearly always be read top-down - but it also means
    that the first subtitle can appear half way up the screen before
    the subsequent overlapping subtitles appear. It can use a lot of
    screen area.
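
The "Normal" rule quoted above can be sketched roughly as follows. This is
only an illustration under simplifying assumptions, not how any actual
SSA/ASS renderer is implemented: only the vertical axis is considered,
every box is assumed to span the full width, and the Box type and
place_normal() name are made up for this example.

```c
/* Hypothetical box already on screen: top edge and height, in pixels. */
typedef struct Box {
    int top;
    int h;
} Box;

/*
 * Find the top edge for a new subtitle of height h, as close to the
 * bottom margin as possible without overlapping any placed box, using
 * a gap between boxes if it is large enough.
 * Returns the top y coordinate, or -1 if nothing fits.
 */
static int place_normal(const Box *placed, int n,
                        int screen_h, int margin_v, int h)
{
    int top   = screen_h - margin_v - h;   /* lowest allowed position */
    int moved = 1;

    while (moved && top >= 0) {
        moved = 0;
        for (int i = 0; i < n; i++) {
            /* Vertical overlap between [top, top+h) and the placed box? */
            if (placed[i].top < top + h && placed[i].top + placed[i].h > top) {
                top   = placed[i].top - h;  /* move just above that box */
                moved = 1;
                break;                      /* rescan from the new position */
            }
        }
    }
    return top >= 0 ? top : -1;
}
```

For example, with a 480 pixel high screen, a 10 pixel margin and boxes at
top=430 and top=350 (both 40 high), a new 40 pixel subtitle lands in the
gap at top=390, while a 50 pixel one does not fit there and ends up at
top=300.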


Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The educated differ from the uneducated as much as the living from the
dead. -- Aristotle 