[FFmpeg-devel] Internal handling of subtitles in ffmpeg

Michael Niedermayer michaelni
Thu Jan 1 22:36:36 CET 2009

On Thu, Jan 01, 2009 at 08:38:57PM +0100, Reimar Döffinger wrote:
> On Thu, Jan 01, 2009 at 06:56:45PM +0100, Michael Niedermayer wrote:
> > > > Also in the light of "horribly complex", does it not feel horribly complex
> > > > to require every ASS->X bitstream filter to be able to extract things like
> > > > position? I mean, in my suggestion these would be stored in an easily accessible
> > > > struct, doing the extraction at just one spot.
> > > 
> > > And they would be wrong for any "non-trivial" text subtitle.
> > 
> > I think you misunderstand what I am suggesting.
> > I do not suggest converting "left margin 5, top middle" to (512,50)-(600,100)
> > but rather storing an exact semantic equivalent of
> > "left margin 5, top middle" in AVSubtitleRect.
> Well, then instead of every encoder implementing an ASS decoder they all
> implement an AVSubtitleRect decoder?
> As I see it, either your AVSubtitleRect can represent only a small
> fraction (well, probably quite a large fraction of what is actually
> used) or it is no longer any simpler than an ASS blob.
> The question is, what is AVSubtitleRect or whatever you want to call it
> supposed to represent? What is the advantage it is supposed to add?

The advantage is the same as that of using AVCodecContext instead of
using a char* of an mpeg4 header to represent the related info.
It would very well be possible to make our mpeg2 decoder convert width/height
and so on into an mpeg4 bitstream and export that ...
It's just that working with int, float, ... is easier than parsing bitstreams
or strings.
Besides, if some information from mpeg2 has no place in mpeg4, it's a lot easier
to add the extra field or value to a struct than to find some way to squeeze
it into a string or bitstream.
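
A minimal sketch of that contrast, for illustration only. SketchPlacement and
both helper functions are hypothetical names, not part of libavcodec, and the
string variant assumes the common ASS "Dialogue:" field order
(Layer,Start,End,Style,Name,MarginL,MarginR,MarginV,Effect,Text):

```c
#include <stdlib.h>

/* Hypothetical struct for illustration only; not the real AVSubtitleRect. */
typedef struct SketchPlacement {
    int margin_l; /* left margin in pixels */
    int margin_r; /* right margin in pixels */
    int margin_v; /* vertical margin in pixels */
} SketchPlacement;

/* Struct route: the value is already an int, so reading it is one access. */
static int margin_l_from_struct(const SketchPlacement *p)
{
    return p->margin_l;
}

/*
 * String route: dig the same value out of an ASS-style "Dialogue:" line.
 * Assumes MarginL is the field that follows the fifth comma.
 * Returns -1 if the line has fewer than five commas.
 */
static int margin_l_from_line(const char *line)
{
    int commas = 0;
    for (const char *p = line; *p; p++) {
        if (*p == ',' && ++commas == 5)
            return atoi(p + 1);
    }
    return -1;
}
```

Every consumer of the string form has to repeat (and agree on) the parsing
step, while a consumer of the struct form only reads a field.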

> What meaning does it have if two text parts (e.g. words) are in a different
> AVSubtitleRect? What if they are in the same one? That is unclear to me.

If a subtitle stores both words together in a string, both would be in one
AVSubtitleRect.
If the subtitle stored them separately, each with some position (like left and
right middle, right aligned with margin ... or some x/y coords), then they
would be in separate AVSubtitleRects.
As an analogy, an AVSubtitleRect might be thought of as a paragraph or similar
block-level element in HTML.
In that light the overwhelming majority of subtitles should have only one
AVSubtitleRect in each "frame".

> And will you require a width/height for AVSubtitleRect or not?

If there's a w/h stored in an easily accessible way, it should be set in
AVSubtitleRect by the decoder.
Knowing the w/h is probably also useful for positioning AVSubtitleRects so they
do not overlap.
I am not suggesting that a decoder should call libfreetype to render the text
to find out how large it would be ...

> Generating those might be a lot of wasted effort for formats that are
> similar (the same actually applies to X/Y if they are some sin(time) +
> ... I don't know if any subtitle formats actually do this, but they
> might specify the position in a way that allows interpolation for frames
> generated during deinterlacing, would you want AVSubtitleRect to be able
> to handle that as well?).

Hmm, I think position interpolation should be supported somehow, but I am not
sure how this would best be done ...

> > > > and general case here means
> > > > text -> text while not losing effects when the destination supports the
> > > >     effects
> > > > text -> bitmaps (not a single 95% transparent screen sized bitmap)
> > > > bitmaps -> display (with bitmaps not being colorspace converted twice)
> > > > text+bitmaps -> text+bitmaps
> > > 
> > > Well, I just think you'd have to extend this to have at least those
> > > "basic" subtitle types:
> > > "DATA blob" (ASS with bitmap support extensions?, not possible to correctly
> > > represent as AVSubtitleRects, thus not using them - alternatively
> > > giving up on a common representation format for anything so advanced)
> > > "trivial" bitmap only (using AVSubtitleRects)
> > > "trivial" text only (using AVSubtitleRects)
> > > "trivial" bitmap+text (using AVSubtitleRects)
> > 
> > Please elaborate on what you consider trivial and non trivial, i have
> > difficulty understanding this.
> "trivial": fixed position, no effects/transformations or anything.
> Should be possible to render onto screen with no more than maybe 100
> lines of code.
> That is the meaning AVSubtitleRect has for me currently, due to the way
> it is designed currently: something really easy to put over a video, and
> IMHO it is unacceptable to lose this (but as said it could be a special
> "pixfmt").

> > To me, any way to specify a position in an unambiguous way is equivalent,
> > I mean no matter if text is specified with pixel based margins rectangle
> > left/right justified flags, screen or display relative coordinates with some
> > rotation/shear/... (aka affine transformation) or other.
> Ok, I'll try to say it in a different way: I currently feel that by
> extending AVSubtitleRect that way you will lose simplicity without
> gaining anything, and that worries me (you know, that make simple things
> really simple and hard things possible thing).

> I think a good criterion for a good API here to me would be that you'd need
> maybe 20 lines of code to make ffplay just display all text subtitles on the
> console during playback,

For a single char * with ASS inside you need X lines to print them;
for AVSubtitleRect you need
for(i=0; i<subtitle.num_rects; i++){
    <the same code used to print ASS somehow to the console>
}

That's 2 lines more, which seems reasonable to me.
Now I won't dispute that this might fail if some subtitle encoded each word
separately with a position, but in that case it's not so clear whether the ASS
strings would contain them correctly ordered for a 20 line print() to print
them either ...
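
A rough sketch of such a loop, written against stand-in types so it compiles
on its own. SketchTextRect, SketchSubtitle and sketch_format_subtitle are
hypothetical names; only num_rects mirrors the pseudocode above. It collects
the text of every rect into a buffer, one rect per line, which a player could
then write to the console:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not the libavcodec types. */
typedef struct SketchTextRect {
    const char *text; /* NULL for a rect that carries no text (e.g. a bitmap) */
} SketchTextRect;

typedef struct SketchSubtitle {
    unsigned num_rects;
    const SketchTextRect *rects;
} SketchSubtitle;

/*
 * Append the text of each rect to out, one rect per line.
 * Returns the number of bytes written, not counting the terminating NUL.
 * A rect that does not fit is dropped together with all rects after it.
 */
static size_t sketch_format_subtitle(const SketchSubtitle *sub,
                                     char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (unsigned i = 0; i < sub->num_rects; i++) {
        const char *text = sub->rects[i].text;
        int n;

        if (!text)
            continue; /* nothing printable in this rect */

        n = snprintf(out + used, out_size - used, "%s\n", text);
        if (n < 0 || (size_t)n >= out_size - used) {
            out[used] = '\0'; /* discard the partially written line */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

The per-rect body is the same work a single-string API would need; the loop
around it is the only extra cost.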

> and maybe 50 more to display them at somewhat
> accurate positions (including setup work for ncurses or some such, and
> those numbers actually feel a bit high to me).

Actually I suspect that AVSubtitleRect will need fewer lines than a
single ASS string.

Here is what it may look like:
for(i=0; i<subtitle.num_rects; i++){
    av_subtitle_get_position(subtitle.rect[i], movie_width, movie_height, pts, 80, 25, &x, &y);
    <the code to print an ASS fragment at x,y>
}

If OTOH you have a single char* then at the least you first need to split it
into separately positioned parts.
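
One way a helper like av_subtitle_get_position() could behave is sketched
below. This is only a guess at the idea, not an existing FFmpeg function:
SketchRect, its fields and sketch_get_position are hypothetical, pts is left
out (so no animation or interpolation), the text is assumed to be single-line
ASCII, and movie_w/movie_h are assumed to be positive. It maps a semantic
placement such as "top middle" plus pixel margins onto a console grid:

```c
#include <string.h>

/* Hypothetical types for illustration; not part of libavcodec. */
enum SketchHAlign { SKETCH_LEFT, SKETCH_CENTER, SKETCH_RIGHT };
enum SketchVAlign { SKETCH_TOP, SKETCH_MIDDLE, SKETCH_BOTTOM };

typedef struct SketchRect {
    enum SketchHAlign halign;
    enum SketchVAlign valign;
    int margin_l, margin_r, margin_v; /* pixels, relative to the movie size */
    const char *text;                 /* assumed single-line ASCII */
} SketchRect;

static int sketch_clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Compute the console cell (x, y) where the first character should go. */
static void sketch_get_position(const SketchRect *r, int movie_w, int movie_h,
                                int cols, int rows, int *x, int *y)
{
    int len = (int)strlen(r->text);
    /* Scale pixel margins to console cells (rounded down). */
    int ml = r->margin_l * cols / movie_w;
    int mr = r->margin_r * cols / movie_w;
    int mv = r->margin_v * rows / movie_h;

    switch (r->halign) {
    case SKETCH_LEFT:  *x = ml;                                break;
    case SKETCH_RIGHT: *x = cols - mr - len;                   break;
    default:           *x = ml + (cols - ml - mr - len) / 2;   break;
    }
    switch (r->valign) {
    case SKETCH_TOP:    *y = mv;              break;
    case SKETCH_BOTTOM: *y = rows - 1 - mv;   break;
    default:            *y = (rows - 1) / 2;  break;
    }

    /* Keep the start cell on screen even for oversized text or margins. */
    *x = sketch_clamp(*x, 0, cols - 1);
    *y = sketch_clamp(*y, 0, rows - 1);
}
```

Because the placement is kept in semantic form, the same rect can be resolved
against an 80x25 console or against the real movie dimensions without
re-parsing anything.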

Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Let us carefully observe those good qualities wherein our enemies excel us
and endeavor to excel them, by avoiding what is faulty, and imitating what
is excellent in them. -- Plutarch