[FFmpeg-devel] [RFC] AST subtitles

Sun Nov 25 20:38:42 CET 2012

On Sun, Nov 25, 2012 at 07:46:58PM +0100, Nicolas George wrote:
> Le quartidi 4 frimaire, an CCXXI, Clément Bœsch a écrit :
> > Hi there,
> > 
> > I wrote a new prototype for storing text subtitles instead of a custom ASS
> > line like we currently do. It's trying to be flexible enough to be able to
> > deal with any kind of text subtitles markups, while being as simple as
> > possible for our users, but also for decoders and encoders.
> > 
> > Of course, we will have to deal with retro compat. The simpler I found
> > here was to introduce a new AVSubtitleType (SUBTITLE_AST), and we would
> > use a new field AVSubtitleRect->ast instead of AVSubtitleRect->ass.
> 
> That is reasonable. We can also get avcodec_decode_subtitle2 to fill in
> AVSubtitleRect->ass from AVSubtitleRect->ast or AVSubtitleRect->ast from
> AVSubtitleRect->ass, to get all decoders up-to-date transparently and to
> ensure compatibility for those that are converted.
> 

Yes that's exactly what I was trying to do, but I hit two problems
already, which I don't know how to solve:

 - we need to have an ASS encoder context to do that (and you can't just
   call the encode function directly because you need the global styles
   information), and this gets kind of ugly quickly.

 - the subtitles encoding API is limited: the encoder takes a buffer +
   buffer size instead of an AVPacket, which makes it kind of awkward to
   use within the decoder. The API needs a lift as you said at the end of
   your mail.

> > I used "AST" initially, but it's actually not an AST, so far it's just
> > kind of a list, feel free to propose a better random name; I took this
> > name because it expresses the fact that it's an arbitrary structure
> > layout, and not a data buffer like currently.
> 
> "AST" is rather obscure, on top of it being misleading.
> AVSubtitleStyledChunk? It does not matter much for the global design, of
> course.
> 

I'll keep in mind that proposition for the great rename. Maybe
AVStyledSubtitles prefix for everything could indeed do the trick, thanks.

> > 
> > Anyway, here are the basic structures:
> > 
> >     typedef struct AVSubtitleASTChunk {
> >         int type;           ///< one of the AVSUBTITLE_AST_SETTING_TYPE_*
> >         int reset;          /**< this chunk restores the setting to the default
> >                                  value (or disable the previous one in nested
> >                                  mode) */
> >         union {
> >             char *s;        ///< must be a av_malloc'ed string if string type
> >             double d;
> >             int i;
> >             int64_t i64;
> >             uint32_t u32;
> >             void *p;        /**< pointer to allocated data of an arbitrary
> >                                  size (chunk type dependent) */
> >         };
> >         int p_nb;           /**< number of entries in p, can be used for
> >                                  variable sized data */
> >     } AVSubtitleASTChunk;
> 
> The "p_nb" field name is inconsistent.
> 

nb_p?

> Also, I wonder whether we should optimize the memory allocation, by putting
> the extra allocated data (s or p) at the end of the structure
> 

I'll consider the re-ordering, yes.

> > 
> >     typedef struct AVSubtitleASTSettings {
> >         char *name;             ///< optional settings name reference
> >         int nb;                 ///< number of allocated chunks
> >         AVSubtitleASTChunk *v;  ///< array of nb chunks
> >     } AVSubtitleASTSettings;
> > 
> >     typedef struct AVSubtitleAST {
> >         const AVSubtitleASTSettings *g_settings;  /**< pointer to one of the global
> >                                                        settings for the subtitle event */
> >         AVSubtitleASTChunk *chunks;               ///< styles and text chunks
> >         int nb_chunks;                            ///< number of chunks
> >     } AVSubtitleAST;
> 
> I would be happier if the naming were a bit more consistent, especially the
> v/chunks field. The "g_" in "g_settings" is strange too.
> 

Heh sorry about that; I already removed the g_ locally, I'll make some
random renames when this stuff is ready.

> Adding an AVClass field on some of these structs may be a good idea too.
> 

Mmh, true.

> > A decoder will output an AVSubtitleAST for one event (we can imagine
> > multiple events at the same time in different AVSubtitleRect).
> > 
> > The main functions are:
> > 
> >     AVSubtitleAST *av_sub_ast_alloc(void);
> >     int av_sub_ast_add_chunk(AVSubtitleAST *sub, AVSubtitleASTChunk chunk);
> >     void av_sub_ast_free(AVSubtitleAST *sub);
> 
> It looks rather inefficient with regard to memory reallocation. I suggest to
> add an int argument to av_sub_ast_alloc() to indicate how many chunks it is
> likely to receive, and use a doubling reallocation in av_sub_ast_add_chunk()
> (it requires an additional integer field).
> 

I'm already doing a doubling reallocation. Sorry, I should have pasted the
code:

    /* based on av_dynarray_add() */
    static void *sub_dynarray_alloc(void **tab_ptr, int *nb_ptr, size_t elem_size)
    {
        int nb = *nb_ptr;
        uint8_t *tab = *tab_ptr;

        if ((nb & (nb - 1)) == 0) {
            int nb_alloc = nb == 0 ? 1 : nb * 2;
            tab = av_realloc_f(tab, nb_alloc, elem_size);
            if (!tab)
                return NULL;
            *tab_ptr = tab;
        }
        *nb_ptr = nb + 1;
        return tab + nb*elem_size;
    }

    int av_sub_ast_add_chunk(AVSubtitleAST *sub, AVSubtitleASTChunk chunk)
    {
        AVSubtitleASTChunk *p;

        p = sub_dynarray_alloc((void**)&sub->chunks, &sub->nb_chunks, sizeof(*sub->chunks));
        if (!p)
            return AVERROR(ENOMEM);
        *p = chunk;
        return 0;
    }

    int av_sub_ast_add_setting(AVSubtitleASTSettings *settings, AVSubtitleASTChunk chunk)
    {
        AVSubtitleASTChunk *p;

        p = sub_dynarray_alloc((void**)&settings->v, &settings->nb, sizeof(*settings->v));
        if (!p)
            return AVERROR(ENOMEM);
        *p = chunk;
        return 0;
    }

> > 
> >     AVSubtitleASTSettings *av_sub_ast_settings_alloc(const char *name);
> >     int av_sub_ast_add_setting(AVSubtitleASTSettings *settings, AVSubtitleASTChunk chunk);
> >     void av_sub_ast_settings_free(AVSubtitleASTSettings *settings);
> 
> Same remarks for that.
> 

See above.

Now another solution might be to make av_sub_ast_add_{chunk,setting}
returns a AVSubtitleASTChunk pointer, but I'm afraid it will be a little
convoluted for decoders who wants to prepare a chunk and might drop it if
the parsing doesn't end properly. Right now they use a AVSubtitleASTChunk
on the stack, fill it progressively, and if it gets fill with everything
needed it can be added (and the allocation happens at that time internally
if necessary).

> > 
> >     int av_sub_ast_nested_to_flat(AVSubtitleAST *sub);
> >     void av_sub_ast_cleanup(AVSubtitleAST *sub); // assume flat
> >     void av_sub_ast_dump(const AVSubtitleAST *sub);
> > 
> > Note that contrary to the structures, all these functions are private (they are
> > only necessary for decoders, users and encoders will browse the structures),
> > so please don't mind the "av_" prefix.
> > 
> > And finally here is a non exhaustive (yet) list of chunks:
> > 
> >     enum {
> >         AVSUBTITLE_AST_CHUNK_RAW_TEXT      = MKBETAG('t','e','x','t'),  // s
> >         AVSUBTITLE_AST_CHUNK_COMMENT       = MKBETAG('c','o','m',' '),  // s
> >         AVSUBTITLE_AST_CHUNK_TIMING        = MKBETAG('t','i','m','e'),  // i64
> >         AVSUBTITLE_AST_CHUNK_KARAOKE       = MKBETAG('k','a','r','a'),  // i
> >         AVSUBTITLE_AST_CHUNK_FONTNAME      = MKBETAG('f','o','n','t'),  // s
> >         AVSUBTITLE_AST_CHUNK_FONTSIZE      = MKBETAG('f','s','i','z'),  // i
> >         AVSUBTITLE_AST_CHUNK_COLOR         = MKBETAG('c','l','r','1'),  // u32
> >         AVSUBTITLE_AST_CHUNK_COLOR_2       = MKBETAG('c','l','r','2'),  // u32
> >         AVSUBTITLE_AST_CHUNK_COLOR_OUTLINE = MKBETAG('c','l','r','O'),  // u32
> >         AVSUBTITLE_AST_CHUNK_COLOR_BACK    = MKBETAG('c','l','r','B'),  // u32
> >         AVSUBTITLE_AST_CHUNK_BOLD          = MKBETAG('b','o','l','d'),  // i
> >         AVSUBTITLE_AST_CHUNK_ITALIC        = MKBETAG('i','t','a','l'),  // i
> >         AVSUBTITLE_AST_CHUNK_STRIKEOUT     = MKBETAG('s','t','r','k'),  // i
> >         AVSUBTITLE_AST_CHUNK_UNDERLINE     = MKBETAG('u','n','l','n'),  // i
> >         AVSUBTITLE_AST_CHUNK_BORDER_STYLE  = MKBETAG('b','d','e','r'),  // i
> >         AVSUBTITLE_AST_CHUNK_OUTLINE       = MKBETAG('o','u','t','l'),  // i
> >         AVSUBTITLE_AST_CHUNK_SHADOW        = MKBETAG('s','h','a','d'),  // i
> >         AVSUBTITLE_AST_CHUNK_ALIGNMENT     = MKBETAG('a','l','g','n'),  // i
> >         AVSUBTITLE_AST_CHUNK_MARGIN_L      = MKBETAG('m','a','r','L'),  // i
> >         AVSUBTITLE_AST_CHUNK_MARGIN_R      = MKBETAG('m','a','r','R'),  // i
> >         AVSUBTITLE_AST_CHUNK_MARGIN_V      = MKBETAG('m','a','r','V'),  // i
> >         AVSUBTITLE_AST_CHUNK_ALPHA_LEVEL   = MKBETAG('a','l','p','h'),  // i
> >         AVSUBTITLE_AST_CHUNK_POSITION      = MKBETAG('p','o','s',' '),  // p (2 x i32: x, y)
> >         AVSUBTITLE_AST_CHUNK_MOVE          = MKBETAG('m','o','v','e'),  // p (4 x i32: x1, y1, x2, y2)
> >         AVSUBTITLE_AST_CHUNK_LINEBREAK     = MKBETAG('l','b','r','k'),  // i
> >     };
> > 
> > (Note: using named chunk is handy for debug, and adding/re-order styles without
> > breaking API since they will be exposed to the user)
> > 
> > Here is what a decoder will basically do:
> > 
> >  - If the markup needs it, the decoder will create default style profiles.
> >    To do so, one or more AVSubtitleASTSettings can be allocated using
> >    av_sub_ast_add_setting(), with a name for each one. Each of them
> >    contains a list of AVSubtitleASTChunk, one for each custom style:
> > 
> >      "default" [italic=1][bold=1][fontface="Arial"]
> >      "fancy"   [color=red][underline=1][fontface="Comic Sans"]
> >      ...
> 
> I do not see any mention of the styles profiles in AVCodecContext. IIRC, all
> global styles need to be accessible at the encoder init stage since they
> will go in the codec extradata and then in the file header.
> 

Yes, I don't know yet, we'll indeed likely need to allocate the
AVSubtitleASTSettings into the AVCodecContext in the decoder init
callback. The SubRip decoder doesn't have global settings so I didn't have
yet to think much about it.

> >  - Each time a decoder receive a subtitles buffer, a new AVSubtitleAST is
> >    allocated with av_sub_ast_alloc(). If necessary, it can be associated
> >    with one of the global AVSubtitleASTSettings for the default values.
> 
> What about CSS-based styling systems? Several CSS rules can apply to a
> single subtitle event.

I'd say the decoder will have to make its own list of profiles depending
on the set of styles it expects. I don't really want to make the users and
encoders deal with with complex trees of styles and inheritance processes.
That information won't be restored properly in most of the output
subtitles, so since we will likely have to "flatten" this stuff before
encoding, I'd say it's up to the decoder to make the stupid markup it is
parsing accessible & simple for any encoder.

[...]
> > Now the reset chunks really means a fallback to the default.
> > 
> > Another nice thing we can do now is to clean-up the whole thing with
> > av_sub_ast_cleanup():
> > 
> > AST subtitle dump 0x258a900:
> >   [text] 'hello'
> >   [clr1] 00FF0000
> >   [lbrk] 1
> >   [text] 'bar'
> >   [fsiz] 3
> >   [clr1] 000000FF
> >   [text] 'bla'
> >   [fsiz] (RESET/CLOSE)
> >   [clr1] 00FF0000
> >   [lbrk] 1
> >   [ital] 1
> >   [fsiz] 5
> >   [text] 'yyyy'
> >   [fsiz] (RESET/CLOSE)
> >   [text] 'xxx'
> > 
> > That function trims what's not text at the end (style tags and line
> > breaks). It also trims the initial spaces. Now this event can be perfectly
> > represented into a flat markup (such as ASS), and it's pretty easy to
> > write an ASS encoder from this.
> 
> Looks reasonable.
> 
> > Now the other way around (encoding to a nested markup such as SubRip)
> > isn't that complicated either. The encoder just needs to make sure all the
> > tags are closed at the end by browsing the list in reverse. There is on
> > the other hand a little problem of overhead: one chunk can contain only
> > one style, which means you will get on the output multiple <font> tags
> > with one attribute instead of one <font> with multiple attributes. I'm not
> > sure that's worth trying to "compress" this though, given the complexity
> > it might add for simple cases. The SubRip encoder can of course have its
> > own heuristics to deal with this.
> 
> I agree. A basic version of the grouping algorithm seems pretty simple for
> any particular case. Factoring can come later if we see similar code.
> 
> > Anyway, I still have various integration problems because of the API and
> > ABI constraints we have, but I think we can do something with this stuff.
> > 
> > Comments?
> 
> It looks very good on the whole. Thanks for having worked on it.
> 

The more I do this, the more I realize there is a lot of things to solve
before that…

> What happened to the project of getting avcodec_{en,de}code_subtitle() work
> with an AVPacket, and replacing AVSubtitle with another structure that is
> not user-allocated?
> 

The decoding function is already taking an AVPacket and outputting an
AVSubtitle. The encoding function on the other hand is pretty much a relic
of the past, with one buffer + a buffer size as stated at the beginning of
this mail (look at how ugly it's done in ffmpeg…). This is IMO what we
should acknowledge first, but it's not simple.

There are actually all sort of factors to deal with, so let me summarize:

 - as you just said, making AVSubtitle heap allocated is one step ahead,
   but it's actually not blocking for what I'm doing right now: the
   SUBTITLE_AST means to add a field in the rectangle structure (which is
   allocated internally), not AVSubtitle, so it shouldn't really matter.
   Though, it might be required later, so feel free to do it.

 - currently, the text subtitles decoder are filling the rects[x]->ass
   fields not only with the ASS "payload" but as if they were /lines/ of a
   ASS file. The transformation can be summarized as the following:

     o "Dialogue: " is added at the beginning
     o start time and end time are added in the payload
     o field order is dropped
     o \r\n is added at the end

   This is problematic because these packets data can not be sent to
   libass in a sane manner: instead of using libass/ass_process_chunk()
   like you would do with a simple packet, you need to call
   libass/ass_process_data() (note that this was added to libass for that
   specific reason), to re-parse the whole line. This is by the way why
   MPlayer is doing a memcmp("Dialogue:"...) on the data packet…

   Now that AVPackets contains the pts and duration, I think it's wise to
   make the ASS and Matroska demuxers output these packets in a proper
   format. We will need to change a few things such as making the ASS
   muxer add the timing, and remove the hack from the Matroska muxer.

   I don't mind doing this, but since it will change the layout of the
   packets, it will break application expected ASS line and not ASS raw
   packets. Any opinion?

 - the encoding API with buffer + buffer size needs a re-lift as I just
   said for the 3rd time, so anyone is very welcome to work on this. I
   actually need this to go ahead, but since I don't want to do it I think
   I'll work on something else until someone decides to do it… :)

I think we will need to consider the encoding/charset as well at some
point too, but it should be doable in a nice way with the
AVStyledSubtitles, hopefully.

-- 
Clément B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 490 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20121125/52f8cc24/attachment.asc>