[FFmpeg-devel] [RFC] AST subtitles

Sun Nov 25 19:46:58 CET 2012

On quartidi 4 Frimaire, year CCXXI, Clément Bœsch wrote:
> Hi there,
> 
> I wrote a new prototype for storing text subtitles instead of a custom ASS
> line like we currently do. It's trying to be flexible enough to be able to
> deal with any kind of text subtitles markups, while being as simple as
> possible for our users, but also for decoders and encoders.
> 
> Of course, we will have to deal with backward compatibility. The simplest
> approach I found was to introduce a new AVSubtitleType (SUBTITLE_AST), and
> we would use a new field AVSubtitleRect->ast instead of AVSubtitleRect->ass.

That is reasonable. We can also get avcodec_decode_subtitle2 to fill in
AVSubtitleRect->ass from AVSubtitleRect->ast or AVSubtitleRect->ast from
AVSubtitleRect->ass, to get all decoders up-to-date transparently and to
ensure compatibility for those that are converted.

> I used "AST" initially, but it's actually not an AST, so far it's just
> kind of a list, feel free to propose a better random name; I took this
> name because it expresses the fact that it's an arbitrary structure
> layout, and not a data buffer like currently.

"AST" is rather obscure, on top of it being misleading.
AVSubtitleStyledChunk? It does not matter much for the global design, of
course.

> 
> Anyway, here are the basic structures:
> 
>     typedef struct AVSubtitleASTChunk {
>         int type;           ///< one of the AVSUBTITLE_AST_CHUNK_*
>         int reset;          /**< this chunk restores the setting to the default
>                                  value (or disable the previous one in nested
>                                  mode) */
>         union {
>             char *s;        ///< must be a av_malloc'ed string if string type
>             double d;
>             int i;
>             int64_t i64;
>             uint32_t u32;
>             void *p;        /**< pointer to allocated data of an arbitrary
>                                  size (chunk type dependent) */
>         };
>         int p_nb;           /**< number of entries in p, can be used for
>                                  variable sized data */
>     } AVSubtitleASTChunk;

The "p_nb" field name is inconsistent with the nb_chunks naming used in
AVSubtitleAST; something like nb_p would match.

Also, I wonder whether we should optimize the memory allocation by putting
the extra allocated data (s or p) at the end of the structure.
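
A minimal sketch of the idea, assuming a chunk that carries a string. The
StyledChunk type and the styled_chunk_new_text() helper are hypothetical
names, not part of the prototype, and plain malloc() stands in for
av_malloc():

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical chunk layout: the payload lives in the same allocation
 * as the header, so a single malloc()/free() pair covers both. */
typedef struct StyledChunk {
    int    type;    /* chunk type tag */
    size_t size;    /* payload size in bytes, including the final 0 */
    char   data[];  /* C99 flexible array member holding the payload */
} StyledChunk;

/* Allocate a chunk and copy the string into its tail.
 * Returns NULL on allocation failure. */
static StyledChunk *styled_chunk_new_text(int type, const char *text)
{
    size_t len = strlen(text) + 1;
    StyledChunk *c = malloc(sizeof(*c) + len);

    if (!c)
        return NULL;
    c->type = type;
    c->size = len;
    memcpy(c->data, text, len);
    return c;
}
```

The trade-off is that such chunks no longer have a fixed size, so the list
would have to become an array of pointers instead of an array of structs.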

> 
>     typedef struct AVSubtitleASTSettings {
>         char *name;             ///< optional settings name reference
>         int nb;                 ///< number of allocated chunks
>         AVSubtitleASTChunk *v;  ///< array of nb chunks
>     } AVSubtitleASTSettings;
> 
>     typedef struct AVSubtitleAST {
>         const AVSubtitleASTSettings *g_settings;  /**< pointer to one of the global
>                                                        settings for the subtitle event */
>         AVSubtitleASTChunk *chunks;               ///< styles and text chunks
>         int nb_chunks;                            ///< number of chunks
>     } AVSubtitleAST;

I would be happier if the naming were a bit more consistent, especially the
v/chunks field. The "g_" in "g_settings" is strange too.

Adding an AVClass field on some of these structs may be a good idea too.

> A decoder will output an AVSubtitleAST for one event (we can imagine
> multiple events at the same time in different AVSubtitleRect).
> 
> The main functions are:
> 
>     AVSubtitleAST *av_sub_ast_alloc(void);
>     int av_sub_ast_add_chunk(AVSubtitleAST *sub, AVSubtitleASTChunk chunk);
>     void av_sub_ast_free(AVSubtitleAST *sub);

It looks rather inefficient with regard to memory reallocation. I suggest
adding an int argument to av_sub_ast_alloc() to indicate how many chunks it
is likely to receive, and using a doubling reallocation in
av_sub_ast_add_chunk() (it requires an additional integer field).
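
Something along these lines; this is only a sketch with hypothetical names
(Chunk, ChunkList, chunk_list_*), using the libc allocator instead of the
av_* one and ignoring integer overflow for brevity:

```c
#include <stdlib.h>

typedef struct Chunk {
    int type;
    int value;
} Chunk;

typedef struct ChunkList {
    Chunk *chunks;
    int    nb_chunks;  /* entries in use */
    int    capacity;   /* entries allocated: the additional field */
} ChunkList;

/* size_hint is the number of chunks the caller expects to add. */
static ChunkList *chunk_list_alloc(int size_hint)
{
    ChunkList *l = calloc(1, sizeof(*l));

    if (!l)
        return NULL;
    if (size_hint > 0) {
        l->chunks = malloc((size_t)size_hint * sizeof(*l->chunks));
        if (!l->chunks) {
            free(l);
            return NULL;
        }
        l->capacity = size_hint;
    }
    return l;
}

/* Append a chunk, doubling the array when it is full.
 * Returns 0 on success, -1 on allocation failure. */
static int chunk_list_add(ChunkList *l, Chunk chunk)
{
    if (l->nb_chunks == l->capacity) {
        int new_cap = l->capacity ? l->capacity * 2 : 4;
        Chunk *tmp = realloc(l->chunks, (size_t)new_cap * sizeof(*tmp));

        if (!tmp)
            return -1;
        l->chunks   = tmp;
        l->capacity = new_cap;
    }
    l->chunks[l->nb_chunks++] = chunk;
    return 0;
}

static void chunk_list_free(ChunkList *l)
{
    if (l) {
        free(l->chunks);
        free(l);
    }
}
```

With a hint of 2, adding five chunks reallocates twice (to 4, then to 8)
instead of once per chunk.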

> 
>     AVSubtitleASTSettings *av_sub_ast_settings_alloc(const char *name);
>     int av_sub_ast_add_setting(AVSubtitleASTSettings *settings, AVSubtitleASTChunk chunk);
>     void av_sub_ast_settings_free(AVSubtitleASTSettings *settings);

Same remarks for that.

> 
>     int av_sub_ast_nested_to_flat(AVSubtitleAST *sub);
>     void av_sub_ast_cleanup(AVSubtitleAST *sub); // assume flat
>     void av_sub_ast_dump(const AVSubtitleAST *sub);
> 
> Note that, contrary to the structures, all these functions are private (they
> are only necessary for decoders; users and encoders will browse the
> structures directly), so please don't mind the "av_" prefix.
> 
> And finally, here is a (not yet exhaustive) list of chunks:
> 
>     enum {
>         AVSUBTITLE_AST_CHUNK_RAW_TEXT      = MKBETAG('t','e','x','t'),  // s
>         AVSUBTITLE_AST_CHUNK_COMMENT       = MKBETAG('c','o','m',' '),  // s
>         AVSUBTITLE_AST_CHUNK_TIMING        = MKBETAG('t','i','m','e'),  // i64
>         AVSUBTITLE_AST_CHUNK_KARAOKE       = MKBETAG('k','a','r','a'),  // i
>         AVSUBTITLE_AST_CHUNK_FONTNAME      = MKBETAG('f','o','n','t'),  // s
>         AVSUBTITLE_AST_CHUNK_FONTSIZE      = MKBETAG('f','s','i','z'),  // i
>         AVSUBTITLE_AST_CHUNK_COLOR         = MKBETAG('c','l','r','1'),  // u32
>         AVSUBTITLE_AST_CHUNK_COLOR_2       = MKBETAG('c','l','r','2'),  // u32
>         AVSUBTITLE_AST_CHUNK_COLOR_OUTLINE = MKBETAG('c','l','r','O'),  // u32
>         AVSUBTITLE_AST_CHUNK_COLOR_BACK    = MKBETAG('c','l','r','B'),  // u32
>         AVSUBTITLE_AST_CHUNK_BOLD          = MKBETAG('b','o','l','d'),  // i
>         AVSUBTITLE_AST_CHUNK_ITALIC        = MKBETAG('i','t','a','l'),  // i
>         AVSUBTITLE_AST_CHUNK_STRIKEOUT     = MKBETAG('s','t','r','k'),  // i
>         AVSUBTITLE_AST_CHUNK_UNDERLINE     = MKBETAG('u','n','l','n'),  // i
>         AVSUBTITLE_AST_CHUNK_BORDER_STYLE  = MKBETAG('b','d','e','r'),  // i
>         AVSUBTITLE_AST_CHUNK_OUTLINE       = MKBETAG('o','u','t','l'),  // i
>         AVSUBTITLE_AST_CHUNK_SHADOW        = MKBETAG('s','h','a','d'),  // i
>         AVSUBTITLE_AST_CHUNK_ALIGNMENT     = MKBETAG('a','l','g','n'),  // i
>         AVSUBTITLE_AST_CHUNK_MARGIN_L      = MKBETAG('m','a','r','L'),  // i
>         AVSUBTITLE_AST_CHUNK_MARGIN_R      = MKBETAG('m','a','r','R'),  // i
>         AVSUBTITLE_AST_CHUNK_MARGIN_V      = MKBETAG('m','a','r','V'),  // i
>         AVSUBTITLE_AST_CHUNK_ALPHA_LEVEL   = MKBETAG('a','l','p','h'),  // i
>         AVSUBTITLE_AST_CHUNK_POSITION      = MKBETAG('p','o','s',' '),  // p (2 x i32: x, y)
>         AVSUBTITLE_AST_CHUNK_MOVE          = MKBETAG('m','o','v','e'),  // p (4 x i32: x1, y1, x2, y2)
>         AVSUBTITLE_AST_CHUNK_LINEBREAK     = MKBETAG('l','b','r','k'),  // i
>     };
> 
> (Note: using named chunks is handy for debugging, and allows adding or
> reordering styles without breaking the API, since they will be exposed to
> the user)
> 
> Here is what a decoder will basically do:
> 
>  - If the markup needs it, the decoder will create default style profiles.
>    To do so, one or more AVSubtitleASTSettings can be allocated using
>    av_sub_ast_settings_alloc(), with a name for each one. Each of them
>    contains a list of AVSubtitleASTChunk, one for each custom style:
> 
>      "default" [italic=1][bold=1][fontface="Arial"]
>      "fancy"   [color=red][underline=1][fontface="Comic Sans"]
>      ...

I do not see any mention of the style profiles in AVCodecContext. IIRC, all
global styles need to be accessible at the encoder init stage since they
will go in the codec extradata and then in the file header.

>  - Each time a decoder receives a subtitle buffer, a new AVSubtitleAST is
>    allocated with av_sub_ast_alloc(). If necessary, it can be associated
>    with one of the global AVSubtitleASTSettings for the default values.

What about CSS-based styling systems? Several CSS rules can apply to a
single subtitle event.

>    Then while parsing the buffer, the decoder will insert chunks of text
>    or style using av_sub_ast_add_chunk():
> 
>      [text="hello"][color=blue][text="world"]...
> 
> In order to test a bit if that can work, I've rewritten the SubRip
> decoder, which is a bit special since it has a nested markup, while the
> AVSubtitleAST will only be considered as flat (since it's easier to deal
> with for users).
> 
> Let's take an example of how it works with the following marked-up event:
> 
>     1
>     00:00:00,000 --> 00:00:30,000
>               hello<font color="red">
>     bar<font size="3" color="blue">bla</font>
>     <i><font size="5"             >yyyy</font>xxx</i>
>     </font>
> 
> So first, the decoder allocates a new AVSubtitleAST, and fills it with
> chunks. This is what it looks like at the end of the parsing:
> 
>     AST subtitle dump 0x258a900:
>       [text] '          hello'
>       [clr1] 00FF0000
>       [lbrk] 1
>       [text] 'bar'
>       [fsiz] 3
>       [clr1] 000000FF
>       [text] 'bla'
>       [fsiz] (RESET/CLOSE)
>       [clr1] (RESET/CLOSE)
>       [lbrk] 1
>       [ital] 1
>       [fsiz] 5
>       [text] 'yyyy'
>       [fsiz] (RESET/CLOSE)
>       [text] 'xxx'
>       [ital] (RESET/CLOSE)
>       [lbrk] 1
>       [clr1] (RESET/CLOSE)
>       [lbrk] 1
> 
> Note that the decoder inserted some "reset" chunks with the "close"
> meaning: these chunks say to close the most recently opened chunk of the
> same type. In the flat representation, however, a reset means going back
> to the default style. That is why this decoder must call
> av_sub_ast_nested_to_flat() after parsing, which changes the AVSubtitleAST
> into:
> 
>     AST subtitle dump 0x258a900:
>       [text] '          hello'
>       [clr1] 00FF0000
>       [lbrk] 1
>       [text] 'bar'
>       [fsiz] 3
>       [clr1] 000000FF
>       [text] 'bla'
>       [fsiz] (RESET/CLOSE)
>       [clr1] 00FF0000
>       [lbrk] 1
>       [ital] 1
>       [fsiz] 5
>       [text] 'yyyy'
>       [fsiz] (RESET/CLOSE)
>       [text] 'xxx'
>       [ital] (RESET/CLOSE)
>       [lbrk] 1
>       [clr1] (RESET/CLOSE)
>       [lbrk] 1
> 
> Now the reset chunks really mean a fallback to the default.
> 
> Another nice thing we can do now is to clean up the whole thing with
> av_sub_ast_cleanup():
> 
>     AST subtitle dump 0x258a900:
>       [text] 'hello'
>       [clr1] 00FF0000
>       [lbrk] 1
>       [text] 'bar'
>       [fsiz] 3
>       [clr1] 000000FF
>       [text] 'bla'
>       [fsiz] (RESET/CLOSE)
>       [clr1] 00FF0000
>       [lbrk] 1
>       [ital] 1
>       [fsiz] 5
>       [text] 'yyyy'
>       [fsiz] (RESET/CLOSE)
>       [text] 'xxx'
> 
> That function trims whatever is not text at the end (style tags and line
> breaks). It also trims the initial spaces. Now this event can be perfectly
> represented in a flat markup (such as ASS), and it's pretty easy to
> write an ASS encoder from this.

Looks reasonable.
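
To check my understanding of the nested-to-flat step against the dumps
above, here is a minimal sketch. It is not the prototype's code: FlatChunk,
the T_* constants, MAX_DEPTH and nested_to_flat() are hypothetical, and only
integer payloads are handled. It keeps one stack of open values per style
type; a "close" that uncovers an outer value is rewritten into a plain set
of that value, and a "close" with nothing left open stays a reset:

```c
enum { T_TEXT, T_COLOR, T_FONTSIZE, T_ITALIC, T_NB };

#define MAX_DEPTH 16

typedef struct FlatChunk {
    int      type;   /* one of the T_* constants */
    int      reset;  /* nested mode: close; flat mode: back to default */
    unsigned value;  /* payload, ignored for reset and text chunks */
} FlatChunk;

/* Convert close chunks into flat semantics, in place.
 * Returns 0 on success, -1 on an unknown type or too deep nesting. */
static int nested_to_flat(FlatChunk *chunks, int nb_chunks)
{
    unsigned stack[T_NB][MAX_DEPTH];
    int depth[T_NB] = { 0 };
    int i;

    for (i = 0; i < nb_chunks; i++) {
        FlatChunk *c = &chunks[i];
        int t = c->type;

        if (t < 0 || t >= T_NB)
            return -1;
        if (t == T_TEXT)
            continue;
        if (!c->reset) {
            /* opening a style: remember its value */
            if (depth[t] == MAX_DEPTH)
                return -1;
            stack[t][depth[t]++] = c->value;
        } else if (depth[t] > 0) {
            /* closing a style: drop it, then restore the outer one if any */
            depth[t]--;
            if (depth[t] > 0) {
                c->reset = 0;
                c->value = stack[t][depth[t] - 1];
            }
        }
    }
    return 0;
}
```

On the example event, the close of the blue color becomes a set of
00FF0000, while the font size closes and the final color close remain
resets, which matches the second dump.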

> Now the other way around (encoding to a nested markup such as SubRip)
> isn't that complicated either. The encoder just needs to make sure all the
> tags are closed at the end by browsing the list in reverse. There is on
> the other hand a little problem of overhead: one chunk can contain only
> one style, which means you will get on the output multiple <font> tags
> with one attribute instead of one <font> with multiple attributes. I'm not
> sure that's worth trying to "compress" this though, given the complexity
> it might add for simple cases. The SubRip encoder can of course have its
> own heuristics to deal with this.

I agree. A basic version of the grouping algorithm seems pretty simple for
any particular case. Factoring can come later if we see similar code.
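
For SubRip, a basic version could look like the sketch below. All names
(GChunk, G_*, group_font_run()) are hypothetical, and it only handles color
and size attributes. It merges a run of consecutive, non-reset font chunks
into a single <font> tag:

```c
#include <stdio.h>

enum { G_TEXT, G_COLOR, G_FONTSIZE };

typedef struct GChunk {
    int      type;   /* one of the G_* constants */
    int      reset;  /* non-zero for a reset chunk */
    unsigned value;  /* 0xRRGGBB for G_COLOR, size for G_FONTSIZE */
} GChunk;

/* Write one <font ...> tag covering the run of font chunks that starts
 * at chunks[0]. Returns the number of chunks consumed (0 if chunks[0]
 * is not a font attribute), or -1 if buf is too small. */
static int group_font_run(const GChunk *chunks, int nb, char *buf, size_t size)
{
    size_t pos = 0;
    int i, n;

    if (size == 0)
        return -1;
    buf[0] = 0;
    for (i = 0; i < nb; i++) {
        const GChunk *c = &chunks[i];
        const char *open = i ? "" : "<font";  /* tag name only once */

        if (c->reset)
            break;
        if (c->type == G_COLOR)
            n = snprintf(buf + pos, size - pos, "%s color=\"#%06x\"",
                         open, c->value);
        else if (c->type == G_FONTSIZE)
            n = snprintf(buf + pos, size - pos, "%s size=\"%u\"",
                         open, c->value);
        else
            break;
        if (n < 0 || (size_t)n >= size - pos)
            return -1;
        pos += (size_t)n;
    }
    if (i > 0) {
        if (size - pos < 2)
            return -1;
        buf[pos++] = '>';
        buf[pos]   = 0;
    }
    return i;
}
```

The encoder would then have to emit a single </font> for the whole group
when the corresponding resets show up, which this sketch does not cover.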

> Anyway, I still have various integration problems because of the API and
> ABI constraints we have, but I think we can do something with this stuff.
> 
> Comments?

It looks very good on the whole. Thanks for having worked on it.

What happened to the project of getting avcodec_{en,de}code_subtitle() to
work with an AVPacket, and replacing AVSubtitle with another structure that
is not user-allocated?

Regards,

-- 
  Nicolas George