[FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix handling of backslashes

Soft Works softworkz at hotmail.com
Fri Feb 4 08:48:40 EET 2022



> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Soft
> Works
> Sent: Friday, February 4, 2022 6:34 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> 
> 
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> Soft
> > Works
> > Sent: Friday, February 4, 2022 2:58 AM
> > To: FFmpeg development discussions and patches <ffmpeg-
> > devel at ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}:
> fix
> > handling of backslashes
> >
> >
> >
> > > -----Original Message-----
> > > From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> > > Oneric
> > > Sent: Friday, February 4, 2022 2:01 AM
> > > To: FFmpeg development discussions and patches <ffmpeg-
> > > devel at ffmpeg.org>
> > > Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}:
> > fix
> > > handling of backslashes
> > >
> > > On Thu, Feb 03, 2022 at 20:51:16 +0000, Soft Works wrote:
> > > > I think when you inject that word-joiner as a workaround for ass
> > > > parsing, you'll also need to make sure that it gets removed
> > > > when encoding to other formats.
> > >
> > > There's no way of knowing whether the word-joiner comes from
> > > a conversion performed by ffmpeg in the past or already existed
> > > in the original source.
> >
> > That might be true, but I think it's valid to say that such
> characters
> > are very unusual "original" subtitle sources and that's why I don't
> > think it's a good idea for ffmpeg to start injecting them.
> >
> > Subtitle implementations are often rather minimal, especially in
> > hardware devices and might not always cover the full range of
> > UTF-8 specifics.
> >
> > > However, the wordjoiner does not alter the visually appearance and
> > > is unlikely to change line-breaking properties; that's why I chose
> > > a word-joiner. Therefore I don't think removing (only) the
> inserted
> > > word-joiners is possible,
> >
> > Why not? As it seems to be required for ASS encoding only, all other
> > output formats should remain unaffected.
> >
> > > but also not necessary.
> >
> > I'm not sure whether all ffmpeg text-sub encoders can handle
> > those chars - which could be verified of course.
> >
> > But what remains is the question about the effect on end devices
> > which are consuming that output.
> >
> > Finally, those chars are a pest. I'm using them myself for a
> > specific use case, but when you don't know they are there, it can
> > drive you totally mad, eventually even thinking your system or
> > software is faulty.
> >
> > Example:
> >
> > Open your patch file [2/2] and search for the string
> > "123456\NAscending". You can see the string in two lines, but search
> > will only find one of them.
> >
> > Or just look at the two lines directly. They are preceded by + and -
> > even though both appear identical.
> >
> >
> > So, this also needs consideration of the consequences, like how
> > many developers (inside and outside of ffmpeg) this would be driving
> > nuts over the years and make them start hating ffmpeg for doing so
> > once they've found out.
> 
> As I really hate how many devs on this ML keep saying 'no' to
> submitted
> code without having a better suggestion, assuming that this is all
> that
> it takes, I don't want to assimilate in this regard.
> 
> Hence I want to propose the following solution:
> 
> First of all, the existing code in ff_ass_bprint_text_event() is
> totally
> wrong already. Not only with regard to the backslash escaping (as you
> have already pointed out), but also the curly brace escaping is
> invalid.
> There is no curly-brace escaping in ASS either.
> 
> In fact it is impossible with ASS to display an opening curly brace
> followed
> by a closing curly brace at a subsequent position (each one alone may
> work
> depending on implementation).
> 
> If it was about ASS alone, we might just drop those braces, so we
> could
> at least avoid the text in-between from being hidden (when outputting
> ASS), but ASS is also the internal ("uncompressed/raw") subtitle
> format
> in ffmpeg that is used for conversion (and subtitle filtering).
> So it would be hard-to-sell when curly braces would get lost when
> converting from one text-sub format to another with none of them
> even being ASS.
> 
> What we need is to stop creating invalid ASS and at the same time
> ensure proper conversion of curly braces. How? We substitute them!
> 
> And still, UTF-8 can come to the rescue. There are two suitable
> candidates for that:
> 
> SMALL LEFT CURLY BRACKET (U+FE5B, Ps): ﹛
> SMALL RIGHT CURLY BRACKET (U+FE5C, Pe): ﹜
> FULLWIDTH LEFT CURLY BRACKET (U+FF5B, Ps): {
> FULLWIDTH RIGHT CURLY BRACKET (U+FF5D, Pe): }
> 
> Substitution of curly braces with one of those will prevent ASS from
> treating
> any possible subtitle content as override code.
> 
> What remains to be handled now is the backslash case. Now that we can
> be sure
> that we are never inside a sequence that ASS would consider an
> override code,
> only 3 cases are remaining where the backslash has a meaning in ASS
> dialog
> text:  '\n', '\N' and '\h'.
> 
> We can simply escape those sequences by inserting a (no-op) override
> code
> between the backslash and the char. Suitable for this is: {\r}
> This code resets inline styles, but since we are coming from plain
> text subs
> in ff_ass_bprint_text_event(), we know that we don't have any inline
> styles
> and it's a no-op to reset the style.
> 
> Needless to say that we will of course change the substituted curly
> braces
> back to the regular ones at the encoding side for all but ASS.
> Remains the question what to do when encoding to ASS: We can either
> keep the alternate brace characters or just remove them (or maybe
> replace
> with square brackets).
> 
> I'm not sure about that last point, but in total, this will be a clean
> solution
> without injecting any weird chars into the subtitle output, and it
> will fix
> multiple incorrect behaviors in the current implementation.


I've found out where the \{ and \} escaping has come from: libass
They decided at some time to introduce this kind of escaping which
is actually incompatible with normal ASS syntax and libass specific:
https://github.com/libass/libass/issues/194#issuecomment-352213210


This doesn't mean though, that the ffmpeg internal ASS format needs
to follow the libass route in this regard. It only matters for the
libass output encoder, because \{\r}N is broken by that libass
decision, so for this case, we'll need a different way.

sw





More information about the ffmpeg-devel mailing list