[FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix handling of backslashes

Fri Feb 4 07:34:25 EET 2022

> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Soft
> Works
> Sent: Friday, February 4, 2022 2:58 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> 
> 
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> > Oneric
> > Sent: Friday, February 4, 2022 2:01 AM
> > To: FFmpeg development discussions and patches <ffmpeg-
> > devel at ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}:
> fix
> > handling of backslashes
> >
> > On Thu, Feb 03, 2022 at 20:51:16 +0000, Soft Works wrote:
> > > I think when you inject that word-joiner as a workaround for ass
> > > parsing, you'll also need to make sure that it gets removed
> > > when encoding to other formats.
> >
> > There's no way of knowing whether the word-joiner comes from
> > a conversion performed by ffmpeg in the past or already existed
> > in the original source.
> 
> That might be true, but I think it's valid to say that such characters
> are very unusual "original" subtitle sources and that's why I don't
> think it's a good idea for ffmpeg to start injecting them.
> 
> Subtitle implementations are often rather minimal, especially in
> hardware devices and might not always cover the full range of
> UTF-8 specifics.
> 
> > However, the wordjoiner does not alter the visually appearance and
> > is unlikely to change line-breaking properties; that's why I chose
> > a word-joiner. Therefore I don't think removing (only) the inserted
> > word-joiners is possible,
> 
> Why not? As it seems to be required for ASS encoding only, all other
> output formats should remain unaffected.
> 
> > but also not necessary.
> 
> I'm not sure whether all ffmpeg text-sub encoders can handle
> those chars - which could be verified of course.
> 
> But what remains is the question about the effect on end devices
> which are consuming that output.
> 
> Finally, those chars are a pest. I'm using them myself for a
> specific use case, but when you don't know they are there, it can
> drive you totally mad, eventually even thinking your system or
> software is faulty.
> 
> Example:
> 
> Open your patch file [2/2] and search for the string
> "123456\NAscending". You can see the string in two lines, but search
> will only find one of them.
> 
> Or just look at the two lines directly. They are preceded by + and -
> even though both appear identical.
> 
> 
> So, this also needs consideration of the consequences, like how
> many developers (inside and outside of ffmpeg) this would be driving
> nuts over the years and make them start hating ffmpeg for doing so
> once they've found out.

As I really hate how many devs on this ML keep saying 'no' to submitted
code without having a better suggestion, assuming that this is all that
it takes, I don't want to assimilate in this regard.

Hence I want to propose the following solution:

First of all, the existing code in ff_ass_bprint_text_event() is totally
wrong already. Not only with regard to the backslash escaping (as you 
have already pointed out), but also the curly brace escaping is invalid.
There is no curly-brace escaping in ASS either. 

In fact it is impossible with ASS to display an opening curly brace followed
by a closing curly brace at a subsequent position (each one alone may work
depending on implementation).

If it was about ASS alone, we might just drop those braces, so we could
at least avoid the text in-between from being hidden (when outputting 
ASS), but ASS is also the internal ("uncompressed/raw") subtitle format
in ffmpeg that is used for conversion (and subtitle filtering).
So it would be hard-to-sell when curly braces would get lost when 
converting from one text-sub format to another with none of them 
even being ASS.

What we need is to stop creating invalid ASS and at the same time 
ensure proper conversion of curly braces. How? We substitute them!

And still, UTF-8 can come to the rescue. There are two suitable 
candidates for that:

SMALL LEFT CURLY BRACKET (U+FE5B, Ps): ﹛
SMALL RIGHT CURLY BRACKET (U+FE5C, Pe): ﹜
FULLWIDTH LEFT CURLY BRACKET (U+FF5B, Ps): ｛
FULLWIDTH RIGHT CURLY BRACKET (U+FF5D, Pe): ｝

Substitution of curly braces with one of those will prevent ASS from treating
any possible subtitle content as override code.

What remains to be handled now is the backslash case. Now that we can be sure
that we are never inside a sequence that ASS would consider an override code,
only 3 cases are remaining where the backslash has a meaning in ASS dialog 
text:  '\n', '\N' and '\h'.

We can simply escape those sequences by inserting a (no-op) override code
between the backslash and the char. Suitable for this is: {\r}
This code resets inline styles, but since we are coming from plain text subs
in ff_ass_bprint_text_event(), we know that we don't have any inline styles
and it's a no-op to reset the style.

Needless to say that we will of course change the substituted curly braces
back to the regular ones at the encoding side for all but ASS.
Remains the question what to do when encoding to ASS: We can either 
keep the alternate brace characters or just remove them (or maybe replace
with square brackets). 

I'm not sure about that last point, but in total, this will be a clean solution
without injecting any weird chars into the subtitle output, and it will fix
multiple incorrect behaviors in the current implementation.

Regards,
softworkz