[FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix handling of backslashes

Oneric oneric at oneric.de
Sat Feb 5 03:20:16 EET 2022


On Fri, Feb 04, 2022 at 23:24:58 +0000, Soft Works wrote:
> You want to "pollute" gazillions of subtitle streams in the
> world from multiple subtitle formats with invisible
> characters in order to solve an escaping problem in ffmpeg?

I do not consider using characters whose use Unicode explicitly recommends
to be “polluting”. Also consider that, as mentioned, invisible characters
are already not uncommon in ASS, and conversions from ASS to something else
are rare because they are generally lossy. Lossy with regard to
typesetting, that is; removing breaking hints in the form of plain Unicode
characters would be a new form of lossiness.

> [From the other mail:]
> I'm not into changing ffmpeg's ass output, it's all
> about the internally used ass format and the escaping is
> a central problem there.

I’m not interested in reworking ffmpeg’s internal subtitle handling.
The proposed patch is a clear improvement over the status quo, which
is plainly incorrect. Given sound arguments, adjustments to the patch can
be made within reasonable effort; reworking ffmpeg internals is imo not
“reasonable” effort to correct an undisputedly wrong escape.

You have two options:
Either finally tell me what I asked about: where (as in which file and
function) removing word joiners should even happen, and where any
lingering “\\ → \” conversions presumably are. If it’s simple enough, I can
add a removal accompanied by a comment pointing out that this can go wrong.
Or go ahead and create your own patch.
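
To make the question concrete, here is a rough Python sketch of the escape
under discussion and of why a removal "can go wrong". It assumes the escape
works by appending U+2060 WORD JOINER after each backslash so that an ASS
renderer does not read the backslash as the start of an escape sequence. The
function names are made up for illustration; this is not the patch's or
ffmpeg's actual code.

```python
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER: invisible, zero width


def escape_backslashes(text: str) -> str:
    """Append a word joiner to every backslash (the assumed escape scheme)."""
    return text.replace("\\", "\\" + WORD_JOINER)


def strip_escape_joiners(text: str) -> str:
    """Remove only word joiners that directly follow a backslash.

    If the text did not come from escape_backslashes(), a joiner that the
    subtitle author deliberately placed after a backslash is lost here.
    """
    return text.replace("\\" + WORD_JOINER, "\\")


def strip_all_joiners(text: str) -> str:
    """Naive removal: also drops word joiners used as no-break hints."""
    return text.replace(WORD_JOINER, "")


# The joiner between "5" and "km" is a genuine no-break hint from the author.
plain = "C:\\new 5" + WORD_JOINER + "km"
escaped = escape_backslashes(plain)
```

Stripping only the joiners that follow a backslash restores `plain` from
`escaped`, while stripping every joiner also deletes the author's own hint.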

~~~~~~

> > > I'm not sure whether all ffmpeg text-sub encoders can handle
> > > those chars - which could be verified of course.
> >
> > Since it's in the BMP and ffmpeg already seems happy to assume some
> > UTF-8 support by converting everything to it, I'm not worried about
> > this until proven wrong.
>
> Proven wrong: https://github.com/libass/libass/issues/507

This issue is not at all word-joiner-specific, despite the name.
As far as I recall, this never led to wrong rendering.
With HarfBuzz, the only fully featured shaping backend of libass,
control characters were and are handled by HarfBuzz.
And even with FriBiDi, U+2060 has been ignored since 2012, long before
the linked issue was opened.

What that issue really is about is a combination of two more general
issues. First, libass currently does not cache failed glyph lookups, which
leads to repeated messages and, at worst, a perf degradation if no font in
the font pool contains a glyph for a particular character. Second, the
realisation that libass’ font-fallback strategy is not ideal for
prefix-type control characters, characters which visibly affect both
neighbours, and a few others.
The word joiner is only highlighted here because, due to its usage as a
backslash escape, it is commonly passed to libass, and a high enough
percentage of fonts don’t contain it to generate reports about it.
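
The first of those two issues, not caching a failed lookup, can be
illustrated with a toy model. This is a minimal sketch in Python and not
libass code: the font pool is a hypothetical list of (name, covered code
points) pairs, and `FontPool`, `find_font` and `scans` are names invented for
the example.

```python
class FontPool:
    """Toy font pool: each font is a name plus the set of code points it covers."""

    def __init__(self, fonts, cache_failures):
        self.fonts = fonts                    # list of (name, set of code points)
        self.cache_failures = cache_failures
        self.cache = {}                       # code point -> font name or None
        self.scans = 0                        # number of full pool scans performed

    def find_font(self, codepoint):
        if codepoint in self.cache:
            return self.cache[codepoint]
        self.scans += 1
        found = next((name for name, cov in self.fonts if codepoint in cov), None)
        # In this toy model, hits are always cached; misses only if requested.
        if found is not None or self.cache_failures:
            self.cache[codepoint] = found
        return found


fonts = [("FontA", {0x41, 0x42}), ("FontB", {0x43})]
without = FontPool(fonts, cache_failures=False)
with_cache = FontPool(fonts, cache_failures=True)
for _ in range(1000):
    without.find_font(0x2060)      # no font covers U+2060: rescans every time
    with_cache.find_font(0x2060)   # scans once, then answers from the cache
```

Without the negative cache, every occurrence of an uncovered character costs
a full scan of the pool (and, in a real renderer, possibly another warning).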


For further reference: U+2060 was added in Unicode 3.2, released in 2002.
If you want to strip it because it might not render correctly, you should
also strip most emoji, the uppercase eszett ẞ, and several actively
used writing systems in their entirety.
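
This can be checked against Python's bundled Unicode data, which also ships
a frozen Unicode 3.2.0 database as `unicodedata.ucd_3_2_0`. The snippet is
only an illustration; the helper name `assigned_in_3_2` is made up, and the
sample emoji is an arbitrary choice.

```python
import unicodedata

old = unicodedata.ucd_3_2_0           # Unicode 3.2.0 snapshot shipped with Python


def assigned_in_3_2(ch: str) -> bool:
    """True if the code point was already assigned in Unicode 3.2.0."""
    return old.category(ch) != "Cn"   # "Cn" means unassigned


word_joiner = "\u2060"
capital_eszett = "\u1e9e"             # ẞ
emoji = "\U0001F600"                  # an arbitrary emoji

for ch in (word_joiner, capital_eszett, emoji):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"assigned in 3.2 = {assigned_in_3_2(ch)}, "
          f"UTF-8 = {ch.encode('utf-8').hex()}")
```

Of the three, only the word joiner is already assigned in the 3.2.0
database; it also sits in the BMP and takes three bytes in UTF-8.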

