[FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix handling of backslashes

Oneric oneric at oneric.de
Fri Feb 4 23:52:08 EET 2022


On Fri, Feb 04, 2022 at 02:30:37 +0100, Andreas Rheinhardt wrote:
> All text-based subtitles are supposed to be UTF-8 when they reach the
> decoder; if it isn't, the user has to set the appropriate -sub_charenc
> and -sub_charenc_mode.
> 
> - Andreas

Thanks for the info! Then at least the UTF-8 assumption
is no problem after all.


On Fri, Feb 04, 2022 at 01:57:48 +0000, Soft Works wrote:
> > There's no way of knowing whether the word-joiner comes from
> > a conversion performed by ffmpeg in the past or already existed
> > in the original source.
> 
> That might be true, but I think it's valid to say that such characters
> are very unusual "original" subtitle sources and that's why I don't
> think it's a good idea for ffmpeg to start injecting them.

Don't underestimate what subtitle authors can come up with :)

> Subtitle implementations are often rather minimal, especially in
> hardware devices and might not always cover the full range of 
> UTF-8 specifics.

The word joiner (U+2060) lies in the Basic Multilingual Plane, so even ancient
UTF-8 implementations that assume all Unicode codepoints fit in 16 bits
(i.e. at most 3 bytes per codepoint in UTF-8) will be able to understand it.
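
To illustrate the size argument, here is a minimal sketch of a UTF-8 encoder
restricted to the BMP. It is not FFmpeg code; the helper name
utf8_encode_bmp is made up for this example. For U+2060 it produces the
three bytes E2 81 A0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an FFmpeg function): encode one BMP code point
 * (U+0000..U+FFFF, surrogates excluded) as UTF-8.
 * Returns the number of bytes written to out (1..3), or 0 on invalid input. */
static size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* outside the BMP, or a surrogate */
    if (cp < 0x80) {                    /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx, which covers the rest of the BMP */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

A decoder limited to sequences of up to three bytes therefore still covers
the word joiner; only codepoints beyond U+FFFF need a fourth byte.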

> > However, the wordjoiner does not alter the visually appearance and
> > is unlikely to change line-breaking properties; that's why I chose
> > a word-joiner. Therefore I don't think removing (only) the inserted
> > word-joiners is possible,
> 
> Why not? As it seems to be required for ASS encoding only, all other
> output formats should remain unaffected. 

Because, as written before, it can exist in the original source.
Unicode recommends using the word joiner e.g. to prevent line breaks
between two characters without the additional side effects that e.g.
the combining grapheme joiner would cause.

> > but also not necessary.
> 
> I'm not sure whether all ffmpeg text-sub encoders can handle 
> those chars - which could be verified of course.

Since it's in the BMP and ffmpeg already seems happy to assume some UTF-8 
support by converting everything to it, I'm not worried about this until
proven wrong.


> Finally, those chars are a pest. I'm using them myself for a 
> specific use case, but when you don't know they are there, it can
> drive you totally mad, eventually even thinking your system or
> software is faulty.
> 
> Example: 
> 
> Open your patch file [2/2] and search for the string
> "123456\NAscending". You can see the string in two lines, but search
> will only find one of them.
> 
> Or just look at the two lines directly. They are preceded by + and -
> even though both appear identical. 

Actually, this is what I see (the helpful colouring is lost here):

  -Dialogue: 0,0:00:55.00,0:01:00.00,Default,,0,0,0,,Descending: 123456\NAscending: 123456^M
  +Dialogue: 0,0:00:55.00,0:01:00.00,Default,,0,0,0,,Descending: <200f>123456<200e>\NAscending: 123456^M

More plain-text oriented editors likely won't show them though, yes.
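
For anyone whose editor hides them, a rough sketch of how such characters
could be made visible follows. It only knows about U+200E, U+200F and
U+2060, assumes the input is valid UTF-8, and show_invisible is a
hypothetical helper written for this mail, not something that exists in
FFmpeg.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copy src to dst, replacing the invisible code points
 * U+200E (LRM), U+200F (RLM) and U+2060 (word joiner) with "<xxxx>".
 * Assumes src is valid UTF-8.
 * Returns 0 on success, -1 if dst is too small. */
static int show_invisible(const char *src, char *dst, size_t dst_size)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s) {
        unsigned cp = 0;

        /* All three code points lie in U+2000..U+2FFF, i.e. E2 xx yy. */
        if (s[0] == 0xE2 && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80)
            cp = 0x2000u | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);

        if (cp == 0x200E || cp == 0x200F || cp == 0x2060) {
            if (dst_size - n < 7)       /* "<xxxx>" plus terminating NUL */
                return -1;
            n += (size_t)snprintf(dst + n, dst_size - n, "<%04x>", cp);
            s += 3;
        } else {
            if (dst_size - n < 2)       /* one byte plus terminating NUL */
                return -1;
            dst[n++] = (char)*s++;      /* copy every other byte unchanged */
        }
    }
    if (n >= dst_size)                  /* only possible if dst_size == 0 */
        return -1;
    dst[n] = '\0';
    return 0;
}
```

Running the changed Dialogue text through this gives the same <200f> and
<200e> markers as in the diff above, which also explains why a plain search
for "123456\NAscending" only matches one of the two lines.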

On this topic: finding raw bidi marks in ASS subtitles for RTL languages
is not that unusual, to give an example of "invisible characters"
being used manually in the original source
(among other reasons, because VSFilters, and libass in the interest of
compatibility, assume LTR by default).

Even if I thought removing all word joiners when converting from ASS
was a good idea, I still wouldn't know where to do this (or where to
look to remove any lingering attempts to recollapse \\ into \).

