[FFmpeg-devel] [PATCH] movtextdec: fix handling of UTF-8 subtitles
wm4
nfxjfg at googlemail.com
Sat Mar 24 18:59:38 EET 2018
On Sat, 24 Mar 2018 17:50:53 +0100
Carl Eugen Hoyos <ceffmpeg at gmail.com> wrote:
> 2018-03-24 17:42 GMT+01:00, wm4 <nfxjfg at googlemail.com>:
> > On Sat, 24 Mar 2018 17:05:41 +0100
> > Carl Eugen Hoyos <ceffmpeg at gmail.com> wrote:
> >
> >> 2018-03-24 15:48 GMT+01:00, wm4 <nfxjfg at googlemail.com>:
> >> > Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> >> > ASCII characters) were not handled correctly. The spec mandates that
> >> > styling start/end ranges are in "characters". It's not quite clear what
> >> > a "character" is supposed to be, but maybe they mean unicode codepoints.
> >> >
> >> > FFmpeg's decoder treated the style ranges as byte idexes, which could
> >> > lead to UTF-8 sequences being broken, and the common code dropping the
> >> > whole subtitle line.
> >> >
> >> > Change this and count the codepoint instead. This also means that even
> >> > if this is somehow wrong, the decoder won't break UTF-8 sequences
> >> > anymore. The sample which led me to investigate this now appears to work
> >> > correctly.
> >>
> >> Could you confirm that this is also what QT does?
> >
> > I can't test with QT. VLC seems to behave like with this patch applied.
> >
> >> Or is it impossible that the patch breaks something?
> >
> > Could probably break movtext subtitles generated by ffmpeg (I didn't
> > fix the movtext encoder, and it seems to have the same bug). But these
> > will most likely be broken on other players too. Tough the worst case
> > is just that the styles get shifted.
>
> Thank you.
>
> >> [...]
> >>
> >> > @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char
> >> > *text, const char *text_end,
> >> > }
> >> > }
> >> >
> >> > - switch (*text) {
> >> > - case '\r':
> >> > - break;
> >> > - case '\n':
> >> > - av_bprintf(buf, "\\N");
> >> > - break;
> >> > - default:
> >> > - av_bprint_chars(buf, *text, 1);
> >> > - break;
> >> > + len = get_utf8_length_at(text, text_end);
> >> > + if (len < 1) {
> >> > + av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in
> >> > subtitle\n");
> >> > + len = 1;
> >> > + }
> >> > + for (i = 0; i < len; i++) {
> >>
> >> > + switch (*text) {
> >> > + case '\r':
> >> > + break;
> >> > + case '\n':
> >> > + av_bprintf(buf, "\\N");
> >> > + break;
> >> > + default:
> >> > + av_bprint_chars(buf, *text, 1);
> >> > + break;
> >>
> >> Imo, the reindentation is not ok but this isn't my code.
> >
> > Why not?
>
> Because the patch is much easier to read without it.
git repo viewers can show commits without whitespaces, so I don't think
it matters anymore for this patch.
More information about the ffmpeg-devel
mailing list