[FFmpeg-devel] [PATCH] movtextdec: fix handling of UTF-8 subtitles

wm4 nfxjfg at googlemail.com
Sat Mar 24 18:59:38 EET 2018


On Sat, 24 Mar 2018 17:50:53 +0100
Carl Eugen Hoyos <ceffmpeg at gmail.com> wrote:

> 2018-03-24 17:42 GMT+01:00, wm4 <nfxjfg at googlemail.com>:
> > On Sat, 24 Mar 2018 17:05:41 +0100
> > Carl Eugen Hoyos <ceffmpeg at gmail.com> wrote:
> >  
> >> 2018-03-24 15:48 GMT+01:00, wm4 <nfxjfg at googlemail.com>:  
> >> > Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> >> > ASCII characters) were not handled correctly. The spec mandates that
> >> > styling start/end ranges are in "characters". It's not quite clear what
> >> > a "character" is supposed to be, but maybe they mean unicode codepoints.
> >> >
> >> > FFmpeg's decoder treated the style ranges as byte idexes, which could
> >> > lead to UTF-8 sequences being broken, and the common code dropping the
> >> > whole subtitle line.
> >> >
> >> > Change this and count the codepoint instead. This also means that even
> >> > if this is somehow wrong, the decoder won't break UTF-8 sequences
> >> > anymore. The sample which led me to investigate this now appears to work
> >> > correctly.  
> >>
> >> Could you confirm that this is also what QT does?  
> >
> > I can't test with QT. VLC seems to behave like with this patch applied.
> >  
> >> Or is it impossible that the patch breaks something?  
> >
> > Could probably break movtext subtitles generated by ffmpeg (I didn't
> > fix the movtext encoder, and it seems to have the same bug). But these
> > will most likely be broken on other players too. Tough the worst case
> > is just that the styles get shifted.  
> 
> Thank you.
> 
> >> [...]
> >>  
> >> > @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char
> >> > *text, const char *text_end,
> >> >              }
> >> >          }
> >> >
> >> > -        switch (*text) {
> >> > -        case '\r':
> >> > -            break;
> >> > -        case '\n':
> >> > -            av_bprintf(buf, "\\N");
> >> > -            break;
> >> > -        default:
> >> > -            av_bprint_chars(buf, *text, 1);
> >> > -            break;
> >> > +        len = get_utf8_length_at(text, text_end);
> >> > +        if (len < 1) {
> >> > +            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in
> >> > subtitle\n");
> >> > +            len = 1;
> >> > +        }
> >> > +        for (i = 0; i < len; i++) {  
> >>  
> >> > +            switch (*text) {
> >> > +            case '\r':
> >> > +                break;
> >> > +            case '\n':
> >> > +                av_bprintf(buf, "\\N");
> >> > +                break;
> >> > +            default:
> >> > +                av_bprint_chars(buf, *text, 1);
> >> > +                break;  
> >>
> >> Imo, the reindentation is not ok but this isn't my code.  
> >
> > Why not?  
> 
> Because the patch is much easier to read without it.

git repo viewers can show commits without whitespaces, so I don't think
it matters anymore for this patch.


More information about the ffmpeg-devel mailing list