[FFmpeg-devel] [PATCH] lavc: support subtitles charset conversion.

Thu Jan 10 23:20:06 CET 2013

Hi,

Michael Niedermayer wrote:
> On Sat, Jan 05, 2013 at 12:54:37PM +0100, Nicolas George wrote:
> > On quintidi 15 Nivôse, year CCXXI, Clement Boesch wrote:
> > > I fail to see how it is more elegant; the codec properties sound like the
> > > best place to declare such generalities.
> > 
> > It is hard to put elegance considerations into words. Looking at the various
> > existing CODEC_CAP, I find they are usually more universal and/or more
> > relevant to the API user, although I realize there are already exceptions.
> > 
> > >					   Using the context structure is
> > > only a necessity if we need on the fly changes, which don't sound common
> > > at all. And if we find such insanity, I'd suggest to fix that mess in the
> > > decoder or the demuxer itself.
> > 
> > What about this, that I thought of this morning:
> > 
> > Sometimes, the recoding will perforce be done by the demuxer. At other
> > times, it will be done by lavc. In any case, the original encoding should be
> > exposed to the API caller, so that this:
> > 
> > ffmpeg -ss 5 -i file.fmt [ -sub_charenc copy ] shifted_file.fmt
> > 
> > can work. And for convenience and compatibility reasons, it is probably
> > best if the original encoding is exported in the same field.
> > 
> > Thus my proposal with sub_charenc_mode and the first component that decides
> > it can do the work sets it. That would work like that:
> > 
> > 1. If the demuxer knows the character encoding, it sets sub_charenc.
> > 2. If the demuxer does the recoding, then it sets sub_charenc_mode to DONE,
> >    otherwise it leaves it to its default 0.
> > 3. If mode is still 0, the codec init function sets it to either PRE, POST
> >    or INTERNAL depending on its need.
> > 4. If mode is still 0 after codec init and a character encoding is set, lavc
> >    reports an error.
> 
> All this pre, post, ... stuff sounds rather messy.
> Where is the problem of simply having every public function
> communicate with the "outside" through UTF-8 unicode ?

  +1

  Sorry to bump into the discussion a little late. I agree with Michael
that the pre, post, etc. stuff sounds like a bad idea.

  I will only talk about character set conversion of text subtitles at
the lavc level, or more precisely, conversion done before decoding. I
believe this problem can be treated in isolation, without taking into
account where else the subtitles might be recoded.

  One valid question is whether we need this at all. I think the answer
is yes because i) some demuxers might output subtitle packets that are
not UTF-8 and ii) lavc may be used without lavf.

  I want to propose a different system:

  We add an AVCodecContext field named sub_charenc that identifies the
input character encoding of the text subtitles. It is set by the demuxer
or, when lavc is used without lavf, by the lavc user.

  I imagine things to simply work like this:

  decode_subtitle()
  {
    if (codec is a text subtitle codec && sub_charenc != UTF-8) {
      recode packet from sub_charenc to UTF-8
    }

    // decode (as usual)
  }
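
  A slightly more concrete sketch of the recoding step, assuming POSIX
iconv() does the actual conversion. recode_to_utf8() and
prepare_subtitle_text() are made-up names for illustration, not existing
lavc API, and the charset is passed as a plain string here instead of
being read from an AVCodecContext field:

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* Hypothetical helper, not lavc API: recode in[0..in_size) from the
 * charset named by sub_charenc to UTF-8. Returns a malloc()ed,
 * NUL-terminated buffer (caller frees), or NULL on failure. */
static char *recode_to_utf8(const char *sub_charenc,
                            const char *in, size_t in_size,
                            size_t *out_size)
{
    iconv_t cd = iconv_open("UTF-8", sub_charenc);
    if (cd == (iconv_t)-1)
        return NULL;                      /* unknown charset */

    size_t cap = in_size * 2 + 16, used = 0;
    char *out = malloc(cap);
    char *src = (char *)in;               /* iconv() takes a non-const pointer */
    size_t src_left = in_size;

    if (!out) {
        iconv_close(cd);
        return NULL;
    }

    while (src_left > 0) {
        char *dst = out + used;
        size_t dst_left = cap - used - 1; /* keep one byte for the NUL */
        size_t ret = iconv(cd, &src, &src_left, &dst, &dst_left);

        used = (size_t)(dst - out);
        if (ret != (size_t)-1)
            break;                        /* everything was converted */
        if (errno != E2BIG) {             /* invalid or truncated input */
            free(out);
            iconv_close(cd);
            return NULL;
        }
        cap *= 2;                         /* output buffer too small: grow */
        char *tmp = realloc(out, cap);
        if (!tmp) {
            free(out);
            iconv_close(cd);
            return NULL;
        }
        out = tmp;
    }

    iconv_close(cd);
    out[used] = '\0';
    *out_size = used;
    return out;
}

/* Mirrors the pseudocode: recode only if the input is not already UTF-8.
 * The result is what the decoder would then see "as usual". */
static char *prepare_subtitle_text(const char *sub_charenc,
                                   const char *pkt, size_t size,
                                   size_t *out_size)
{
    if (sub_charenc && strcasecmp(sub_charenc, "UTF-8") != 0)
        return recode_to_utf8(sub_charenc, pkt, size, out_size);

    /* Already UTF-8 (or unspecified): pass the bytes through unchanged. */
    char *copy = malloc(size + 1);
    if (!copy)
        return NULL;
    memcpy(copy, pkt, size);
    copy[size] = '\0';
    *out_size = size;
    return copy;
}
```

  With this, a Latin-1 packet containing "caf\xe9" would reach the
decoder as the five UTF-8 bytes "caf\xc3\xa9", while a packet already
marked as UTF-8 is passed through untouched.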

  I am eagerly awaiting comments and flames :)

  Please comment *only* on this problem as narrowed down in the
preamble.

  I would try to implement this on top of Clement's work. But as
this is mostly a design issue I thought it would be best to wait
for feedback first.

  IMHO the next important problem is to fix some of the text
subtitle pseudo-demuxers that assume that the input is encoded
in an ASCII-like encoding like UTF-8. Currently I believe the
solution is to share some common routines that read the stream
and recode it to UTF-8. This would avoid more serious
rewriting of affected demuxers.
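
  As one illustration of what such a shared routine might contain, here
is a sketch that sniffs a byte-order mark so that a demuxer can notice
that its input is not ASCII-like before it starts parsing.
detect_bom_charset() is a made-up name, BOM sniffing is only my
assumption about one possible ingredient, and the actual recoding would
still have to follow:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if buf starts with a byte-order mark, return an
 * iconv-style charset name and store the BOM length in *bom_len.
 * Returns NULL (and *bom_len == 0) when no BOM is found.
 * UTF-32 is deliberately ignored in this sketch, so a UTF-32LE BOM
 * would be reported as UTF-16LE. */
static const char *detect_bom_charset(const unsigned char *buf, size_t size,
                                      size_t *bom_len)
{
    *bom_len = 0;

    if (size >= 3 && !memcmp(buf, "\xEF\xBB\xBF", 3)) {
        *bom_len = 3;
        return "UTF-8";
    }
    if (size >= 2 && !memcmp(buf, "\xFF\xFE", 2)) {
        *bom_len = 2;
        return "UTF-16LE";
    }
    if (size >= 2 && !memcmp(buf, "\xFE\xFF", 2)) {
        *bom_len = 2;
        return "UTF-16BE";
    }
    return NULL;
}
```

  A demuxer could call this on the first bytes of the file, skip bom_len
bytes, and recode the rest to UTF-8 whenever the result is a UTF-16
variant. Files without a BOM would still need the charset from
elsewhere, for example from the user.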

[...]

  Alexander