[FFmpeg-devel] [PATCH] libavcodec: Do not return encoding errors when -sub_charenc_mode is do_nothing

Fri Aug 30 21:56:10 CEST 2013

On Fri, Aug 30, 2013 at 09:35:33PM +0200, Nicolas George wrote:
> > More importantly, why should we _hinder_ applications from doing
> > their own conversion, in their own way?
> > As far as I can see all that's requested is to allow applications
> > to do their own. We do not force applications to use libavformat
> > to use libavcodec, we do not force them to use libswscale in order
> > to be able to use libavcodec etc., why should they have to use
> > our charset conversion if they want to use libavcodec for subtitle
> > decoding?
> > You say it would result in double conversion after the changes you
> > plan, well ok, but can't you do those changes in a way that allows
> > applications to still just get the data out as it is, instead of forcing
> > them both to use our charset conversion and
> 
> They can. They already can, and I do not intend to change that. All they
> have to do is do it properly, which means (1) taking sub_charenc_mode into
> consideration and updating its value and (2) working on the packet payload
> and not on the decoded text.

I think that is problematic. The raw packet data will usually consist of
mixed data, for example English text from the "container" and some other
language (maybe even in a non-ASCII-compatible encoding?).
This makes detecting the encoding via a dictionary much more difficult.
The more extreme example would be SRT with comment lines containing the
original text in one encoding and the actual subtitles in a different
encoding.
Yes, I have not seen this in practice; however, I hope you at least
agree that you cannot do much with the packet-level data here unless
you re-implement half the subtitle decoder.
All this is ignoring practical considerations like applications that
simply have well-working encoding detection, possibly fine-tuned
to their specific region/users, and that don't want to spend time
and effort to make FFmpeg's work similarly well. Both what I said above
and simple architectural limitations are reasons not to apply
the charset detection code to the packets.
Also, for what you describe with "taking sub_charenc_mode into
consideration and updating its value", how does that allow one to
make FFmpeg _not_ do the charset conversion?
You might want that because FFmpeg does not know the charset you
need, its conversion is too slow for your tastes, or it doesn't
have some other feature you want (for example, if you know
the language you can easily support mixed latin1 and UTF-8
files, but I don't think we want to go there inside FFmpeg).
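
To make the mixed latin1/UTF-8 case concrete, here is a rough sketch of
what an application might do on the decoded text. It is Python rather
than FFmpeg code, decode_mixed() is a made-up name, and it assumes that
latin1 and UTF-8 are the only two encodings present and that latin1
lines in the language at hand rarely happen to form valid UTF-8:

```python
def decode_mixed(payload: bytes) -> str:
    """Decode line by line: UTF-8 where valid, Latin-1 otherwise.

    Assumption (not from FFmpeg): only these two encodings occur.
    """
    out = []
    for line in payload.split(b"\n"):
        try:
            # Strict UTF-8 decoding rejects most Latin-1 text that
            # contains bytes >= 0x80.
            out.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a code point, so this
            # fallback cannot fail.
            out.append(line.decode("latin-1"))
    return "\n".join(out)


# One UTF-8 line followed by one Latin-1 line in the same payload.
sample = "café".encode("utf-8") + b"\n" + "naïve".encode("latin-1")
print(decode_mixed(sample))
```

Decoding the whole payload with either single charset would give an
error or mojibake on one of the two lines, which is why this kind of
per-line heuristic needs the text as it is, not an already converted
version.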