[FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

Sat Dec 13 12:34:28 CET 2014

Le duodi 22 frimaire, an CCXXIII, Rodger Combs a écrit :
> This also moves general charenc conversion from avcodec to avformat;
> the version in avcodec is left, but renamed; I'm not sure if that's
> the optimal solution.
> 
> The documentation could probably use some improvements, and a few more
> options could be added to ENCA.
> 
> This very simply prefers libguess over ENCA, and ENCA over uchardet, but
> will fall back on a less-preferred guess if something decodes wrong, and will
> drop illegal sequences in iconv if all else fails.
> 
> It'd be possible to have ffmpeg.c present a UI if multiple guesses are
> returned, and other library consumers could do the same.

So, now that I have a decent connection and time, here are some comments:

First, your patch seems to happen after the text demuxers have parsed the
text files. Therefore, this can not work for non-ASCII-compatible encodings,
such as UTF-16. You might say that UTF-16 already works, but its
implementation is bogus and leads to user-visible problems (see trac ticket
#4059). But even if it was not, we would not want two competing detection
layers.

More importantly: the lavc API is ready to handle situations where the
recoding has been done by the demuxer. See the doxy for sub_charenc_mode and
the associated constants. So if you are discarding it or adding competing
fields, you are certainly missing something on the proper use of the API. Of
course, if you think the API is not actually ready, feel free to explain and
discuss your point.

Third point: detection is not something that works well, and people will
frequently find versions of FFmpeg built without their favourite library.
For both these reasons, applications using the library should be able to
provide their own detection mechanism to complement or even replace the ones
hardcoded in FFmpeg. Same goes for conversion, even if it is not as
important.

Fourth and last point: detecting text encoding is not useful only for text
subtitles formats, other features may need it: filter graph files (think of
drawtext options), ffmetadata files, etc.

Here is the API I am considering. I had started to implement it until
bickering and lack of enthusiasm discouraged me.

The work happens in lavu, and is therefore available everywhere, replacing
av_file_map() whenever it is used for text files. It is an API for reading
text files / buffers / streams, taking care of all the gory details. Text
encoding, of course, but also the LF / CRLF mess, possibly splitting lines
at the same time, maybe normalizing spaces, etc.

The text-file-read API is controlled with a context parameter, holding
amongst other things a list of "detection modules", and also "recoding
modules". Detection modules are just a structure with a callback. FFmpeg
provides built-in modules, such as your proposed libguess, libenca and
libuchardet code, but applications can also create their own modules.

Then it is just a matter of changing the subtitle-specific FFTextReader API
to use the new lavu text-file-read API.

I hope this helps.

Regards,

-- 
  Nicolas George
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 819 bytes
Desc: Digital signature
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20141213/c00b6717/attachment.asc>