[FFmpeg-devel] Discussion: Feature: Subtitle charenc detection

Thu Oct 23 13:56:54 CEST 2014

On duodi 2 Brumaire, year CCXXIII, Rodger Combs wrote:
> As mentioned in https://trac.ffmpeg.org/ticket/4054#comment:1

Let me quote for completeness:

11rcombs:
>>> Sometimes, especially when ffmpeg is being called programmatically, it is
>>> difficult or impossible for the caller (or user) to know the character
>>> encoding of a subtitle file. It'd be useful for libavformat to provide a
>>> mechanism to detect the encoding if an option is set, using some
>>> combination of universalchardet, enca, or libguess.

gjdfgh:
>> Some things to note:
>> * no subtitle charset detector is good/sufficient, and you will always
>>   have the situation in which you have multiple guesses, and you want the
>>   user to select which guess, etc.
>> * I think it's wrong to add detection directly to (or below) the subtitle
>>   demuxers - instead, maybe there should be a function to guess subtitle
>>   codec from a list of packets (you could provide a convenience function
>>   which does that using the libavformat internal packet queue)
>> * the actual subtitle conversion should be somewhere else too, and maybe
>>   work on the packets (or you could set it as sub charset option in
>>   libavcodec, forgot the option name) 
>> Also, this should probably be discussed on the mailing list. The bug
>> tracker sucks for this purpose.

> There are a lot of nuances to this, it'll require linking at least one
> (and possibly 3 or more) new dependencies, and it'll probably require at
> least some changes to existing subtitle decoders.

AFAIK, the problem only happens with stand-alone text subtitle files.
Formats that support muxed text subtitles usually specify the character
encoding.

For stand-alone text files, the best approach IMHO is to have an API to just
read text files, taking care of all annoying details (such as encoding, but
not only: line endings, BOM, etc.), and the symmetric API for writing.
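As a rough illustration of the "annoying details" such a reading API would
hide, here is a minimal sketch in C. All names (text_detect_bom,
text_normalize_newlines, enum text_bom) are hypothetical and are not part of
any existing FFmpeg API; the sketch only covers BOM detection and line-ending
normalization, not charset conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of this is an existing FFmpeg API. */
enum text_bom {
    TEXT_BOM_NONE,
    TEXT_BOM_UTF8,
    TEXT_BOM_UTF16LE,
    TEXT_BOM_UTF16BE
};

/* Identify a byte-order mark at the start of buf and store its length in
 * *bom_len. Simplification: a UTF-32LE BOM (FF FE 00 00) would be reported
 * as UTF-16LE here. */
static enum text_bom text_detect_bom(const unsigned char *buf, size_t len,
                                     size_t *bom_len)
{
    assert(buf && bom_len);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return TEXT_BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return TEXT_BOM_UTF16LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return TEXT_BOM_UTF16BE;
    }
    *bom_len = 0;
    return TEXT_BOM_NONE;
}

/* Rewrite CRLF and lone CR as LF, in place, and return the new length.
 * Only meaningful for ASCII-compatible 8-bit encodings such as UTF-8. */
static size_t text_normalize_newlines(char *buf, size_t len)
{
    size_t r, w = 0;

    for (r = 0; r < len; r++) {
        if (buf[r] == '\r') {
            if (r + 1 < len && buf[r + 1] == '\n')
                r++;            /* CRLF: consume the LF as well */
            buf[w++] = '\n';    /* emit a single LF either way */
        } else {
            buf[w++] = buf[r];
        }
    }
    return w;
}
```

A demuxer built on such helpers would skip bom_len bytes, convert to UTF-8
if needed, and then normalize line endings before parsing.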

The subtitle demuxers would only need to use that API, which is not very
difficult as they already all use common code to read entire files.

The API would also benefit other places in the code, such as the textfile
option of the drawtext filter.

I had a proposal some time ago, but it did not have all the promised bells
and whistles yet and got bogged down in so much bikeshedding that I had to
put it on hold indefinitely.

Concerning the specific issue of detecting the encoding, I believe a
pluggable API is best: even if FFmpeg is built with only the basic internal
heuristics, the application can provide support for
libomniscientcharsetguess.
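
One possible shape for such a pluggable detector, sketched in C under stated
assumptions: the callback type, struct and function names below are all
hypothetical, the built-in heuristic is reduced to a simplified UTF-8
validity check, and consulting the application's detector before the
built-in one is just one design choice among several.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: return a charset name, or NULL if undecided. */
typedef const char *(*charenc_detect_fn)(const unsigned char *buf,
                                         size_t len, void *opaque);

/* Hypothetical hook an application would fill in and hand to the library. */
struct charenc_detector {
    charenc_detect_fn detect;
    void *opaque;
};

/* Simplified UTF-8 check: validates lead/continuation byte structure only,
 * it does not reject every overlong form or surrogate code point. */
static int is_valid_utf8(const unsigned char *p, size_t len)
{
    size_t i = 0, k, n;

    while (i < len) {
        unsigned char c = p[i];
        if (c < 0x80)                    n = 0;
        else if (c >= 0xC2 && c <= 0xDF) n = 1;
        else if (c >= 0xE0 && c <= 0xEF) n = 2;
        else if (c >= 0xF0 && c <= 0xF4) n = 3;
        else return 0;                   /* invalid lead byte */
        if (n > len - i - 1)
            return 0;                    /* truncated sequence */
        for (k = 1; k <= n; k++)
            if ((p[i + k] & 0xC0) != 0x80)
                return 0;                /* bad continuation byte */
        i += n + 1;
    }
    return 1;
}

/* Stand-in for the "basic internal heuristics": UTF-8 or no opinion. */
static const char *builtin_detect(const unsigned char *buf, size_t len,
                                  void *opaque)
{
    (void)opaque;
    return is_valid_utf8(buf, len) ? "UTF-8" : NULL;
}

/* Ask the application's detector first, if any, then fall back. */
static const char *charenc_detect(const struct charenc_detector *app,
                                  const unsigned char *buf, size_t len)
{
    const char *name = NULL;

    assert(buf || len == 0);
    if (app && app->detect)
        name = app->detect(buf, len, app->opaque);
    if (!name)
        name = builtin_detect(buf, len, NULL);
    return name;
}

/* Example application detector: always answers with the name passed as
 * opaque. A real one would call an external detection library instead. */
static const char *fixed_answer_detect(const unsigned char *buf, size_t len,
                                       void *opaque)
{
    (void)buf;
    (void)len;
    return (const char *)opaque;
}
```

With no application detector, a Latin-1 buffer yields NULL (no guess);
with fixed_answer_detect plugged in, the application's answer is used.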

Regards,

-- 
  Nicolas George