[FFmpeg-devel] [PATCH] avformat: Implement subtitle charenc guessing

Rodger Combs rodger.combs at gmail.com
Sat Dec 13 13:55:55 CET 2014


> On Dec 13, 2014, at 05:34, Nicolas George <george at nsup.org> wrote:
> 
> So, now that I have a decent connection and time, here are some comments:
> 
> First, your patch seems to happen after the text demuxers have parsed the
> text files. Therefore, this can not work for non-ASCII-compatible encodings,
> such as UTF-16. You might say that UTF-16 already works, but its
> implementation is bogus and leads to user-visible problems (see trac ticket
> #4059). But even if it was not, we would not want two competing detection
> layers.
> 

Agreed, a single layer would be preferable.

> More importantly: the lavc API is ready to handle situations where the
> recoding has been done by the demuxer. See the doxy for sub_charenc_mode and
> the associated constants. So if you are discarding it or adding competing
> fields, you are certainly missing something on the proper use of the API. Of
> course, if you think the API is not actually ready, feel free to explain and
> discuss your point.
> 

I couldn't see a sensible way to do this in lavc, since the detector libraries generally require more than one packet to work effectively. Looking at that doxy again, I can see how the detection could be done in lavf and the conversion in lavc, but I don't really see an advantage there other than fewer API changes.

> Third point: detection is not something that works well, and people will
> frequently find versions of FFmpeg built without their favourite library.
> For both these reasons, applications using the library should be able to
> provide their own detection mechanism to complement or even replace the ones
> hardcoded in FFmpeg. Same goes for conversion, even if it is not as
> important.
> 

Yeah, a modular approach would be excellent.

> Fourth and last point: detecting text encoding is not useful only for text
> subtitles formats, other features may need it: filter graph files (think of
> drawtext options), ffmetadata files, etc.
> 
> Here is the API I am considering. I had started to implement it until
> bickering and lack of enthusiasm discouraged me.
> 
> The work happens in lavu, and is therefore available everywhere, replacing
> av_file_map() whenever it is used for text files. It is an API for reading
> text files / buffers / streams, taking care of all the gory details. Text
> encoding, of course, but also the LF / CRLF mess, possibly splitting lines
> at the same time, maybe normalizing spaces, etc.
> 

So, by default it'd just handle encoding, and then additional normalization features could be enabled by the consumer? Sounds useful indeed.
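As one concrete example of those gory details, here is roughly what the CRLF/CR normalization step might look like. This is just a sketch of the idea, not actual FFmpeg code; the function name is made up:

```c
#include <stddef.h>

/* Illustrative only: one of the "gory details" such a reader could
 * normalize is the line-ending mess. Collapses CRLF (and lone CR)
 * to LF in place and returns the new length. */
static size_t normalize_newlines(char *buf, size_t len)
{
    size_t r = 0, w = 0;
    while (r < len) {
        if (buf[r] == '\r') {
            buf[w++] = '\n';
            r += (r + 1 < len && buf[r + 1] == '\n') ? 2 : 1;
        } else {
            buf[w++] = buf[r++];
        }
    }
    return w;
}
```

Doing this in place works because the output can never be longer than the input, which keeps the buffering story simple.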

> The text-file-read API is controlled with a context parameter, holding
> amongst other things a list of "detection modules", and also "recoding
> modules". Detection modules are just a structure with a callback. FFmpeg
> provides built-in modules, such as your proposed libguess, libenca and
> libuchardet code, but applications can also create their own modules.
> 

I like this model in general, but it brings up a few questions that I kind of dodged in my patch. For instance, how should lavu determine which module's output to prefer if there are conflicting charenc guesses? How can the consumer choose between the given guesses?
In my patch, preference is very simplistic and the order is hardcoded. In a more modular system, it'd have to be a bit more complex; I can imagine some form of scoring system, or even another type of module that ranks possible guesses, but that could get very complex very fast. Any ideas for this?
In my patch, the consumer can override the choice of encoding by modifying the AVFormatContext between reading the header and retrieving the first packet; in your system, the cleanest way to do so would seem to be passing a callback.
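To make the scoring question concrete, here is a rough sketch of what a score-based module interface could look like. Every name here is invented for illustration; none of this exists in FFmpeg:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical module interface: each module reports a guess plus a
 * confidence score, and the framework keeps the highest-scoring one.
 * A consumer override could slot in as just another module that
 * always returns the maximum score. */
typedef struct TextDetectModule {
    const char *name;
    /* Return a confidence in [0,100] and set *enc to the guess. */
    int (*detect)(const char *buf, size_t len, const char **enc);
} TextDetectModule;

static int detect_utf8_bom(const char *buf, size_t len, const char **enc)
{
    if (len >= 3 && !memcmp(buf, "\xEF\xBB\xBF", 3)) {
        *enc = "UTF-8";
        return 100;                     /* a BOM is unambiguous */
    }
    return 0;
}

static int detect_latin1(const char *buf, size_t len, const char **enc)
{
    (void)buf; (void)len;
    *enc = "ISO-8859-1";
    return 10;                          /* weak catch-all fallback */
}

static const char *pick_encoding(const TextDetectModule *mods, int nb_mods,
                                 const char *buf, size_t len)
{
    const char *best = NULL;
    int best_score = -1;
    for (int i = 0; i < nb_mods; i++) {
        const char *enc = NULL;
        int score = mods[i].detect(buf, len, &enc);
        if (score > best_score) {
            best_score = score;
            best       = enc;
        }
    }
    return best;
}
```

A flat score keeps the comparison simple, while still letting a "ranking" module be expressed as an ordinary module that returns a suitably high score.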

On a bit of a side note: my system is designed to make every possible effort to return a recoded packet, with multiple layers of fallback behavior in case the first guess turns out to be incorrect or the source file is outright invalid. I wouldn't expect that to be significantly harder with your design, but I'd be interested in your opinion on that setup.

> Then it is just a matter of changing the subtitle-specific FFTextReader API
> to use the new lavu text-file-read API.
> 

So, the text-file-read API would buffer the entire input file and perform charenc detection/conversion and/or other normalization, then FFTextReader would read from the normalized buffer?
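If so, the recode stage of that pipeline might look roughly like this BOM-triggered UTF-16LE-to-UTF-8 conversion. This is hand-rolled and BMP-only for brevity (surrogate pairs omitted); nothing here is actual FFTextReader code, and the function name is made up:

```c
#include <stddef.h>

/* Sketch of the buffering stage: if the raw bytes start with a
 * UTF-16LE BOM, recode the whole buffer to UTF-8 up front so later
 * readers only ever see UTF-8. Returns the UTF-8 length, or 0 if the
 * input is not UTF-16LE or the output buffer is too small. */
static size_t utf16le_to_utf8(const unsigned char *src, size_t len,
                              char *dst, size_t dst_size)
{
    size_t w = 0;
    if (len < 2 || src[0] != 0xFF || src[1] != 0xFE)
        return 0;                       /* no UTF-16LE BOM */
    for (size_t r = 2; r + 1 < len; r += 2) {
        unsigned cp = src[r] | src[r + 1] << 8;
        if (cp < 0x80) {
            if (w + 1 > dst_size) return 0;
            dst[w++] = (char)cp;
        } else if (cp < 0x800) {
            if (w + 2 > dst_size) return 0;
            dst[w++] = (char)(0xC0 | cp >> 6);
            dst[w++] = (char)(0x80 | (cp & 0x3F));
        } else {
            if (w + 3 > dst_size) return 0;
            dst[w++] = (char)(0xE0 | cp >> 12);
            dst[w++] = (char)(0x80 | (cp >> 6 & 0x3F));
            dst[w++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    return w;
}
```

The point being: once detection and recoding happen on the raw buffer, none of the downstream text demuxers ever have to know the original encoding existed, which would also fix the UTF-16 issues from ticket #4059 in one place.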

> I hope this helps.
> 
> Regards,
> 
> -- 
>  Nicolas George
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

