[FFmpeg-devel] UTF-8/encoding/string handling ideas

Rich Felker dalias
Tue Oct 23 04:24:46 CEST 2007


Based on the UTF-8 decoder bug report thread I started, I have some
ideas for addressing problems in ffmpeg. Basically, the problems seem
to come down to 2 areas:

1. The ffmpeg application is encoding-agnostic and simply passes
strings from the command line (e.g. container metadata) into the
libraries without tagging their character encoding or converting them
to UTF-8.

2. The FFmpeg libraries assume text passed to them is UTF-8, but do
not validate what the caller provides before storing it to files, and
may do really bogus things with invalid data (which the caller likely
has not checked) when converting to UCS-2/4, UTF-16, etc. for storage
in a file.

I think point #1 is proof that point #2 is a problem. If even the
reference application using the libs gets it wrong and passes
incorrectly encoded or unvalidated data, how can other apps using the
libs be expected to do better?

Here are my ideas towards a solution:

-- Behavior of the libraries (mainly libavformat)

It's good that the libraries expect data to be passed as UTF-8. FFmpeg
does not deal with the locale's text encoding, and rightfully so,
because it's dealing with data in files that could include all sorts
of characters not representable in the locale. Let's just make the
parts that use text strings validate the UTF-8 before storing it to
files, and generate hard errors if anything is invalid. Silently
doing substitutions is a bad idea, because then you end up with
incorrect files after 12+ hour encoding jobs rather than detecting the
mistake early.

Should the libraries also generate hard errors if the field to be
written only supports ASCII but non-ASCII characters are passed?
Probably... Thoughts?
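
To make the discussion concrete, here's a rough sketch of the kind of
check I have in mind. The function name and error policy are made up
for illustration; this is not existing libavformat API:

#include <stdint.h>

/* Return 1 if s is well-formed UTF-8, 0 otherwise.  If ascii_only is
 * nonzero, also reject any byte >= 0x80, for fields whose on-disk
 * format only allows ASCII. */
static int check_utf8(const uint8_t *s, int ascii_only)
{
    while (*s) {
        uint32_t c = *s++, min;
        int n;

        if (c < 0x80) continue;
        if (ascii_only) return 0;

        if      ((c & 0xE0) == 0xC0) { n = 1; min = 0x80;    c &= 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; min = 0x800;   c &= 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; min = 0x10000; c &= 0x07; }
        else return 0;            /* stray continuation byte or 0xF8-0xFF */

        while (n--) {
            if ((*s & 0xC0) != 0x80) return 0;    /* truncated sequence */
            c = (c << 6) | (*s++ & 0x3F);
        }
        if (c < min) return 0;                    /* overlong encoding */
        if (c > 0x10FFFF) return 0;               /* beyond Unicode range */
        if (c >= 0xD800 && c <= 0xDFFF) return 0; /* UTF-16 surrogate */
    }
    return 1;
}

A muxer would call this on each metadata string and fail hard on a
zero return instead of writing the file.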

-- Behavior of ffmpeg application

Here, I think we have several options:

The UTF-8-enforcer part of me wants to say FFmpeg does not want to
have to deal with text encodings, and should just forbid non-ASCII
text input whenever the locale's encoding is not UTF-8. This solution
is simple and robust, and it ensures that the strings passed to the
libraries are always UTF-8 (because ASCII is UTF-8 too) without doing
any conversions. It sounds like something Michael might like too. :)
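
As a sketch, the check could be as dumb as this (locale_is_utf8() is
an assumed helper, not an existing function; how to implement it
portably is exactly the open question at the end of this mail):

/* Refuse any non-ASCII byte unless the locale is known to be UTF-8. */
int locale_is_utf8(void);         /* assumed; see heuristic at the end */

static int accept_arg(const char *s)
{
    if (locale_is_utf8())
        return 1;                 /* already UTF-8; libs validate anyway */
    for (; *s; s++)
        if ((unsigned char)*s >= 0x80)
            return 0;             /* hard error: non-ASCII, unknown encoding */
    return 1;
}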

The let's-be-fair-to-everyone part of me, on the other hand, says some
conversion might be in order. My idea for conversion is to check for
the __STDC_ISO_10646__ macro, and if it's present, use the mbrtowc
function to convert the local encoding to wchar_t, then UTF-8 encode
the UCS-4 values in the wchar_t. This avoids having to depend on iconv
or nl_langinfo(CODESET), which are notoriously unreliable on some
platforms. If __STDC_ISO_10646__ is not defined, we would fall back to
the big-meanie behavior described above: no non-ASCII text allowed.
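
Roughly, I'm thinking of something like the following (an untested
sketch; local_to_utf8 is a made-up name, and it assumes
setlocale(LC_CTYPE, "") has already been called):

#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#ifdef __STDC_ISO_10646__
/* Convert a string from the locale's multibyte encoding to a freshly
 * malloc'd UTF-8 string, relying on wchar_t holding ISO 10646 code
 * points.  Returns NULL on undecodable input or allocation failure. */
static char *local_to_utf8(const char *src)
{
    size_t len = strlen(src);
    /* Each input byte yields at most one wchar_t, and each code point
     * up to U+10FFFF needs at most 4 UTF-8 bytes. */
    char *out = malloc(4 * len + 1), *p = out;
    mbstate_t st;

    if (!out) return NULL;
    memset(&st, 0, sizeof st);

    while (len) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, src, len, &st);
        if (r == 0 || r == (size_t)-1 || r == (size_t)-2 ||
            wc > 0x10FFFF || (wc >= 0xD800 && wc <= 0xDFFF)) {
            free(out);
            return NULL;          /* hard error, no silent substitution */
        }
        src += r; len -= r;

        /* UTF-8 encode the UCS-4 value in wc */
        if (wc < 0x80) {
            *p++ = wc;
        } else if (wc < 0x800) {
            *p++ = 0xC0 |  (wc >> 6);
            *p++ = 0x80 |  (wc & 0x3F);
        } else if (wc < 0x10000) {
            *p++ = 0xE0 |  (wc >> 12);
            *p++ = 0x80 | ((wc >> 6) & 0x3F);
            *p++ = 0x80 |  (wc & 0x3F);
        } else {
            *p++ = 0xF0 |  (wc >> 18);
            *p++ = 0x80 | ((wc >> 12) & 0x3F);
            *p++ = 0x80 | ((wc >> 6) & 0x3F);
            *p++ = 0x80 |  (wc & 0x3F);
        }
    }
    *p = 0;
    return out;
}
#endif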

Either way, we end up with UTF-8 strings to pass to the libraries.

Also, note that the two approaches aren't mutually exclusive. We could
quickly add a "reject 8-bit octets if the encoding is not UTF-8" check
now to prevent invalid data, and extend it to the full conversion
approach later.

BTW, one problem: how should we detect whether the locale's encoding
is UTF-8? Is there any good/clean/portable way? :(
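
One possible heuristic, under the same __STDC_ISO_10646__ assumption:
decode the UTF-8 encoding of some non-ASCII character and see whether
the right code point comes out. A non-UTF-8 locale could in principle
decode the same bytes to the same value, so this only tells us
"probably UTF-8", which is why I'm asking:

#include <string.h>
#include <wchar.h>

#ifdef __STDC_ISO_10646__
/* Heuristic: "\xC3\xA9" is the UTF-8 encoding of U+00E9.  If the
 * current locale decodes those 2 bytes to exactly that code point,
 * it's almost certainly UTF-8 (Latin-1 etc. consume only 1 byte). */
static int locale_is_utf8(void)
{
    wchar_t wc = 0;
    mbstate_t st;
    memset(&st, 0, sizeof st);
    return mbrtowc(&wc, "\xC3\xA9", 2, &st) == 2 && wc == 0xE9;
}
#endif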

Rich



