[FFmpeg-devel] [PATCH] Support for UTF8 filenames on Windows

Ramiro Polla ramiro.polla
Sat Jul 18 08:01:48 CEST 2009


On Sat, Jul 18, 2009 at 2:00 AM, Karl Blomster<thefluff at uppcon.com> wrote:
> Ramiro Polla wrote:
>> On Thu, Jul 16, 2009 at 2:55 PM, Karl Blomster<thefluff at uppcon.com> wrote:
>>> Ramiro Polla wrote:
>>>> On Thu, Jul 16, 2009 at 11:20 AM, Karl Blomster<thefluff at uppcon.com> wrote:
>>>>> Unless I am severely missing something in your updated patch (thanks for
>>>>> the nice work, by the way!) it will not work with the FFmpeg commandline
>>>>> program. If you want a Unicode commandline in Windows you need to use
>>>>> wmain() or _tmain() instead of plain old main(), AFAIK. As I said earlier,
>>>>> my original patch was only intended to let the API support Unicode.
>>>>> Working it into ffmpeg.c would be a lot more work, I think.
>>>>
>>>> How do you test UNICODE support?
>>>>
>>>> I used attached shell file with msys (sh test_unicode.sh) and it works
>>>> as expected (only the unicode filename without FF_WINUTF8 fails). I
>>>> also tested with an app that used Find(First,Next)FileA() and passed
>>>> the unicode filenames as ascii string to ff_winutf8_open() and it also
>>>> worked as expected.
>>>
>>> Plain old cmd.exe (both with and without the chcp 65001 trick). I can do
>>> stuff like notepad.exe <unicode filename> and it'll work fine, but with
>>> ffmpeg it just says file not found (and prints a bogus filename). It works
>>> fine with mingw's sh; MinGW probably does some kind of black magic there to
>>> get Unix apps to work without having to patch in the Windows mess. The API
>>> works fine, of course.
>>
>> Do you know of any real example where a codepage->utf8 conversion
>> fails? I only see some possible theoretical references scattered
>> around the web, but no real examples.
>
> Not sure what you mean here. A given character string in a known codepage
> should always be possible to convert to UTF8, assuming that all the glyphs
> have UTF8 equivalents. I'm not sure if any codepages that aren't fully
> translatable actually exist.
>
>> I'm tempted to do the following:
>> - Always expect filenames in Windows to be passed in UTF8.
>
> This is dangerous for the reasons I mentioned earlier; namely that it isn't
> possible to reliably detect if a given string is UTF8 or not. Lots of
> applications using the ffmpeg API will pass strings in the local codepage,
> and it's theoretically quite possible that a given string in some unknown
> codepage could translate as valid UTF8 while not actually being that. For
> example, the ISO8859-1 string 0xC3 0xA1 (capital letter A with tilde +
> inverted exclamation mark) will translate as valid UTF8, but the result will
> be the single character U+00E1 (small letter a with acute) which is
> obviously wrong.
>
> While the likelihood of this actually happening in a real-world filename may
> be low, it's definitely there. In my humble opinion it's big enough to
> justify not turning UTF8 mode on always (despite how much I would like for
> everyone to switch to Unicode), but you're the maintainer, not I.
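
To make the ambiguity concrete, here is a minimal sketch of a strict UTF-8
decoder. The helper name utf8_decode() is hypothetical and not part of
libavformat. Fed the two bytes 0xC3 0xA1, it accepts them as well-formed
UTF-8 and yields the single code point U+00E1, even though a caller working
in ISO8859-1 meant two separate characters. A lone ISO8859-1 byte such as
0xE1 is rejected, which is why the collision only shows up for particular
byte pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s (len bytes available).
 * Returns the number of bytes consumed, or 0 if the bytes are not
 * well-formed UTF-8. Hypothetical helper for illustration only. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c, min;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                      /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 2-byte sequence */
        n = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 3-byte sequence */
        n = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 4-byte sequence */
        n = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;   /* stray continuation byte or invalid lead byte */
    }

    if (len < n)
        return 0;   /* truncated sequence */

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates and out-of-range values. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return n;
}
```

A validity check like this can tell "definitely not UTF-8" from "could be
UTF-8", but it cannot tell which encoding the caller intended.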

Oh, I wouldn't want to guess between codepage or UTF-8 or whatever,
that would be a nightmare. I was thinking about documenting "all file
names in Windows *must* be UTF-8 encoded[, unless environment variable
FOO is set]", and let the user of libavformat take care of that
conversion[ or set that variable]. I'm still unsure about the
environment variable.

>> - Always get the Unicode command line and convert it to UTF8.
>
> By all means, go for this if you feel up to it. Personally I was too lazy to
> do it since I didn't really need it myself (I submit patches mostly to
> scratch my own itches) but it would be a nice improvement.
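
For the command-line side, the conversion step itself is small. The sketch
below is a portable UTF-16 to UTF-8 converter, written without Windows
headers so it can be compiled anywhere. The function name utf16_to_utf8()
and its error convention are assumptions made for illustration. In a real
Windows build one would more likely fetch the wide arguments with
GetCommandLineW() and CommandLineToArgvW() and convert each one with
WideCharToMultiByte() using CP_UTF8, rather than hand-roll the encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a NUL-terminated UTF-16 string to NUL-terminated UTF-8.
 * Returns the number of bytes written (excluding the NUL), or
 * (size_t)-1 on an unpaired surrogate or if out is too small.
 * Hypothetical helper for illustration only. */
static size_t utf16_to_utf8(const uint16_t *in, char *out, size_t out_size)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);

    while (*in) {
        uint32_t c = *in++;
        unsigned char buf[4];
        size_t n, i;

        if (c >= 0xD800 && c <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (*in < 0xDC00 || *in > 0xDFFF)
                return (size_t)-1;
            c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return (size_t)-1;  /* stray low surrogate */
        }

        if (c < 0x80) {
            buf[0] = (unsigned char)c;
            n = 1;
        } else if (c < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (c >> 6));
            buf[1] = (unsigned char)(0x80 | (c & 0x3F));
            n = 2;
        } else if (c < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (c >> 12));
            buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (c & 0x3F));
            n = 3;
        } else {
            buf[0] = (unsigned char)(0xF0 | (c >> 18));
            buf[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (c & 0x3F));
            n = 4;
        }

        if (o + n >= out_size)
            return (size_t)-1;  /* always keep room for the NUL */
        for (i = 0; i < n; i++)
            out[o++] = (char)buf[i];
    }

    if (o >= out_size)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

Each converted argument could then be handed to the existing option parser
unchanged, since the parser only sees char strings.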

Then assuming filenames that come through the command line are in
UTF-8, we could choose between:

1 - lavf takes in UTF-8. lavf users must convert. No environment
variables. API breakage.
2 - lavf takes in UTF-8 by default, with environment variable to
select system codepage. ffmpeg always overrides that variable to use
UTF-8. API breakage. This would be a nuisance to lavf users who want
to pass filenames from system codepage.
3 - lavf takes in system codepage by default, with environment
variable to select UTF-8. ffmpeg always overrides that variable to use
UTF-8. No API breakage. This would be a nuisance for lavf users who
want to pass UTF-8 filenames.

They're all better than the current "0 - no unicode support".

I'm thinking now of aiming towards 3.
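
A rough sketch of how option 3 could select the encoding, under stated
assumptions: the enum and function names are hypothetical, the variable name
FF_WINUTF8 is borrowed from the test script mentioned earlier in the thread,
and treating any non-empty value other than "0" as "use UTF-8" is a guess at
reasonable semantics, not something the patch defines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum filename_encoding {
    FILENAME_SYSTEM_CODEPAGE,   /* default: no API breakage */
    FILENAME_UTF8
};

/* env_value is whatever getenv() returned for the selecting variable
 * (NULL if unset). force_utf8 models the ffmpeg command-line tool,
 * which would always override the variable. */
static enum filename_encoding
pick_filename_encoding(const char *env_value, int force_utf8)
{
    assert(force_utf8 == 0 || force_utf8 == 1);

    if (force_utf8)
        return FILENAME_UTF8;

    /* Assumption: unset, empty or "0" keeps the system codepage. */
    if (env_value && *env_value && strcmp(env_value, "0") != 0)
        return FILENAME_UTF8;

    return FILENAME_SYSTEM_CODEPAGE;
}
```

A library user would call something like
pick_filename_encoding(getenv("FF_WINUTF8"), 0), while ffmpeg itself would
pass 1 as the second argument after converting its command line.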

Comments and suggestions are welcome.

Ramiro Polla


