[FFmpeg-devel] [PATCH] Support for UTF8 filenames on Windows

Sat Jul 18 07:00:22 CEST 2009

Ramiro Polla wrote:
> Hi,
> 
> On Thu, Jul 16, 2009 at 2:55 PM, Karl Blomster<thefluff at uppcon.com> wrote:
>> Ramiro Polla wrote:
>>> On Thu, Jul 16, 2009 at 11:20 AM, Karl Blomster<thefluff at uppcon.com>
>>> wrote:
>>>> Unless I am severely missing something in your updated patch (thanks for
>>>> the nice work, by the way!) it will not work with the FFmpeg commandline
>>>> program. If you want a Unicode commandline in Windows you need to use
>>>> wmain() or _tmain() instead of plain old main(), AFAIK. As I said earlier
>>>> my original patch was only intended to let the API support Unicode.
>>>> Working it into ffmpeg.c would be a lot more work, I think.
>>> How do you test UNICODE support?
>>>
>>> I used attached shell file with msys (sh test_unicode.sh) and it works
>>> as expected (only the unicode filename without FF_WINUTF8 fails). I
>>> also tested with an app that used Find(First,Next)FileA() and passed
>>> the unicode filenames as ascii string to ff_winutf8_open() and it also
>>> worked as expected.
>> Plain old cmd.exe (both with and without the chcp 65001 trick). I can do
>> stuff like notepad.exe <unicode filename> and it'll work fine, but with
>> ffmpeg it just says file not found (and prints a bogus filename). It works
>> fine with mingw's sh; MinGW probably does some kind of black magic there to
>> get Unix apps to work without having to patch in the Windows mess. The API
>> works fine, of course.
> 
> Do you know of any real example where a codepage->utf8 conversion
> fails? I only see some possible theoretical references scattered
> around the web, but no real examples.

Not sure what you mean here. A given character string in a known codepage should
always be possible to convert to UTF8, assuming that all of its characters have
Unicode equivalents. I'm not sure if any codepages that aren't fully translatable
actually exist.

> I'm tempted to do the following:
> - Always expect filenames in Windows to be passed in UTF8.

This is dangerous for the reasons I mentioned earlier; namely that it isn't 
possible to reliably detect if a given string is UTF8 or not. Lots of 
applications using the ffmpeg API will pass strings in the local codepage, and 
it's theoretically quite possible that a given string in some unknown codepage 
could decode as valid UTF8 while not actually being that. For example, the
ISO8859-1 string 0xC3 0xA1 (capital letter A with tilde + inverted exclamation
mark) is also well-formed UTF8, but decoded as UTF8 it becomes the single
character U+00E1 (small letter a with acute), which is obviously wrong.
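
To make the ambiguity concrete, here is a rough sketch in portable C of a
strict UTF8 well-formedness check. The names utf8_decode() and
looks_like_utf8() are made up for this illustration and are not part of the
patch or of libavformat. The bytes 0xC3 0xA1 pass the check and decode to
U+00E1, even though the caller may have meant two ISO8859-1 characters, so a
"valid UTF8" result says nothing about what the caller intended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n bytes available).
 * Returns the number of bytes consumed, or 0 if the bytes are not
 * well-formed UTF-8. On success the code point is stored in *cp. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* smallest code point that legitimately needs 2, 3 or 4 bytes */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    assert(s && cp);
    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07;
    } else {
        return 0;               /* stray continuation or invalid lead byte */
    }
    if (n < len)
        return 0;               /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, surrogates and values past U+10FFFF */
    if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}

/* Hypothetical helper: 1 if the whole buffer is well-formed UTF-8, else 0. */
int looks_like_utf8(const unsigned char *s, size_t n)
{
    uint32_t cp;

    while (n > 0) {
        size_t used = utf8_decode(s, n, &cp);
        if (!used)
            return 0;
        s += used;
        n -= used;
    }
    return 1;
}
```

A lone ISO8859-1 byte such as 0xE1 is rejected by this check, so many
codepage strings are caught, but not all of them.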

While the likelihood of this actually happening in a real-world filename may be 
low, it's definitely there. In my humble opinion it's big enough to justify not 
turning UTF8 mode on always (despite how much I would like for everyone to 
switch to Unicode), but you're the maintainer, not I.

> - Always get the Unicode command line and convert it to UTF8.

By all means, go for this if you feel up to it. Personally I was too lazy to do 
it since I didn't really need it myself (I submit patches mostly to scratch my 
own itches) but it would be a nice improvement.

> And this is the information I've gathered from comments and
> suggestions and asking around some Win32 RE guys but no real hard
> facts or MSDN documentation:
> - Windows file system APIs use UTF-16 internally, so any codepage that
> can't be converted to UTF-16 will be a problem anyways and we
> shouldn't worry about it.
> - UTF16->UTF8 conversion might be lossless (some suggest the extra
> characters in codepages that can't be represented in unicode are being
> assigned invalid unicode values).

Yeah, I wouldn't worry about UTF8<->UTF16 translations.
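
For reference, a minimal sketch of the UTF16->UTF8 direction in portable C.
The function name utf16_to_utf8() and its error convention are assumptions
made for this example; on Windows one would more likely call
WideCharToMultiByte() with CP_UTF8. Well-formed UTF16 (including surrogate
pairs) converts without loss. The one case this sketch refuses is an
unpaired surrogate, which has no strict UTF8 encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert n UTF-16 code units to UTF-8.
 * Returns the number of bytes written to out, or (size_t)-1 if the input
 * contains an unpaired surrogate or out (capacity cap) is too small. */
size_t utf16_to_utf8(const uint16_t *in, size_t n,
                     unsigned char *out, size_t cap)
{
    size_t i = 0, o = 0;

    assert(in && out);
    while (i < n) {
        uint32_t c = in[i++];

        if (c >= 0xD800 && c <= 0xDBFF) {           /* high surrogate */
            if (i >= n || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return (size_t)-1;                  /* no low surrogate */
            c = 0x10000 + ((c - 0xD800) << 10) + (in[i++] - 0xDC00);
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return (size_t)-1;                      /* lone low surrogate */
        }

        if (c < 0x80) {
            if (o + 1 > cap) return (size_t)-1;
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {
            if (o + 2 > cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            if (o + 3 > cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            if (o + 4 > cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xF0 | (c >> 18));
            out[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

The output is not NUL-terminated; a caller passing the result on as a
filename would need to reserve one extra byte and terminate it.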

Regards,
Karl Blomster