[FFmpeg-devel] [PATCH] Support for UTF8 filenames on Windows

Karl Blomster thefluff
Fri Jun 26 17:29:40 CEST 2009


Ramiro Polla wrote:
> On Fri, Jun 26, 2009 at 11:07 AM, Karl Blomster<thefluff at uppcon.com> wrote:
>> Måns Rullgård wrote:
>>> Karl Blomster <thefluff at uppcon.com> writes:
>>>> Ramiro Polla wrote:
>>>>> On Thu, Jun 25, 2009 at 8:59 AM, Michael
>>>>> Niedermayer<michaelni at gmx.at> wrote:
>>>>>> On Sat, Jun 20, 2009 at 11:56:37PM +0200, Karl Blomster wrote:
>>>>>>> Currently, ffmpeg on Windows does not support opening files whose
>>>>>>> names
>>>>>>> contain characters that cannot be expressed in the current locale,
>>>>>>> because
>>>>>>> on Windows you can't pass UTF8 in a char* to _open() and have it work.
>>>>>>> You
>>>>>>> have to convert the filename to UTF16 and use _wopen(), which takes a
>>>>>>> wchar_t* instead.
>>>>>>>
>>>>>>> I have attached a patch that attempts to solve the problem with a
>>>>>>> rather
>>>>>>> ugly hack. It Works For Me(tm) under mingw at least. Comments are
>>>>>>> appreciated.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Karl Blomster
>>>>>>>  os_support.c |   17 +++++++++++++++++
>>>>>>>  os_support.h |    5 +++++
>>>>>>>  2 files changed, 22 insertions(+)
>>>>>>> 9afa6887f1f6998c37d75efaae5d589918dc752b  ffmpeg_win_utf8_paths.patch
>>>>>>> Index: libavformat/os_support.c
>>>>>>> ===================================================================
>>>>>>> --- libavformat/os_support.c  (revision 19242)
>>>>>>> +++ libavformat/os_support.c  (working copy)
>>>>>>> @@ -30,6 +30,23 @@
>>>>>>>  #include <sys/time.h>
>>>>>>>  #include "os_support.h"
>>>>>>>
>>>>>>> +#ifdef HAVE_WIN_UTF8_PATHS
>>>>>>> +#define WIN32_LEAN_AND_MEAN
>>>>>>> +#include <windows.h>
>>>>>>> +#endif
>>>>>>> +
>>>>>>> +#ifdef HAVE_WIN_UTF8_PATHS
>>>>> Where is HAVE_WIN_UTF8_PATHS defined?
>>>> Nowhere, right now. My thought is to let configure set it with some
>>>> --enable parameter, or you just pass -DHAVE_WIN_UTF8_PATHS in your
>>>> CFLAGS. The point was that I thought it might be a good idea to let
>>>> the user compile with it disabled, if he wanted to, like if someone
>>>> wanted to build on Win9x (heh) or something where unicode support
>>>> might not be available.
>>> Can we simply test for the existence of _wopen()?  Is there any reason
>>> to disable this if the function exists?
>> That may be dangerous. It will always exist in the MinGW includes/libraries,
>> but that doesn't mean it's implemented and works in the runtime libraries
>> you end up using. See also below.
> 
> Is this something from msvcrt or from the MinGW runtime libraries?
> FFmpeg already expects minimum mingw-rt and w32api versions.
> 
> If it's because of Win9x users, we already have a couple of places
> that need higher versions of Windows (like a call in getutime in
> ffmpeg.c and inside vfwcap IIRC). I haven't heard of anyone seriously
> using FFmpeg in Win9x and before that happens I don't think we should
> worry about them =)

Then that shouldn't be a concern. If someone else knows of a Windows platform 
where ffmpeg might get used that doesn't support Unicode, I guess they should 
speak up now. :V

>>>>>>> +int winutf8_open(const char *filename, int oflag, int pmode)
>>>>>>> +{
>>>>>>> +     wchar_t wfilename[MAX_PATH * 2];
>>>>>>> +
>>>>>>> +     if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, filename, -1, wfilename, MAX_PATH) > 0)
>>>>>>> +             return _wopen(wfilename, oflag, pmode);
>>>>>>> +     else
>>>>>>> +             return open(filename, oflag, pmode);
>>>>>>> +}
>>>>>>> +#endif
>>> What might cause MultiByteToWideChar() to fail?  What will plain
>>> open() do with such input?  Also, what is the value of MAX_PATH?
>>> It is probably a bad idea to silently truncate the filename at
>>> MAX_PATH characters.  This could turn an invalid name into the name of
>>> an existing file.
>> MultiByteToWideChar() will fail in this case if the input string has
>> characters that cannot be translated as valid UTF8 (since
>> MB_ERR_INVALID_CHARS is specified). This might happen if you have a
>> multi-byte string that isn't UTF8, like for example in the system's local
>> code page (if it's multi-byte). It can also fail if the buffer length is
>> insufficient, or if you lack CP_UTF8, but neither should be a concern here.
>>
>> open() should, as far as I am aware, deal gracefully with multi-byte strings
>> in the system locale, but since it is conceivable that there might be
>> multi-byte characters in the local code page that can be interpreted as
>> valid UTF-8 even though they are not, and considering the fact that the
>> MSVCRT behaves really weirdly with character translations sometimes, the
>> only truly safe option here is to pass only UTF-8 or latin-1; other
>> character sets are not guaranteed to work. Hence my preference for leaving
>> it optional, so people who want UTF-8 filenames on Windows can get them and
>> everyone else can go about their business as usual.
> 
> If it's optional it should be documented and the consequences made clear.

Yes. I can write such documentation if someone points me to where to put it.

>> MAX_PATH is defined to 260 in WinDef.h, and that is actually the maximum
>> allowed path length in the Win32 API unless you want to jump through some
>> hoops. Paths of up to 32,767 characters (approximately) are allowed, but
>> only if they are absolute and start with the magical \\?\ prefix. I guess I
>> could do some detection of relative paths and add said magical prefix
>> manually if so desired, but the static allocation seems safe enough, and the
>> 260 character limit is indeed what a vast majority of Windows programs use.
> 
> Indeed, FFmpeg fails with long names. But if you truncate the long
> name, it might turn into a valid name (like Mans said).

Right, so if strlen(filename) > MAX_PATH, the function should fail? Or should I 
try the long paths workaround? (It will be a minor pain to implement, because 
detecting relative paths on Windows is pretty annoying.)

>> Updated patch with fewer tabs (and a rather embarrassing typo fix) attached.
>>
>> Regards,
>> Karl Blomster
>>
>> Index: libavformat/os_support.c
>> ===================================================================
>> --- libavformat/os_support.c    (revision 19266)
>> +++ libavformat/os_support.c    (working copy)
>> @@ -30,6 +30,23 @@
>>  #include <sys/time.h>
>>  #include "os_support.h"
>>
>> +#ifdef HAVE_WIN_UTF8_PATHS
>> +#define WIN32_LEAN_AND_MEAN
>> +#include <windows.h>
>> +#endif
>> +
>> +#ifdef HAVE_WIN_UTF8_PATHS
>> +int winutf8_open(const char *filename, int oflag, int pmode)
>> +{
>> +    wchar_t wfilename[MAX_PATH * 2];
> 
> Isn't sizeof(wchar_t) == 2?

Yes (at least on Win32), but characters outside the Basic Multilingual Plane 
require two UTF-16 code units each. Of course this is a bit esoteric 
because the likelihood of such characters being used in filenames is very low, 
but in theory it could happen and it's not like allocating 520 extra bytes in a 
temporary buffer is going to kill anyone, so...
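
To make the surrogate pair point concrete, here is a minimal portable sketch
that counts how many UTF-16 code units a UTF-8 string needs. It is not part of
the patch; the name utf16_units_for_utf8 is invented for illustration, and the
code assumes the input is already valid UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: number of UTF-16 code units
 * needed for a UTF-8 string, excluding the terminator.
 * Assumes valid UTF-8; no validation is performed. */
size_t utf16_units_for_utf8(const char *s)
{
    const unsigned char *p;
    size_t units = 0;

    assert(s != NULL);
    for (p = (const unsigned char *)s; *p; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;   /* continuation byte, its lead byte was counted */
        /* A lead byte of the form 11110xxx starts a 4-byte sequence, i.e. a
         * code point above U+FFFF, which needs two code units in UTF-16. */
        units += ((*p & 0xF8) == 0xF0) ? 2 : 1;
    }
    return units;
}
```

With this, "å" (bytes C3 A5) counts as one code unit, while U+1D11E (bytes
F0 9D 84 9E) counts as two.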

> I think you could also use wchar_t wfilename[strlen(filename) + 1]
> instead of malloc if we are going to try and pass paths larger than
> MAX_PATH.

The "proper" way would, I think, be to use
MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, filename, -1, NULL, 0)
first, because that returns the exact number of wide characters required to 
store the string.
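
A minimal sketch of that two-call pattern, assuming a Windows build (the body
is guarded by _WIN32 and compiles to nothing elsewhere). The helper name
utf8_to_wide_dup is hypothetical and not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>

/* Hypothetical helper, not part of the patch: convert a NUL-terminated UTF-8
 * string to a freshly allocated UTF-16 string.
 * Returns NULL on invalid input or allocation failure; the caller frees. */
wchar_t *utf8_to_wide_dup(const char *utf8)
{
    wchar_t *wide;
    int n;

    assert(utf8 != NULL);

    /* First call: a zero output size asks for the required length in wide
     * characters. Because the input length is -1, the count includes the
     * terminating NUL. */
    n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, NULL, 0);
    if (n <= 0)
        return NULL;

    wide = malloc((size_t)n * sizeof(*wide));
    if (!wide)
        return NULL;

    /* Second call: the actual conversion into the exactly sized buffer. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, wide, n) <= 0) {
        free(wide);
        return NULL;
    }
    return wide;
}
#endif /* _WIN32 */
```

The result could then be handed to _wopen() and freed afterwards, which avoids
both the fixed-size buffer and any silent truncation.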

Regards,
Karl Blomster


