[MPlayer-dev-eng] Moving towards UTF-8
Rich Felker
dalias at aerifal.cx
Mon Oct 23 09:45:20 CEST 2006
On Mon, Oct 23, 2006 at 02:10:12PM +0800, Zuxy Meng wrote:
> Hi,
>
> I guess we've agreed that all internal strings inside mplayer are to
> be encoded in utf-8. Absolutely good news, especially for CJK users
> like me. But several things must be done or CJK users will most
> probably see messed up strings:
>
> 1. Filenames must be passed to fopen() as is, so maybe they shouldn't
Well the fundamental problem is that legacy charsets in the filesystem
need to be replaced by UTF-8, but unfortunately MPlayer can't do that
by itself...
> be stored as utf-8, and a mp_msg_noconv() should be introduced?
They should be _stored_ as raw byte sequences but translated from the
locale's charset to UTF-8 before being passed to mp_msg (which will
convert them back to the locale's charset :). This double-conversion
may seem wasteful but it's much cleaner. mp_msg_noconv() isn't viable
because the format string might contain non-ASCII characters, and
breaking messages up into multiple mp_msg calls seems like a very bad
idea since (for instance) a GUI implementation might want to display a
complete message as one unit.
> 2. ASF files' metadata is stored in utf-16le; it should be
> properly converted to utf-8 instead of simply being shrunk. As
> mentioned in another thread I'll attack this.
Yep, fixing this is straightforward I think.
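
For illustration, a minimal sketch of what a proper UTF-16LE to UTF-8 conversion could look like. The function name utf16le_to_utf8() is hypothetical and is not an existing MPlayer function; replacing unpaired surrogates with U+FFFD and stopping at the first U+0000 are assumptions, not something decided in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of MPlayer).
 * Converts n bytes of UTF-16LE from `in` into NUL-terminated UTF-8 in `out`.
 * Returns the number of bytes written (excluding the NUL), or -1 if `out`
 * is too small. Unpaired surrogates become U+FFFD; a trailing odd byte is
 * ignored; conversion stops at the first U+0000. */
static long utf16le_to_utf8(const unsigned char *in, size_t n,
                            char *out, size_t outcap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    if (outcap == 0)
        return -1;

    while (i + 1 < n) {
        uint32_t c = in[i] | ((uint32_t)in[i + 1] << 8);
        size_t need;

        i += 2;
        if (c == 0)
            break;                       /* embedded terminator */

        if (c >= 0xD800 && c <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            uint32_t lo = 0;
            if (i + 1 < n)
                lo = in[i] | ((uint32_t)in[i + 1] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                c = 0xFFFD;
            }
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            c = 0xFFFD;                  /* stray low surrogate */
        }

        need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (o + need + 1 > outcap)       /* keep room for the NUL */
            return -1;

        if (c < 0x80) {
            out[o++] = (char)c;
        } else if (c < 0x800) {
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[o++] = (char)(0xE0 | (c >> 12));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (c >> 18));
            out[o++] = (char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return (long)o;
}
```

Unlike keeping only the low byte of each 16-bit unit, this preserves every non-ASCII character, including those outside the BMP.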
> 3. Most challenging thing will be meta data stored in legacy encoding,
> like id3tag. Quite absurd, if the user doesn't bother to set
> mp_msg_charset, like most guys under Windows, s/he will probably see
> the correct string if it happens to be encoded in her/his locale,
> because it's printed unconverted; but if s/he or mplayer set
> mp_msg_charset correctly, s/he will surely see mess.
>
> For 3, my proposal was to treat such meta data as encoded in
> mp_msg_charset (I assumed that people tend to listen to more songs in
> their own language than in foreign languages :-)). Rich disagreed and
Yes. For example I have tons of mp3s with shift_jis in the id3 tags,
and it would be nice if it actually printed correctly.
> argued that the process should be more intelligent. Then a temporary
> solution would be: in mp_msg, when iconv() fails, instead of going for
> the next char, it should bail out and print the rest of the string as is.
This is an ugly hack. Instead the conversion should happen when
loading the metadata. The demuxer should first try parsing it as UTF-8
(almost sure to fail for non-UTF-8 strings), then in several other
charsets:
- charset of the user's locale
- popular CJK encodings
- latin-1
This list is not necessarily optimal. A better approach would probably
be to have a config variable that's an ordered list of charsets to
try, with the default ordered to pick up cjk charsets when possible
while avoiding too many false positives.
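
A rough sketch of the "try charsets in order at load time" idea, under stated assumptions: all names here (is_valid_utf8, try_order, metadata_to_utf8) are hypothetical and not MPlayer API. Only two decoders are implemented, strict UTF-8 and Latin-1; the locale charset and CJK encodings would sit between them, presumably backed by iconv(). Latin-1 accepts every byte sequence, so it has to come last.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Strict UTF-8 check: rejects bad lead or continuation bytes, overlong
 * forms, surrogates and code points above U+10FFFF. Returns 1 if valid. */
static int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len, k;
        unsigned long c, min;

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { len = 2; c = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; c = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; c = b & 0x07; min = 0x10000; }
        else return 0;

        if (n - i < len) return 0;
        for (k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            c = (c << 6) | (s[i + k] & 0x3F);
        }
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return 0;
        i += len;
    }
    return 1;
}

/* Each decoder writes NUL-terminated UTF-8 and returns its length,
 * or -1 if the input is not valid in that charset or does not fit. */
typedef long (*decode_fn)(const unsigned char *in, size_t n,
                          char *out, size_t cap);

static long decode_utf8(const unsigned char *in, size_t n,
                        char *out, size_t cap)
{
    if (!is_valid_utf8(in, n) || n + 1 > cap) return -1;
    memcpy(out, in, n);
    out[n] = '\0';
    return (long)n;
}

static long decode_latin1(const unsigned char *in, size_t n,
                          char *out, size_t cap)
{
    size_t i, o = 0;
    for (i = 0; i < n; i++) {
        size_t need = in[i] < 0x80 ? 1 : 2;
        if (o + need + 1 > cap) return -1;
        if (in[i] < 0x80) {
            out[o++] = (char)in[i];
        } else {
            out[o++] = (char)(0xC0 | (in[i] >> 6));
            out[o++] = (char)(0x80 | (in[i] & 0x3F));
        }
    }
    if (o + 1 > cap) return -1;
    out[o] = '\0';
    return (long)o;
}

struct charset_try { const char *name; decode_fn decode; };

/* Default order (an assumption); ideally user-configurable. */
static const struct charset_try try_order[] = {
    { "UTF-8",      decode_utf8   },
    /* locale charset and popular CJK encodings would be tried here */
    { "ISO-8859-1", decode_latin1 },
};

/* Returns the name of the first charset that decoded the input, or NULL. */
static const char *metadata_to_utf8(const unsigned char *in, size_t n,
                                    char *out, size_t cap)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < sizeof try_order / sizeof try_order[0]; i++)
        if (try_order[i].decode(in, n, out, cap) >= 0)
            return try_order[i].name;
    return NULL;
}
```

The demuxer would call something like metadata_to_utf8() once when reading the tag, so everything downstream (including mp_msg) only ever sees UTF-8.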
> Then for GBK encoded Chinese, in more than 80% of cases, the string won't
> be a legal UTF-8 sequence and hence the user will see the correct,
> unconverted string.
Unacceptable. If the string is GBK but the user has a UTF-8 system, it
will print nonsense to the terminal (possibly even corrupt terminal
control sequences). Maybe now this is rare, but eventually everyone
will be using UTF-8. Conversion must never be bypassed.
Rich