[MPlayer-dev-eng] Moving towards UTF-8

Zuxy Meng zuxy.meng at gmail.com
Mon Oct 23 09:43:15 CEST 2006


2006/10/23, Rich Felker <dalias at aerifal.cx>:
> > argued that the process should be more intelligent. Then a temporary
> > solution would be: in mp_msg, when iconv() fails, instead of skipping
> > to the next char, it should bail out and print the rest of the string
> > as is.
>
> This is an ugly hack. Instead the conversion should happen when
> loading the metadata. The demuxer should first try parsing it as UTF-8
> (almost sure to fail for non-UTF-8 strings), then in several other
> charsets:
>
> - charset of the user's locale
> - popular CJK encodings
> - latin-1
>
> This list is not necessarily optimal. A better approach would probably
> be to have a config variable that's an ordered list of charsets to
> try, with the default ordered to pick up cjk charsets when possible
> while avoiding too many false positives.
>
> > Then for GBK-encoded Chinese, in more than 80% of cases the string
> > won't be a legal UTF-8 sequence and hence the user will see the
> > correct, unconverted string.
>
> Unacceptable. If the string is GBK but the user has a UTF-8 system, it
> will print nonsense to the terminal (possibly even corrupt terminal
> control sequences). Maybe now this is rare, but eventually everyone
> will be using UTF-8. Conversion must never be bypassed.

Well, currently, if the string is in GBK but MSG_CHARSET != GBK, the
user has no chance of getting anything sane on the terminal, regardless
of his/her locale. mp_msg() converts on a best-effort basis: when
iconv() fails on a byte, it skips just that one byte and retries from
the next one, but GBK is a two-byte encoding, so the retry can land in
the middle of a character....
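
For reference, the ordered-charset idea quoted above could look roughly
like the sketch below. This is only an illustration under stated
assumptions, not MPlayer code: try_charset() and convert_with_fallback()
are hypothetical names, the charset order is just an example, and it
assumes the platform's iconv() knows the "GBK" charset name. Each
candidate is tried strictly (no byte skipping), and the first charset
that converts the whole string cleanly wins.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: strictly convert in[0..inlen) from charset `from`
 * to UTF-8. Returns the number of bytes written to `out`, or (size_t)-1
 * if the charset is unknown, the input is not valid in that charset, or
 * the output buffer is too small. No bytes are ever skipped. */
static size_t try_charset(const char *from, const char *in, size_t inlen,
                          char *out, size_t outsz)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *ip = (char *)in;   /* iconv() takes char **, it does not modify input */
    char *op = out;
    size_t ileft = inlen, oleft = outsz;

    size_t rc = iconv(cd, &ip, &ileft, &op, &oleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &op, &oleft);   /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return outsz - oleft;
}

/* Hypothetical helper: walk a NULL-terminated, ordered list of charset
 * names and return the first one under which the whole input converts.
 * On success *outlen holds the UTF-8 length; returns NULL if none fit. */
static const char *convert_with_fallback(const char *in, size_t inlen,
                                         const char *const *charsets,
                                         char *out, size_t outsz,
                                         size_t *outlen)
{
    for (size_t i = 0; charsets[i] != NULL; i++) {
        size_t n = try_charset(charsets[i], in, inlen, out, outsz);
        if (n != (size_t)-1) {
            *outlen = n;
            return charsets[i];
        }
    }
    return NULL;
}
```

Called with the list { "UTF-8", "GBK", "ISO-8859-1", NULL }, a GBK
string such as the bytes D6 D0 CE C4 is rejected by the UTF-8 attempt
and picked up by GBK. Note that ISO-8859-1 accepts any byte sequence, so
it only makes sense as the last entry.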

-- 
Zuxy
Beauty is truth,
While truth is beauty.
PGP KeyID: E8555ED6


