[MPlayer-dev-eng] a patch to show Chinese characters correctly in terminal

Jehan Pagès jehan.marmottard at gmail.com
Sun Apr 4 10:49:03 CEST 2010


Hi,

there are several issues in your code but also in the simple concept
of this code. See below.

2010/4/4 ideal <idealities at gmail.com>:
> @@ -1318,19 +1319,33 @@
>  {
>     char **info = demuxer->info;
>     int n;
> +    char *in, *tag, *temp;
> +    size_t ilen, olen, len;
> +    char *tocode = strchr(getenv("LANG"), '.') + 1;

This is not the right way to get the codeset. The best way is:
char *tocode = nl_langinfo (CODESET);

(I checked in mplayer code, the langinfo.h file can be included in
configure. Its existence can be actually tested in the code with
#ifdef HAVE_LANGINFO)

IF this does not exist, I think that it is best to use setlocale than
getenv (though I am not sure of this point and of the difference of
both functions in the simple read use case, maybe there are none). But
what is for sure is that in logics of environment variable, you must
first test LC_ALL, if it is null, then LC_CTYPE, and then only if both
are null LANG (cf. "man 7 locale").

And finally, in your strchr, you should remove also any text after a
possible '@'. Actually a pretty common naming is using the '@' to tell
that it is some kind of modified locale. For instance, it is common to
see such a locale: "fr_FR.ISO-8859-16 at euro" (though now it is less
common as everybody switches to UTF-8. But the fact is that it has
existed and it probably still does in some places). And the 'euro'
sign naming convention was not the only case where such a '@' is used
in locales. So you must take care of this after-arobas part, not only
the before-dot part.

As a conclusion, note that the only good reliable way to get the
current codeset I know (if you know others, just tell me) is
"nl_langinfo (CODESET)" (which is equivalent as typing "locale
charmap" in console). Any other I explained above are potentially
wrong workarounds though so many programs use them as the primary
method. In fact if you look how are constructed locales, you would see
that the naming is only a convention so you could name a locale
whatever. I once made the test, like make a locale (really functional)
named "toto". Only a "locale charmap"/nl_langinfo could then tell the
right codeset (fortunately distributions are following the common
sense naming convention and not making crappy name, but it does not
mean we should rely on these names).

> +    iconv_t cd = iconv_open(tocode, "GB18030");

There is a slight issue here which is that cd keeps the conversion
state from a call to another. So for instance if the origin encoding
is using states, it could break subsequent tags.
As far as I can see,GB18030 is not using encoding states but it is not
a fixed sized codeset. So if ever one of the tag finishes badly, it
will also break the further tag decoding. Example: Wikipedia tells
that the euro sign is using a single byte code in GBK but a 2-byte
code in GB18030, which means that if you had a euro's gbk (and you
considered wrongly the encoding being GB18030), then first the
conversion of euro will be wrong, second if the tag finished by this
euro, the next tags will be wrong as well (as cd keeps it conversion
state, which will probably keep the pending euro byte).
So to be fair, you would need to reinitialize the cd for each tag, as
their texts are considered independant (they are, no?).


> But the problem is that it didn't try to detect what the encoding of the
> song's id3 tag (I will try to solve this problem), though it does not affect
> if the song's tag is all in English.
>

Ok here is the conceptual part which is the real issue here (the code
is fixable). I don't know if there exists libs that detects the
encoding of a string. But if it exists, this cannot be reliable. The
problem is not that the strings in an ID3 tag are small so we cannot
be sure which of a codeset of a super-codeset is the right one.
Somehow it doesn't matter if both were ok. But what when two codesets
are using the same numeric range without being supersets of another?
Or even simply are intersecting (if you don't go out this intersection
in your string sample, it is impossible to figure out which one it
is... unless you try some natural language recognition, but I doubt
this is to be done in mplayer :p).

And as a very simple example of what is wrong without good detection
of the initial string: you write « it does not affect if the song's
tag is all in English. » but if it is UTF-8 with non-ASCII characters,
you are completely breaking the string as you are making a conversion
from an UTF-8 string that you consider by default as GB18030 (which is
not a subset, nor superset of UTF-8) to UTF-8.
So you indeed break UTF-8 encoding in order to work around few cases
with GB18030 encoding.

The only reliable way I would see for the issue you point would be an
ASCII ID3 tag telling in which codeset has been written the other ID3
tags. Does such a standard tag exist? If yes, it would be indeed nice
to implement such a detection. And if a media file does not use this
ID3 tag, the bug will be in the media file, not mplayer's. Because the
current non-detection system provided here is a UTF-8 breaker and even
a replacement detection system would be unreliable.

> Thanks!

You are welcome, though I guess this was not the answer you hoped. But
I am not a maintainer here so my writing is not gold. I just give my
opinion/knowledge (which may still be wrong sometimes :p).

Bye.

Jehan



More information about the MPlayer-dev-eng mailing list