[MPlayer-dev-eng] [PATCH] Recode legacy metadata (was: Moving towards UTF-8)

Zuxy Meng zuxy.meng at gmail.com
Tue Jun 26 19:04:18 CEST 2007


Hi,

2007/6/25, Reimar Doeffinger <Reimar.Doeffinger at stud.uni-karlsruhe.de>:
> Hello,
> On Mon, Jun 25, 2007 at 04:16:32PM +0800, Zuxy Meng wrote:
> > 2007/6/25, Reimar Doeffinger <Reimar.Doeffinger at stud.uni-karlsruhe.de>:
> > > On Mon, Jun 25, 2007 at 01:23:25PM +0800, Zuxy Meng wrote:
> > > [...]
> > > > +    const char* fallbacks[] = {
> > > > +     "UTF-8",
> > > > +     mp_msg_charset,
> > >
> > > In 90% of cases mp_msg_charset is either set to UTF-8 or to something
> > > that can not be auto-detected. It has no place in this list.
> > > There are two kinds of charsets: Those that can be detected _reliably_ in
> > > most cases with > 4 characters. Those should be in this list.
> >
> > What are examples of these two kinds of charsets, and how do you do the
> > auto-detect?
>
> You are doing the auto-detect. Just you are including charsets that
> can't be auto-detected like latin-1.
> The point is, given a random input sequence, iconv will almost always
> indicate an error if you specify UTF-8 as source format. It will never
> fail if you specify latin1 or any similar charset, because all input
> sequences are valid.

It is indeed a problem, and one I've discussed with Rich before.
Currently mplayer treats non-ASCII legacy-encoded metadata as being in
MSG_CHARSET (which defaults to, and most commonly is, UTF-8) and invokes
iconv on it. If the conversion fails in the middle, mp_msg() skips one
byte and restarts iconv() from the next byte.
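
To make the skip-and-restart behavior concrete, here is a minimal sketch.
It is not the actual mp_msg() code: the function name
convert_skipping_bad_bytes, its parameters, and the fixed UTF-8 output
charset are assumptions made for illustration only.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not MPlayer code): convert `in` from charset `from`
 * to UTF-8, dropping every byte that iconv() rejects.
 * Returns the number of bytes written to `out` (which is NUL-terminated),
 * or (size_t)-1 if the conversion descriptor cannot be opened. */
size_t convert_skipping_bad_bytes(const char *from,
                                  const char *in, size_t inlen,
                                  char *out, size_t outsize)
{
    if (outsize == 0)
        return 0;

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inbuf = (char *)in;       /* iconv() does not modify the input */
    char *outbuf = out;
    size_t outleft = outsize - 1;   /* keep one byte for the terminator */

    while (inlen > 0) {
        if (iconv(cd, &inbuf, &inlen, &outbuf, &outleft) != (size_t)-1)
            break;                  /* everything was converted */
        if (errno == E2BIG)
            break;                  /* output buffer is full, stop here */
        /* EILSEQ or EINVAL: skip one input byte and start over */
        inbuf++;
        inlen--;
        iconv(cd, NULL, NULL, NULL, NULL);  /* reset the shift state */
    }

    iconv_close(cd);
    *outbuf = '\0';
    return (size_t)(outbuf - out);
}
```

With a single-byte source charset this loses at most the offending byte,
but with a multi-byte charset it can resynchronize in the middle of a
character.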

Sometimes this causes problems when displaying such metadata, but how bad
it is depends heavily on the encoding itself. For encodings that are
more or less Latin based (one byte per character, ASCII mixed with
characters carrying an acute, diaeresis, grave, etc.), the current
behavior is quite acceptable: if mp_msg_charset is the same as the charset
used in the metadata, the user will in many cases see the correct display.

CJK encodings are completely different (two bytes per character), and the
current mplayer behavior will most probably mess up everything if the
metadata is encoded in one of them. The problem is that we can't reliably
detect CJK encodings with iconv alone, because they overlap with each
other too much. We have to rely on other factors, such as the number of
commonly used characters versus the number of uncommonly used ones, to
make a good guess. A more complex statistical algorithm like the one used
in enca can detect them correctly in most cases (though surely not 100%),
but that would be overkill for a media player.

Putting mp_msg_charset in the list is therefore a temporary solution, on
the assumption that a media clip will most probably be enjoyed by people
whose locale matches the metadata encoding :-)
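
For reference, the fallback idea can be sketched roughly as follows. This
is not the code from the patch: converts_cleanly and guess_charset are
hypothetical names, and the candidate list is supplied by the caller. The
first charset that converts the whole input without an error wins, which
is why a charset such as Latin-1 that accepts every byte sequence only
makes sense as the last entry.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `in` converts from `charset` to UTF-8
 * without any error, 0 otherwise. The converted text is discarded. */
static int converts_cleanly(const char *charset, const char *in, size_t inlen)
{
    char scratch[256];
    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return 0;

    char *inbuf = (char *)in;
    int ok = 1;
    while (inlen > 0) {
        char *outbuf = scratch;
        size_t outleft = sizeof scratch;
        if (iconv(cd, &inbuf, &inlen, &outbuf, &outleft) == (size_t)-1
            && errno != E2BIG) {    /* E2BIG only means: reuse the scratch */
            ok = 0;                 /* EILSEQ or EINVAL: not this charset */
            break;
        }
    }
    iconv_close(cd);
    return ok;
}

/* Hypothetical helper: try each entry of the NULL-terminated `candidates`
 * list in order and return the first one that converts cleanly, or NULL. */
const char *guess_charset(const char *const *candidates,
                          const char *in, size_t inlen)
{
    for (; *candidates; candidates++)
        if (converts_cleanly(*candidates, in, inlen))
            return *candidates;
    return NULL;
}
```

Called with the list { "UTF-8", "ISO-8859-1", NULL }, valid UTF-8 input is
reported as UTF-8 and anything else falls through to ISO-8859-1, whether
or not that is the real encoding.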

> If you can't find anything with google, just create some random 3, 4, 5
> and 6 byte strings or so and see how often iconv returns an error.
>
> > > > +     ret = iconv(cd, (const char**)&inbuf, &inlen, &outbuf, &outlen);
> > > > +     iconv_close(cd);
> > > > +     if (ret != (size_t)(-1)) {
> > >
> > > I don't think you should treat E2BIG as an error, esp. since you made
> > > the output buffer "only" twice as big as the input one.
> >
> > Hmmm...is there a theoretical upper bound for this? I once assumed
> > that a 2* expansion is safe.
>
> Assuming UTF-8 as output and at least one byte per character input, the current
> theoretical limit is 4 *, though 2* is safe _most_ of the time.
> In theory UTF-8 could be expanded to allow for up to 7* growth, but that
> is unrealistic.

Thanks. I only knew that for Latin1 it's 2*, for CJK it's 1.5*:-)
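
The 2* figure for Latin1 is easy to check by hand: every byte below 0x80
stays one byte in UTF-8 and every byte from 0x80 up becomes exactly two.
A minimal sketch without iconv (latin1_to_utf8 is a hypothetical name,
not something in the patch):

```c
#include <stddef.h>

/* Hypothetical helper: convert ISO-8859-1 to UTF-8 by hand.
 * `out` must have room for 2 * inlen + 1 bytes, the worst case.
 * Returns the number of bytes written, not counting the terminator. */
size_t latin1_to_utf8(const unsigned char *in, size_t inlen,
                      unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < inlen; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                   /* ASCII: copied as is */
        } else {
            out[n++] = 0xC0 | (in[i] >> 6);     /* lead byte: 0xC2 or 0xC3 */
            out[n++] = 0x80 | (in[i] & 0x3F);   /* continuation byte */
        }
    }
    out[n] = '\0';
    return n;
}
```

The 1.5* figure for CJK comes from the same kind of count: a two-byte
legacy character typically becomes three bytes in UTF-8.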
-- 
Zuxy
Beauty is truth,
While truth is beauty.
PGP KeyID: E8555ED6


