[MPlayer-dev-eng] [PATCH] Recode legacy metadata (was: Moving towards UTF-8)

Reimar Doeffinger Reimar.Doeffinger at stud.uni-karlsruhe.de
Mon Jun 25 10:40:32 CEST 2007


Hello,
On Mon, Jun 25, 2007 at 04:16:32PM +0800, Zuxy Meng wrote:
> 2007/6/25, Reimar Doeffinger <Reimar.Doeffinger at stud.uni-karlsruhe.de>:
> > On Mon, Jun 25, 2007 at 01:23:25PM +0800, Zuxy Meng wrote:
> > [...]
> > > +    const char* fallbacks[] = {
> > > +     "UTF-8",
> > > +     mp_msg_charset,
> >
> > In 90% of cases mp_msg_charset is either set to UTF-8 or to something
> > that can not be auto-detected. It has no place in this list.
> > There are two kinds of charsets: Those that can be detected _reliably_ in
> > most cases with > 4 characters. Those should be in this list.
> 
> What are examples of these two kinds of charsets, and how do I do the
> auto-detect?

You are already doing the auto-detection. You are just including charsets
that can't be auto-detected, like latin-1.
The point is, given a random input sequence, iconv will almost always
indicate an error if you specify UTF-8 as source format. It will never
fail if you specify latin1 or any similar charset, because all input
sequences are valid.
If you can't find anything with google, just create some random 3, 4, 5
and 6 byte strings or so and see how often iconv returns an error.
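
A rough sketch of that experiment, in case it helps. It assumes a POSIX
iconv; count_rejected(), next_byte() and the LCG constants are made up for
illustration and are not part of the patch. It feeds pseudo-random byte
strings to iconv and counts how many are rejected for a given source
charset.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LEN 16

/* Small deterministic LCG so the experiment is repeatable (illustrative). */
static uint32_t lcg_state = 12345u;

static unsigned char next_byte(void)
{
    lcg_state = lcg_state * 1664525u + 1013904223u;
    return (unsigned char)(lcg_state >> 24);   /* high bits are the best ones */
}

/* Hypothetical helper: returns how many of `trials` random strings of
 * `len` bytes iconv rejects when they are declared to be in charset
 * `from`, or -1 if iconv does not know that charset. */
long count_rejected(const char *from, size_t len, long trials)
{
    assert(len > 0 && len <= MAX_LEN);

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return -1;

    long rejected = 0;
    for (long t = 0; t < trials; t++) {
        char in[MAX_LEN], out[4 * MAX_LEN];
        for (size_t i = 0; i < len; i++)
            in[i] = (char)next_byte();

        char *src = in, *dst = out;
        size_t srcleft = len, dstleft = sizeof(out);

        iconv(cd, NULL, NULL, NULL, NULL);     /* reset conversion state */
        /* The output buffer is large enough that any failure here means
         * invalid (EILSEQ) or truncated (EINVAL) input, not E2BIG. */
        if (iconv(cd, &src, &srcleft, &dst, &dstleft) == (size_t)-1)
            rejected++;
    }
    iconv_close(cd);
    return rejected;
}
```

Calling count_rejected("UTF-8", 4, 1000) and
count_rejected("ISO-8859-1", 4, 1000) should show the difference: the
first rejects the large majority of strings, the second none, which is
why a successful latin-1 conversion tells you nothing.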

> > > +     ret = iconv(cd, (const char**)&inbuf, &inlen, &outbuf, &outlen);
> > > +     iconv_close(cd);
> > > +     if (ret != (size_t)(-1)) {
> >
> > I don't think you should treat E2BIG as an error, esp. since you made
> > the output buffer "only" twice as big as the input one.
> 
> Hmmm...is there a theoretical upper bound for this? I once assumed
> that a 2* expansion is safe.

Assuming UTF-8 as output and at least one byte of input per character, the
current theoretical limit is 4*, though 2* is safe _most_ of the time.
In theory UTF-8 could be extended to allow for up to 7* growth, but that
is unrealistic.
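
One way to avoid depending on the exact bound is to start with the 2*
guess and grow the buffer whenever iconv reports E2BIG, instead of
treating that as a failure. A minimal sketch under that assumption;
recode_to_utf8() is a hypothetical name, not the function from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts `inlen` bytes from charset `from` to
 * UTF-8. Returns a malloc()ed, NUL-terminated string, or NULL if the
 * charset is unknown, the input is invalid, or memory runs out. */
char *recode_to_utf8(const char *from, const char *in, size_t inlen)
{
    assert(from != NULL && in != NULL);

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = 2 * inlen + 1;        /* optimistic 2* guess plus NUL */
    char *out = malloc(cap);
    char *src = (char *)in;            /* iconv() wants a non-const pointer */
    size_t srcleft = inlen;
    size_t used = 0;

    while (out) {
        char *dst = out + used;
        size_t dstleft = cap - 1 - used;   /* keep one byte for the NUL */
        size_t ret = iconv(cd, &src, &srcleft, &dst, &dstleft);
        used = (size_t)(dst - out);
        if (ret != (size_t)-1)
            break;                     /* all input consumed */
        if (errno != E2BIG) {          /* EILSEQ / EINVAL: real failure */
            free(out);
            out = NULL;
            break;
        }
        /* E2BIG: iconv already advanced src and dst, so enlarge the
         * buffer and continue from where it stopped. */
        cap *= 2;
        char *tmp = realloc(out, cap);
        if (!tmp) {
            free(out);
            out = NULL;
            break;
        }
        out = tmp;
    }
    iconv_close(cd);
    if (out)
        out[used] = '\0';              /* UTF-8 output has no shift state to flush */
    return out;
}
```

For example, recode_to_utf8("ISO-8859-15", "\xa4", 1) has to grow the
buffer once, because the euro sign needs three bytes in UTF-8 and the
initial 2* guess only leaves room for two.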

Greetings,
Reimar Doeffinger


