[FFmpeg-devel] [PATCH] "Mojibake" in Japanese

Michael Niedermayer michaelni at gmx.at
Wed Feb 15 04:53:14 CET 2012

Hi Vladimir

On Wed, Feb 15, 2012 at 02:52:23AM +0400, Vladimir Mosgalin wrote:
> Hi Michael Niedermayer!
>  On 2012.02.14 at 22:36:04 +0100, Michael Niedermayer wrote next:
> > It is maybe easier for the end user of a package if it could be
> > selected at runtime.
> > For example with an environment variable.
> > 
> > Still better would be if autodetection could be done, is there some
> > readily available software (like iconv) that can guess from a char*
> > what encoding is used ?
> I can answer this, and the answer is "no". From experience, even
> detection among a few Cyrillic encodings (based on letter frequencies) is
> often impossible if all you have is two or three words. It just won't
> work correctly. And that's when you are sure it's a single-byte encoding;
> if you add the possibility of multibyte Japanese and some others into the
> mix, it becomes impossible to detect anything on something as short as
> a typical ID3 tag.
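
To illustrate the point quoted above, here is a minimal Python sketch showing
that the same short byte string decodes without error under several
single-byte Cyrillic code pages, so "does it decode?" alone cannot pick the
right one. The sample tag value and the candidate list are assumptions made
for illustration only; nothing here is FFmpeg code.

```python
# Illustration only: a short tag value, encoded once, is then decoded
# under several single-byte Cyrillic encodings.
text = "Кино"                 # hypothetical tag value
raw = text.encode("cp1251")   # pretend this is what the file contains

candidates = ["cp1251", "koi8_r", "cp866", "iso8859_5"]  # assumed list
decoded = {}
for enc in candidates:
    try:
        decoded[enc] = raw.decode(enc)
    except UnicodeDecodeError:
        decoded[enc] = None   # this candidate can be ruled out

for enc, value in decoded.items():
    print(enc, repr(value))
```

Every candidate yields some string, and only one of them is the intended
text, so a detector needs extra evidence beyond byte validity.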

It could in theory be done by looking up the trial decodings in a
database of artists and titles. I doubt a wrong encoding would lead to a
better match on all fields of an ID3 tag than the correct encoding.
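
A rough Python sketch of that lookup idea, purely as an illustration: the
in-memory set KNOWN_TRACKS stands in for a real database, and both the
candidate list and the function name guess_by_lookup are hypothetical.

```python
# Hypothetical stand-in for a database of known (artist, title) pairs.
KNOWN_TRACKS = {
    ("宇多田ヒカル", "光"),
    ("Кино", "Группа крови"),
}

# Assumed candidate encodings to try.
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "cp1251", "koi8_r"]

def guess_by_lookup(artist_raw: bytes, title_raw: bytes) -> list:
    """Return the candidates under which BOTH fields match a known track."""
    matches = []
    for enc in CANDIDATES:
        try:
            pair = (artist_raw.decode(enc), title_raw.decode(enc))
        except UnicodeDecodeError:
            continue  # at least one field is not valid in this encoding
        if pair in KNOWN_TRACKS:
            matches.append(enc)
    return matches

artist = "宇多田ヒカル".encode("shift_jis")
title = "光".encode("shift_jis")
print(guess_by_lookup(artist, title))
```

Requiring all fields to match at once is what makes a false positive from a
wrong encoding unlikely.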

> Browsers choke on detecting encoding if none is specified sometimes even
> when whole page of text is provided; of course, if you have a page you

No doubt they do, but this isn't a strong statement on the difficulty
of the problem. We don't know how much time and effort the browser
developers have put into writing the encoding-guessing code.
Someone could just as well argue that it's hard because ffmpeg doesn't
get it right.

That said, ATM I don't see many reasonable ways for us to guess the
encoding either.
Using an offline database of music titles is a clear no-go, as is
querying an online database (privacy issues here).
One thing that could be tried would be feeding trial-decoded strings
to a spell checker if one is installed. If its wordlist is complete
enough, it might work out in detecting the correct encoding.
I am not sure this is reasonable, but if the code is clean and
obviously optional, I'd apply a patch that adds such a guessing feature.
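
A minimal Python sketch of the spell-checker idea, to make it concrete. It
is not a proposed patch: WORDLIST is a tiny hypothetical stand-in for an
installed spell checker's dictionary, and the candidate list, the threshold
and the function names are all assumptions.

```python
import re

# Hypothetical stand-in for an installed spell checker's word list.
WORDLIST = {"группа", "крови", "звезда", "по", "имени", "солнце"}

# Assumed candidate encodings to try, in order.
CANDIDATES = ["utf-8", "cp1251", "koi8_r", "cp866", "iso8859_5"]

def score(text: str) -> float:
    """Fraction of word-like tokens in text that appear in WORDLIST."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return sum(w in WORDLIST for w in words) / len(words)

def guess_encoding(raw: bytes, threshold: float = 0.5):
    """Return (encoding, score) of the best trial decoding, or None
    if no candidate reaches the threshold."""
    best = None
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, skip it
        s = score(text)
        if best is None or s > best[1]:
            best = (enc, s)
    if best is None or best[1] < threshold:
        return None  # not confident enough; caller keeps its default
    return best

raw = "Группа крови".encode("koi8_r")
print(guess_encoding(raw))
```

Returning None below the threshold keeps the guess optional: with an
incomplete wordlist the caller simply falls back to whatever it did before.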

Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Let us carefully observe those good qualities wherein our enemies excel us
and endeavor to excel them, by avoiding what is faulty, and imitating what
is excellent in them. -- Plutarch