[FFmpeg-devel] [PATCH] "Mojibake" in Japanese

Tue Feb 14 23:52:23 CET 2012

Hi Michael Niedermayer!

On 2012.02.14 at 22:36:04 +0100, Michael Niedermayer wrote:

> It is maybe easier for the end user of a package if it could be
> selected at runtime.
> For example with an environment variable.
> 
> Still better would be if autodetection could be done, is there some
> readily available software (like iconv) that can guess from a char*
> what encoding is used ?

I can answer this, and the answer is "no". From experience, even
detection among a few Cyrillic encodings (based on letter frequencies) is
often impossible if all you have is two or three words. It just won't
work correctly. And that's when you are sure it's a single-byte encoding;
if you add the possibility of multibyte Japanese and some other encodings
into the mix, it becomes impossible to detect anything on something as
short as a typical ID3 tag.
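
To illustrate the ambiguity, here is a minimal Python sketch. It only
checks which codecs accept the bytes of one short word without error;
it is not a detector and has nothing to do with FFmpeg code. The
candidate list and the helper name plausible_decodings are my own
choices for the example. Whether the Japanese codecs also accept a
given tag depends on the bytes.

```python
# Illustrative candidate list (an assumption, not taken from FFmpeg).
CANDIDATES = ["cp1251", "koi8_r", "cp866", "iso8859_5", "shift_jis", "euc_jp"]


def plausible_decodings(raw: bytes) -> dict:
    """Map each candidate encoding that decodes `raw` cleanly to its text."""
    results = {}
    for name in CANDIDATES:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding, so rule it out.
            pass
    return results


# A one-word tag as it would be stored by a cp1251 tagger.
tag = "МОСКВА".encode("cp1251")
decodings = plausible_decodings(tag)

# Every name printed here is an encoding the bytes are valid in.
print(sorted(decodings))
# Number of different texts those encodings produce from the same bytes.
print(len(set(decodings.values())))
```

Every single-byte Cyrillic codec in the list accepts the same six bytes
and each one yields a different string, so validity alone tells you
nothing, and six letters are far too few for frequency statistics.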

Browsers sometimes choke on detecting the encoding when none is
specified, even when a whole page of text is provided. Of course, if you
have a page you can pick between Cyrillic encodings, but when the browser
also considers the possibilities of multibyte encodings of Asian
languages, it often gives totally bogus results. So, even if you think
about combining all the ID3 tags and using that for detection, it still
won't work correctly.

-- 

Vladimir