[FFmpeg-devel] [PATCH] "Mojibake" in Japanese
tetu at eth0.jp
Thu Feb 23 23:01:25 CET 2012
Hi Michael Niedermayer!
> It is maybe easier for the end user of a package if it could be
> selected at runtime.
> For example with an environment variable.
Thanks for the idea!
I rewrote the patch to use an environment variable.
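To illustrate the runtime selection idea, here is a minimal sketch in Python (not the attached patch itself). The variable name ID3_FALLBACK_CHARSET and the helper decode_tag are made up for this example; the patch may use different names and behaviour.

```python
import os

# Hypothetical variable name, used only for illustration.
ENV_VAR = "ID3_FALLBACK_CHARSET"

def decode_tag(raw: bytes, environ=os.environ) -> str:
    """Decode tag bytes with the charset named in ENV_VAR, if one is set."""
    charset = environ.get(ENV_VAR, "latin-1")
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: fall back to Latin-1,
        # which maps every byte to a code point and so never fails.
        return raw.decode("latin-1")
```

With the variable set to cp932, bytes written as Shift_JIS come back as the intended Japanese text; with the variable unset, the same bytes are read as Latin-1 and show up as mojibake.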
> Still better would be if autodetection could be done, is there some
> readily available software (like iconv) that can guess from a char*
> what encoding is used ?
Do you know juniversalchardet?
universalchardet is the encoding detector used in Mozilla Firefox,
and juniversalchardet is universalchardet packaged as a library.
I have never used it, though.
I don't think auto-detection would be accurate,
because ID3 tags are very short strings.
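A small Python sketch of why short strings are hard to classify: the same few bytes often decode without error under several encodings, so validity alone cannot tell them apart. The candidate list and the function name plausible_encodings are my own choices for the example, not something taken from universalchardet.

```python
# Candidate list chosen for illustration; a real detector would consider more.
CANDIDATES = ["cp932", "euc_jp", "cp1251", "koi8_r", "latin-1"]

def plausible_encodings(raw: bytes, candidates=CANDIDATES):
    """Return the candidates under which raw decodes without an error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, rule it out
        ok.append(name)
    return ok
```

For a two-character Shift_JIS title, the result still includes single-byte encodings such as KOI8-R and Latin-1, since nearly any byte sequence is valid in them. A detector then has to fall back on statistics, and a few bytes give it very little to work with.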
2012/2/15 Michael Niedermayer <michaelni at gmx.at>
> Hi Vladimir
> On Wed, Feb 15, 2012 at 02:52:23AM +0400, Vladimir Mosgalin wrote:
> > Hi Michael Niedermayer!
> > On 2012.02.14 at 22:36:04 +0100, Michael Niedermayer wrote next:
> > > It is maybe easier for the end user of a package if it could be
> > > selected at runtime.
> > > For example with an environment variable.
> > >
> > > Still better would be if autodetection could be done, is there some
> > > readily available software (like iconv) that can guess from a char*
> > > what encoding is used ?
> > I can answer this, and the answer is "no". From experience, even
> > detection among a few Cyrillic encodings (based on letter frequencies) is
> > often impossible if all you have is two or three words. It just won't
> > work correctly. And that's when you are sure it's a single-byte encoding;
> > if you add the possibility of multibyte Japanese and some others into the
> > mix, it becomes impossible to detect anything on something as short as
> > a typical ID3 tag.
> It could in theory be done by looking the decodings up in a database
> of artists and titles. I doubt a false encoding would lead to a better
> match on all fields of an ID3 tag than the correct encoding.
> > Browsers choke on detecting encoding if none is specified sometimes even
> > when whole page of text is provided; of course, if you have a page you
> No doubt they do, but this isn't a strong statement on the difficulty
> of the problem. We don't know how much time and effort the browser
> developers have put into writing the encoding-guessing code.
> Someone could just as well argue that it's hard because ffmpeg doesn't
> get it right.
> That said, ATM I don't see many reasonable ways for us to guess the
> encoding either.
> Using an offline database of music titles is a clear no-go, as much as
> querying an online database is (privacy issues here).
> One thing that could be tried would be feeding trial-decoded strings
> to a spell checker if one is installed. If its wordlist is complete
> enough, it might work out in detecting the correct encoding.
> I am not sure this is reasonable, but if the code is clean and
> obviously optional, I'd apply a patch that adds such a guessing feature.
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
> Let us carefully observe those good qualities wherein our enemies excel us
> and endeavor to excel them, by avoiding what is faulty, and imitating what
> is excellent in them. -- Plutarch
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
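The spell-checker idea quoted above could be sketched roughly as follows in Python. This is only an illustration under my own assumptions: the wordlist is a plain set passed in by the caller rather than a real spell checker, and the function name guess_encoding is made up.

```python
def guess_encoding(raw: bytes, candidates, wordlist):
    """Pick the candidate whose trial decoding has the most wordlist hits.

    Returns (encoding_name, hit_count); encoding_name is None if no
    candidate decodes the bytes at all.
    """
    best_name, best_score = None, -1
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot be the encoding
        # Count whitespace-separated words that the wordlist knows.
        score = sum(1 for word in text.lower().split() if word in wordlist)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

Two caveats: when no decoding produces a known word, the first decodable candidate wins with a score of 0, so a caller should treat a zero score as "unknown". Splitting on whitespace also does not segment Japanese text, so this sketch would need a different tokenizer for that case.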
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 5509 bytes
Desc: not available