[MPlayer-dev-eng] [PATCH] Libass BOM fix
Ulion
ulion2002 at gmail.com
Sat Oct 27 15:36:48 CEST 2007
2007/10/27, Evgeniy Stepanov <eugeni.stepanov at gmail.com>:
> On Saturday 27 October 2007 16:49:04 Ulion wrote:
> > The BOM header for UTF-16 is widely used, at least on Windows. Try saving
> > a file with the 'Unicode' or 'Unicode big endian' encoding in Notepad;
> > you will get a UTF-16 encoded file with a BOM header.
> > I think the BOM should override the codepage value, since the BOM is a
> > hard-coded encoding marker for UTF-16. There is only one subcp we can
> > set, but more than one subtitle file will use that setting. In my case,
> > I set subcp=cp936 as my default, so this ASS file will never load
> > unless the BOM is handled.
> > About BOM: http://en.wikipedia.org/wiki/Byte_Order_Mark
> > I think this patch will not break things; it fixes things.
>
> I know what BOM is, but it applies only to unicode. In lots of other
> encodings, both 0xFF and 0xFE are valid characters. They can appear in the
> beginning of the file. Then, with your patch, this file will be mistakenly
> identified as UCS, overriding subcp setting.
As far as I know, files like the ones you describe do not exist. If they
do, please show me one.
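
For reference, the check being argued about amounts to looking at the first two bytes of the file. The following is only a minimal sketch, not the code from the patch: `bom_codepage` is a hypothetical helper name, and the returned strings are iconv-style encoding names picked for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not the patch itself): inspect the first two bytes
 * of a subtitle buffer and return an encoding name if they look like a
 * UTF-16 byte order mark, or NULL if they do not.
 */
static const char *bom_codepage(const unsigned char *buf, size_t size)
{
    if (buf == NULL || size < 2)
        return NULL;                 /* too short to carry a 2-byte BOM */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";           /* U+FEFF stored little-endian */
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";           /* U+FEFF stored big-endian */
    return NULL;                     /* no UTF-16 BOM found */
}
```

Note that this test alone cannot tell a real BOM from a file in a single-byte codepage that happens to start with the bytes 0xFF 0xFE, which is exactly the objection quoted above.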
>
> Currently, when subcp argument does not start with 'enca:', every subtitle
> file is assumed to be in the given codepage without _any_ autodetection. So,
> with the setting subcp=cp936, unicode subtitles will not work. Your patch
> enables autodetection only for UCS-2BE and UCS-2LE, but not for UCS-2 (w/o
> BOM) or UTF-8 or any other codepage.
>
> It seems to be the wrong way. If you want autodetection, use enca.
>
> > > The attached file works fine here, because it is LE and my machine is
> > > also LE. However, UCS-2BE files indeed cannot be opened without explicit
> > > -subcp. The problem is, enca detects file encoding as simply 'UCS-2', and
> > > iconv does not pay attention to BOM.
> > >
> > > In fact, enca is able to detect the endianness of unicode files, even
> > > without a BOM sometimes. This information is available via EncaSurface.
> > > It seems a good idea to use it, and only do manual detection when not
> > > using enca.
> >
> > I did not know about enca; I will give it a try. Even if enca does part
> > of the work for us, I think this ASS file will still not work on my
> > machine (PowerPC G5), since its byte order is big-endian.
> >
> > If you accept this patch, should I move the BOM check code into
> > sub_recode? Currently I put it after read_file, because the BOM is
> > generally read directly from UTF-16 files.
>
> I'd prefer a solution using EncaSurface. If you insist, this code is also ok,
> but it must be moved into sub_recode, under #ifdef ENCA, and executed only if
> the codepage autodetected by enca is 'UCS-2' or 'UTF-16'.
enca 1.9 guesses the codepage as UCS-2, which works with iconv 2.4 without
problems for both UTF-16LE and UTF-16BE streams with a BOM.
At least enca resolves my problem even without this patch, so that is
acceptable for me :)
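
For the record, the restricted variant described in the quoted text above (only look at the BOM when the autodetected codepage is 'UCS-2' or 'UTF-16') could look roughly like the sketch below. Everything here is an assumption for illustration: `refine_unicode_cp` is a hypothetical name, the `guessed` string merely stands in for whatever name the detector reports, and no enca or iconv function is called.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: refine a detector's generic Unicode guess using
 * the BOM. Any other guessed codepage is returned unchanged, so a file
 * in a single-byte codepage that starts with 0xFF 0xFE is left alone.
 */
static const char *refine_unicode_cp(const char *guessed,
                                     const unsigned char *buf, size_t size)
{
    if (guessed == NULL)
        return NULL;
    /* Only trust the BOM when the guess was already a 16-bit Unicode form. */
    if (strcmp(guessed, "UCS-2") != 0 && strcmp(guessed, "UTF-16") != 0)
        return guessed;
    if (buf != NULL && size >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return "UTF-16LE";
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return "UTF-16BE";
    }
    return guessed;                  /* no BOM: keep the original guess */
}
```

With this shape, subcp=cp936 without 'enca:' would still bypass the check entirely, because no guess is ever made in that case.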
--
Ulion