[MPlayer-dev-eng] [PATCH] Libass BOM fix
Evgeniy Stepanov
eugeni.stepanov at gmail.com
Sat Oct 27 15:25:05 CEST 2007
On Saturday 27 October 2007 16:49:04 Ulion wrote:
> The BOM header for UTF-16 is widely used, at least on Windows. Try saving a
> file with 'unicode' or 'unicode big endian' encoding in Notepad; you
> will get a UTF-16 encoded file with a BOM header.
> I think the BOM should override the codepage value, since the BOM is a
> hardcoded encoding marker for UTF-16. There's only one subcp we can
> set, but more than one subtitle file will use that setting. In my case, I
> set subcp=cp936 as my default setting, so this ass file will always fail
> to load unless a BOM fix is applied.
> About BOM: http://en.wikipedia.org/wiki/Byte_Order_Mark
> I think this patch will not break things; it fixes things.
I know what BOM is, but it applies only to unicode. In lots of other
encodings, both 0xFF and 0xFE are valid characters. They can appear in the
beginning of the file. Then, with your patch, this file will be mistakenly
identified as UCS, overriding subcp setting.
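To make the false-positive concrete, here is a minimal sketch of a BOM-only check. The names bom_kind and check_utf16_bom are hypothetical and are not taken from the patch or from libass. The check looks at nothing but the first two bytes, so a file in a single-byte codepage that happens to start with 0xFF 0xFE (for example the letters "яю" in CP1251) is reported as UTF-16LE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not from the patch or libass. */
enum bom_kind { BOM_NONE = 0, BOM_UTF16_LE, BOM_UTF16_BE };

/* Classify a buffer by looking only at its first two bytes. */
static enum bom_kind check_utf16_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2)
        return BOM_NONE;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;   /* U+FEFF stored little-endian */
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;   /* U+FEFF stored big-endian */
    return BOM_NONE;
}
```

The function cannot tell a real UTF-16LE file from CP1251 text beginning with those two bytes, which is why an unconditional check would override an explicit subcp setting in the wrong cases.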
Currently, when subcp argument does not start with 'enca:', every subtitle
file is assumed to be in the given codepage without _any_ autodetection. So,
with the setting subcp=cp936, unicode subtitles will not work. Your patch
enables autodetection only for UCS-2BE and UCS-2LE, but not for UCS-2 (w/o
BOM) or UTF-8 or any other codepage.
It seems to be the wrong way. If you want autodetection, use enca.
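The rule described above (autodetection only when subcp starts with 'enca:') can be sketched as a simple prefix test. The helper name subcp_wants_enca is hypothetical, and the sketch does not reproduce the actual MPlayer option parsing.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns nonzero when the subcp value asks for
 * enca autodetection, i.e. when it starts with the "enca:" prefix.
 * Any other value is treated as a fixed codepage name. */
static int subcp_wants_enca(const char *subcp)
{
    assert(subcp != NULL);
    return strncmp(subcp, "enca:", 5) == 0;
}
```

With this rule, a value such as "cp936" means every subtitle file is recoded from cp936 with no detection at all, while a value of the form "enca:zh:cp936" hands detection to enca and keeps cp936 only as the fallback.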
> > The attached file works fine here, because it is LE and my machine is
> > also LE. However, UCS-2BE files indeed cannot be opened without explicit
> > -subcp. The problem is, enca detects file encoding as simply 'UCS-2', and
> > iconv does not pay attention to BOM.
> >
> > In fact, enca is able to detect the endianness of unicode files, even
> > without a BOM sometimes. This information is available via EncaSurface.
> > It seems a good idea to use it, and only do manual detection when not
> > using enca.
>
> I did not know about enca; I will give it a try. Even if enca does part of
> the work for us, on my machine (PowerPC G5) I think this ass file will
> still not work, since my machine's byte order is big-endian.
>
> If you accept this patch, should I move the BOM check code into
> sub_recode? Currently I put it after read_file because the BOM is
> generally read directly from UTF-16 files.
I'd prefer a solution using EncaSurface. If you insist, this code is also ok,
but it must be moved into sub_recode, under #ifdef ENCA, and executed only if
the codepage autodetected by enca is 'UCS-2' or 'UTF-16'.
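A rough sketch of the fallback described above, under stated assumptions: the function name refine_unicode_cp is hypothetical, the detected name is assumed to arrive as a plain string such as "UCS-2" or "UTF-16", and the returned names "UTF-16LE" and "UTF-16BE" are assumed to be accepted by the local iconv. In the real code this would sit inside sub_recode under #ifdef ENCA; the guard is left out here so the fragment compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: look at the BOM only when the detected codepage
 * is an endianness-agnostic Unicode name. Any other codepage is returned
 * unchanged, so 0xFF 0xFE at the start of a non-Unicode file is ignored. */
static const char *refine_unicode_cp(const char *detected,
                                     const unsigned char *buf, size_t len)
{
    assert(detected != NULL);
    if (strcmp(detected, "UCS-2") != 0 && strcmp(detected, "UTF-16") != 0)
        return detected;               /* not Unicode: never inspect the BOM */
    if (buf == NULL || len < 2)
        return detected;               /* too short to carry a BOM */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    return detected;                   /* no BOM: keep the detected name */
}
```

The caller would pass the result to iconv in place of the bare 'UCS-2', and may want to skip the two BOM bytes before converting. A solution based on EncaSurface would replace the byte comparison with the surface information reported by enca.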