[MPlayer-dev-eng] [PATCH] Libass BOM fix

Ulion ulion2002 at gmail.com
Sat Oct 27 14:49:04 CEST 2007


2007/10/27, Evgeniy Stepanov <eugeni.stepanov at gmail.com>:
> On Saturday 27 October 2007 11:06:51 Ulion wrote:
> > Hello,
> >
> > Some ass/ssa files are UTF-16 encoded, with a BOM (byte order mark) at
> > the beginning of the file.
> > libass does not handle such files correctly and fails to load
> > these subtitle files.
> > This patch fixes the problem. The second attachment is a UTF-16
> > encoded ass file for testing.
>
> It might break things if a file in some other encoding starts with 'FFFE'
> or 'FEFF'. Not sure if it can happen. It's probably safer to only do this
> detection if codepage is already some kind of unicode.

The UTF-16 BOM is widely used, at least on Windows. If you save a
file from Notepad with the 'Unicode' or 'Unicode big endian' encoding,
you get a UTF-16 encoded file with a BOM.
I think the BOM should override the codepage value, since the BOM is a
fixed encoding marker for UTF-16. Only one subcp can be set, but
more than one subtitle file will use that setting. For example, I
set subcp=cp936 as my default, so this ass file will always fail to
load unless the BOM is handled.
About the BOM: http://en.wikipedia.org/wiki/Byte_Order_Mark
I do not think this patch will break things; it fixes things.
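
For illustration, a BOM check of this kind could look roughly like the
following. This is only a sketch and not the code from the patch: the
function name detect_utf16_bom is hypothetical, and returning an
iconv-style codepage name to override subcp is an assumption about how
the result would be used.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from the patch: inspect the first two
 * bytes of a buffer and return an explicit codepage name if a UTF-16 BOM
 * is present, or NULL so the caller keeps its configured subcp.
 * Note: FF FE 00 00 is also the UTF-32LE BOM; that case is not handled. */
static const char *detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (len < 2)
        return NULL;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";   /* bytes FF FE: little-endian */
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";   /* bytes FE FF: big-endian */
    return NULL;             /* no BOM: leave the codepage alone */
}
```

Because the comparison is done byte by byte, the result does not depend
on the byte order of the machine running the check.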

>
> The attached file works fine here, because it is LE and my machine is also LE.
> However, UCS-2BE files indeed cannot be opened without explicit -subcp. The
> problem is, enca detects file encoding as simply 'UCS-2', and iconv does not
> pay attention to BOM.
>
> In fact, enca is able to detect the endianness of unicode files, even without
> a BOM sometimes. This information is available via EncaSurface. It seems a
> good idea to use it, and only do manual detection when not using enca.

I did not know about enca; I will give it a try. Even if enca does part
of the work for us, I think this ass file will still not work on my
machine (a PowerPC G5), since its byte order is big-endian.
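
The byte-order concern can be shown with a small sketch. It assumes a
hypothetical helper, read_u16, that is not part of libass or of the
patch: when a 16-bit code unit is assembled from individual bytes with
an explicit byte order, the result is the same on little-endian and
big-endian hosts, whereas interpreting the file in the host's native
byte order is not.

```c
#include <stdint.h>

/* Hypothetical helper: build one UTF-16 code unit from two bytes using
 * an explicitly chosen byte order instead of the host's native order. */
static uint16_t read_u16(const unsigned char *p, int big_endian)
{
    if (big_endian)
        return (uint16_t)((p[0] << 8) | p[1]);   /* high byte first */
    return (uint16_t)((p[1] << 8) | p[0]);       /* low byte first */
}
```

For example, the bytes 5B 00 read as little-endian and the bytes 00 5B
read as big-endian both give U+005B ('['), regardless of the host.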

If you accept this patch, should I move the BOM check code into
sub_recode? Currently I put it after read_file, because the BOM is
generally read directly from UTF-16 files.


-- 
Ulion


