[MPlayer-dev-eng] [PATCH] UTF-8 osd messages

Fri Apr 28 13:37:36 CEST 2006

On Fri, Apr 28, 2006 at 10:46:10AM +0300, Ivan Kalvachev wrote:
> 2006/4/28, Adam Tlałka <atlka at pg.gda.pl>:
> > On Tue, Apr 25, 2006 at 02:56:59PM +0200, Reimar Döffinger wrote:
> > > Hi,
> > > On Tue, Apr 25, 2006 at 09:54:02AM +0300, Ivan Kalvachev wrote:
> > > > 2006/4/23, Reimar Döffinger <Reimar.Doeffinger at stud.uni-karlsruhe.de>:
> > > > > Also, the utf8_get_char function is a bit overkill for this, since it will
> > > > > always encounter valid UTF-8 in this case, but I made it more flexible in
> > > > > the hope it will be used in other places as well in the future.
> > > >
> > > > I think this is going to be the second utf-8 parser in that file, as
> > > > subs are (usually) already utf-8.
> > > > Take a look of vo_update_text_sub::382
> > >
> > > I know, but that code could in case of invalid encoding skip the terminating
> > > NULL and cause a segfault. Thus it is unsuitable for most cases, and this is
> > > actually one of the utf8-parsers I'd like to replace.
> > >
> > What about storing strings internally in UCS-2?
> > Parse once when reading so there will be no need to recode UTF-8 ->
> > UCS-2 before displaying.
> 
> If I remember right UCS-2 is 2 byte unicode. Unfortunately the Unicode
> standard wasn't finished at the time M$ created that. It was finally
> decided that full unicode is 4 bytes (Of course not the whole range is
> used). This required some hacks to be made to support UCS-2 (this is
> what Rathann is referring to when pointing to UTF-16).
> 
> Using fixed size characters may have some benefits, but in our case we
> don't need it.

In libvo/sub.c -> vo_update_text_sub, in the case of UTF-8 subs, we recode
UTF-8 to font positions. Only 2- and 3-byte sequences are handled, which
means U+0080..U+FFFF, so it is just a UCS-2 Unicode representation. ;-)
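
To make the range claim concrete, here is a small sketch of how many
bytes UTF-8 needs per code point. The function name utf8_encoded_len is
made up for this mail and is not part of MPlayer. It shows that 2- and
3-byte sequences together cover exactly U+0080..U+FFFF, which fits in
16 bits.

```c
#include <stdint.h>

/* Hypothetical helper, not MPlayer code.
 * Returns the number of bytes UTF-8 needs for code point cp,
 * or 0 if cp cannot be encoded (surrogate or above U+10FFFF). */
int utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;                       /* U+0000..U+007F: plain ASCII */
    if (cp < 0x800)
        return 2;                       /* U+0080..U+07FF */
    if (cp < 0x10000)                   /* U+0800..U+FFFF, minus surrogates */
        return (cp >= 0xD800 && cp <= 0xDFFF) ? 0 : 3;
    if (cp <= 0x10FFFF)
        return 4;                       /* does not fit in 16 bits */
    return 0;
}
```

Anything that needs 4 bytes lies outside what a 2-byte representation
can hold without surrogate pairs.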

Of course this UTF-8 -> UCS-2 (2-byte Unicode) decoder is quite simple
and does not detect improper UTF-8 sequences. But it's fast.
If iconv is working properly then there should be no problem with that
approach, but in the case of broken UTF-8 encoded subtitles it can lead
to undesired effects: we just read the text from the file, with no UTF-8 check.
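
For comparison, a checking decoder does not have to be much more
complicated. The following is only a sketch under my own assumptions:
the name utf8_decode_strict and the choice of returning U+FFFD for bad
input are invented here, and this is neither the existing code in
libvo/sub.c nor the utf8_get_char from the patch. The point it shows is
that a NUL byte is never accepted as a continuation byte, so a
truncated sequence cannot run past the end of the string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UTF8_INVALID 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER (assumed policy) */

/* Hypothetical strict decoder. Decodes one code point from the
 * NUL-terminated string at *pp and advances *pp past the bytes used.
 * Returns 0 at the terminator (without advancing) and UTF8_INVALID
 * for any malformed, overlong, surrogate or out-of-range sequence. */
uint32_t utf8_decode_strict(const unsigned char **pp)
{
    assert(pp != NULL && *pp != NULL);
    const unsigned char *p = *pp;
    uint32_t c = *p;
    uint32_t min;   /* smallest code point allowed for this length */
    int extra;      /* number of continuation bytes expected */

    if (c == 0)
        return 0;   /* end of string: stay on the terminator */
    p++;
    if (c < 0x80) {
        *pp = p;
        return c;
    }
    if (c >= 0xC2 && c <= 0xDF)      { extra = 1; c &= 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0)     { extra = 2; c &= 0x0F; min = 0x800; }
    else if (c >= 0xF0 && c <= 0xF4) { extra = 3; c &= 0x07; min = 0x10000; }
    else {
        *pp = p;    /* stray continuation byte or invalid lead byte */
        return UTF8_INVALID;
    }
    while (extra-- > 0) {
        /* NUL (0x00) fails this test, so we never step over it */
        if ((*p & 0xC0) != 0x80) {
            *pp = p;
            return UTF8_INVALID;
        }
        c = (c << 6) | (*p++ & 0x3F);
    }
    *pp = p;
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return UTF8_INVALID;
    return c;
}
```

A caller would loop until the function returns 0 and decide for itself
whether to draw, skip or report the UTF8_INVALID results.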

So maybe it is better to convert directly to UCS-2, or just to the font
positions, while reading subtitle files and detect errors there
and not in the subtitle display function?

Anyway, in the case where sub_utf8 is not set we already treat vo_sub->text
bytes as font glyph positions, so maybe it can always have
this meaning and we can treat them as u8 or u16 glyph positions
depending on some flag (sub_ucs  0:u8 1:u16)?
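
Roughly what I have in mind, as a sketch only. The helper names
glyph_at and glyph_count and the enum are invented for illustration,
and I assume a u16 buffer is 2-byte aligned and that both forms end
with a 0 element:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for the proposed sub_ucs flag */
enum { SUB_UCS_U8 = 0, SUB_UCS_U16 = 1 };

/* Return the glyph position stored at index i of text.
 * text holds uint8_t elements when sub_ucs is 0 and
 * uint16_t elements (suitably aligned) when sub_ucs is 1. */
unsigned glyph_at(const void *text, int sub_ucs, size_t i)
{
    if (sub_ucs == SUB_UCS_U16)
        return ((const uint16_t *)text)[i];
    return ((const uint8_t *)text)[i];
}

/* Count glyph positions up to, but not including, the 0 terminator. */
size_t glyph_count(const void *text, int sub_ucs)
{
    size_t n = 0;
    while (glyph_at(text, sub_ucs, n) != 0)
        n++;
    return n;
}
```

The display code would then only call glyph_at and never care how the
text was encoded in the subtitle file.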

It is more universal and could be prepared even for UCS-4 in the future.

Regards
-- 
Adam Tlałka       mailto:atlka at pg.gda.pl    ^v^ ^v^ ^v^
PGP public key:   finger atlka at sunrise.pg.gda.pl