[MPlayer-dev-eng] [PATCH]breakline properly with subtitles using Chinese

Rich Felker dalias at aerifal.cx
Fri Nov 25 02:02:32 CET 2005


On Thu, Nov 24, 2005 at 03:20:35PM -0500, The Wanderer wrote:
> Rich Felker wrote:
> 
> >On Thu, Nov 24, 2005 at 11:10:36PM +0800, ?????? wrote:
> >
> >>Hi,
> >>
> >>MPlayer treats a run of characters without whitespace as a single
> >>word and tries to render it on one line; if the word is too long,
> >>it is truncated. However, Chinese and many other Asian languages
> >>do not use whitespace to separate words, so MPlayer usually treats
> >>a whole sentence as one word and truncates it when it does not fit
> >>on one line. This patch detects whether a character is an Asian
> >>character (I assume any char > 0x800 is an Asian char) and tries
> >>to break the sentence when it cannot be rendered as one word.
> >>Please test it, thx.
> >
> >definitely not correct; not all asian languages or all languages
> >using characters past 0x800 are splittable at any point! you'll have
> >to special-case it much more if you want a patch like this to be 
> >accepted. i don't know the correct splitting algorithms so you'll
> >have to do it yourself but i imagine it involves a database of all
> >words for chinese and japanese.
> 
> This is not feasible. There is no available (or, possibly, even
> *existing*) database of "all valid Japanese words"; edict isn't a bad
> start, but from what I can tell it is decidedly far from complete, and
> has considerable warts (judging by how many are reported and fixed on a
> regular basis). For that matter, as far as I'm aware there is not
> actually the concept of "word" in Japanese; certainly it is possible to
> insert a line break in between any two characters in a Japanese sentence
> without problems (I see it all the time in, say, video games).
> 
> Yes, a better way than the simple assumption above of determining
> whether or not a given character is "Asian" (that is, can have a line
> break after it regardless of context) is needed - but a "database of all
> words" is not an available solution.

Well, if it's considered ok to break Japanese words between lines, then
I suppose you can insert breaks between any Japanese characters.
However: Unicode does not distinguish between Chinese/Japanese/Korean
characters, so is the same permissible in all three? Anyway, it's
definitely not acceptable to do the splitting in _any_ Asian language.
Like I said, in Tibetan multiple characters make up a syllable and may
not be split, and naturally you can never split a combining character
(present in Tibetan, Thai, and probably many South Asian languages)
from the character(s) it combines with!
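
To make the combining-character point concrete, here is a rough sketch
in Python (not MPlayer code; the function names are made up for
illustration). It assumes that "combining character" can be approximated
by the Unicode general categories Mn, Mc and Me, which is a
simplification and does not cover Tibetan syllable structure as a whole:

```python
import unicodedata

def is_combining_mark(ch):
    """True if ch has a Unicode 'Mark' general category (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")

def splits_combining_sequence(text, i):
    """True if a line break placed before text[i] would detach a
    combining mark from the character it combines with."""
    return 0 < i < len(text) and is_combining_mark(text[i])

# Thai KO KAI followed by the combining vowel SARA I: both are above
# 0x800, yet a break between them must be rejected.
thai = "\u0e01\u0e34"
unsafe = splits_combining_sequence(thai, 1)
```

A check like this only rules out the worst breaks; it says nothing about
where a break is actually desirable.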

Perhaps the original proposal would be ok if you s/asian/CJK/. I just
don't know. But it's definitely not acceptable to do this for all
chars > 0x800!
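
For what s/asian/CJK/ might look like, a minimal sketch in Python
follows (again not the patch itself). The ranges in CJK_RANGES are my
own assumption for illustration: Hiragana, Katakana and the basic CJK
Unified Ideographs block. Hangul is left out on purpose, since whether
Korean may be split this way is exactly the open question, and
punctuation rules are ignored entirely:

```python
# Assumed ranges, for illustration only; not taken from the patch.
CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)

def is_cjk(ch):
    """True if ch falls inside one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

def break_points(text):
    """Indices i where a line break may be placed before text[i].
    Conservative: both neighbouring characters must be CJK."""
    return [i for i in range(1, len(text))
            if is_cjk(text[i - 1]) and is_cjk(text[i])]
```

With this, a Chinese string can be broken between any two ideographs,
while Thai or Tibetan text gets no break opportunities at all even
though its characters are above 0x800.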

Rich



