[MPlayer-dev-eng] [PATCH]breakline properly with subtitles using Chinese
Timothy Lee
timothy.lee at siriushk.com
Fri Nov 25 02:57:49 CET 2005
Rich Felker wrote:
> On Thu, Nov 24, 2005 at 03:20:35PM -0500, The Wanderer wrote:
>
>> Rich Felker wrote:
>>
>>
>>> On Thu, Nov 24, 2005 at 11:10:36PM +0800, ?????? wrote:
>>>
>>>
>>>> Hi,
>>>>
>>>> MPlayer treats a run of characters without whitespace as a single
>>>> word and tries to render it on one line; if the word is too long, it
>>>> is truncated. However, Chinese and many other Asian languages do not
>>>> use whitespace to separate words, so MPlayer usually treats a whole
>>>> sentence as one word and truncates it when it cannot be rendered on
>>>> one line. This patch detects whether a character is an Asian
>>>> character (I assume any char > 0x800 is an Asian char) and tries to
>>>> break the sentence when it cannot be rendered as a single word.
>>>> Please test it, thanks.
>>>>
>>> definitely not correct; not all asian languages or all languages
>>> using characters past 0x800 are splittable at any point! you'll have
>>> to special-case it much more if you want a patch like this to be
>>> accepted. i don't know the correct splitting algorithms so you'll
>>> have to do it yourself but i imagine it involves a database of all
>>> words for chinese and japanese.
>>>
>> This is not feasible. There is no available (or, possibly, even
>> *existing*) database of "all valid Japanese words"; edict isn't a bad
>> start, but from what I can tell it is decidedly far from complete, and
>> has considerable warts (judging by how many are reported and fixed on a
>> regular basis). For that matter, as far as I'm aware there is not
>> actually the concept of "word" in Japanese; certainly it is possible to
>> insert a line break in between any two characters in a Japanese sentence
>> without problems (I see it all the time in, say, video games).
>>
>> Yes, a better way than the simple assumption above of determining
>> whether or not a given character is "Asian" (that is, can have a line
>> break after it regardless of context) is needed - but a "database of all
>> words" is not an available solution.
>>
>
> Well if it's considered ok to break Japanese words between lines, then
> I suppose you can insert breaks between any Japanese characters.
> However: Unicode does not distinguish between Chinese/Japanese/Korean
> characters, so is the same permissible in all three?? Anyway it's
> definitely not acceptable to do the splitting in _any_ asian language.
> Like I said in Tibetan multiple characters make up a syllable and may
> not be split, and naturally you can never split a combining character
> (present in Tibetan, Thai, and probably many south asian languages)
> from the character(s) it combines with!
>
> Perhaps the original proposal would be ok if you s/asian/CJK/. I just
> don't know. But it's definitely not acceptable doing this for all
> chars > 0x800!
>
>
This is a function I wrote for another library that uses CJK code blocks
to check for permissible line breaks. Perhaps it can be used as a
reference:

// Returns non-zero value if character allows line-break after it
int is_linebreak(unsigned int ucs)
{
    /* Space or tab */
    if (ucs == ' ' || ucs == '\t') return 1;

    /* U+2E80..U+2EFF: CJK Radical Supplement */
    /* U+2F00..U+2FDF: Kangxi Radicals */
    /* U+2FF0..U+2FFF: Ideographic Description Characters */
    /* U+3000..U+303F: CJK Symbols and Punctuation */
    /* U+3040..U+309F: Hiragana */
    /* U+30A0..U+30FF: Katakana */
    /* U+3100..U+312F: Bopomofo */
    /* U+3130..U+318F: Hangul Compatibility Jamo */
    /* U+3190..U+319F: Kanbun */
    /* U+31A0..U+31BF: Bopomofo Extended */
    /* U+31C0..U+31EF: CJK Strokes */
    /* U+31F0..U+31FF: Katakana Phonetic Extensions */
    /* U+3200..U+32FF: Enclosed CJK Letters and Months */
    /* U+3300..U+33FF: CJK Compatibility */
    /* U+3400..U+4DB5: CJK Ideographs Extension A */
    if (ucs >= 0x2e80 && ucs <= 0x4db5) return 1;

    /* U+4E00..U+9FBB: CJK Ideographs */
    if (ucs >= 0x4e00 && ucs <= 0x9fbb) return 1;

    /* U+A000..U+A48F: Yi Syllables */
    /* U+A490..U+A4CF: Yi Radicals */
    if (ucs >= 0xa000 && ucs <= 0xa4cf) return 1;

    /* U+F900..U+FAFF: CJK Compatibility Ideographs */
    if (ucs >= 0xf900 && ucs <= 0xfaff) return 1;

    /* U+FE30..U+FE4F: CJK Compatibility Forms */
    /* U+FE50..U+FE6F: Small Form Variants */
    if (ucs >= 0xfe30 && ucs <= 0xfe6f) return 1;

    /* U+FF00..U+FFEF: Halfwidth and Fullwidth Forms */
    /* Only the fullwidth comma, full stop, colon and semicolon, plus */
    /* U+FF60..U+FFDF (mostly halfwidth forms), allow a break here.   */
    if (ucs == 0xff0c || ucs == 0xff0e || ucs == 0xff1a || ucs == 0xff1b ||
        (ucs >= 0xff60 && ucs <= 0xffdf)) return 1;

    return 0;
}
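
As a rough illustration of how a check like is_linebreak() might be used
when wrapping a line, here is a minimal sketch. first_line_length() is a
hypothetical helper name, not something taken from MPlayer or from the
library mentioned above. It assumes the text has already been decoded to
Unicode code points, and it measures the line in characters rather than
rendered pixel width, which a real subtitle renderer would need. A condensed
copy of is_linebreak() is included only so that the sketch compiles on its
own.

```c
#include <stddef.h>

/* Condensed copy of is_linebreak() so this sketch is self-contained. */
static int is_linebreak(unsigned int ucs)
{
    if (ucs == ' ' || ucs == '\t') return 1;
    if (ucs >= 0x2e80 && ucs <= 0x4db5) return 1;
    if (ucs >= 0x4e00 && ucs <= 0x9fbb) return 1;
    if (ucs >= 0xa000 && ucs <= 0xa4cf) return 1;
    if (ucs >= 0xf900 && ucs <= 0xfaff) return 1;
    if (ucs >= 0xfe30 && ucs <= 0xfe6f) return 1;
    if (ucs == 0xff0c || ucs == 0xff0e || ucs == 0xff1a || ucs == 0xff1b ||
        (ucs >= 0xff60 && ucs <= 0xffdf)) return 1;
    return 0;
}

/*
 * Hypothetical helper (an assumption for illustration only).
 * Returns how many code points of text[0..len) should go on the first
 * line when at most max_chars characters fit on a line. max_chars is
 * expected to be greater than zero.
 */
static size_t first_line_length(const unsigned int *text, size_t len,
                                size_t max_chars)
{
    size_t i;

    /* Everything fits: no break needed. */
    if (len <= max_chars) return len;

    /* Scan backwards for the last character that allows a break after it. */
    for (i = max_chars; i > 0; i--)
        if (is_linebreak(text[i - 1])) return i;

    /* No permissible break point found: fall back to a hard split. */
    return max_chars;
}
```

A caller would invoke this repeatedly, advancing the text pointer by the
returned count each time, until the remaining text fits on one line. Note
that this does nothing about combining characters, which were raised
earlier in the thread and would need a separate check.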