[MPlayer-dev-eng] [PATCH]breakline properly with subtitles using Chinese

Timothy Lee timothy.lee at siriushk.com
Fri Nov 25 02:57:49 CET 2005


Rich Felker wrote:
> On Thu, Nov 24, 2005 at 03:20:35PM -0500, The Wanderer wrote:
>   
>> Rich Felker wrote:
>>
>>     
>>> On Thu, Nov 24, 2005 at 11:10:36PM +0800, ?????? wrote:
>>>
>>>       
>>>> Hi,
>>>>
>>>> MPlayer treats a run of characters without whitespace as a single
>>>> word and tries to render it on one line; if it is too long, the word
>>>> is truncated. However, Chinese and many other Asian languages do not
>>>> use whitespace to separate words, so MPlayer usually treats a whole
>>>> sentence as one word, and the sentence is truncated if it cannot be
>>>> rendered on one line. This patch detects whether a character is an
>>>> Asian character (I assume any char > 0x800 is one) and tries to
>>>> break the sentence when it cannot be rendered as a single word.
>>>> Please test it, thanks.
>>>>         
>>> definitely not correct; not all asian languages or all languages
>>> using characters past 0x800 are splittable at any point! you'll have
>>> to special-case it much more if you want a patch like this to be 
>>> accepted. i don't know the correct splitting algorithms so you'll
>>> have to do it yourself but i imagine it involves a database of all
>>> words for chinese and japanese.
>>>       
>> This is not feasible. There is no available (or, possibly, even
>> *existing*) database of "all valid Japanese words"; edict isn't a bad
>> start, but from what I can tell it is decidedly far from complete, and
>> has considerable warts (judging by how many are reported and fixed on a
>> regular basis). For that matter, as far as I'm aware there is not
>> actually the concept of "word" in Japanese; certainly it is possible to
>> insert a line break in between any two characters in a Japanese sentence
>> without problems (I see it all the time in, say, video games).
>>
>> Yes, a better way than the simple assumption above of determining
>> whether or not a given character is "Asian" (that is, can have a line
>> break after it regardless of context) is needed - but a "database of all
>> words" is not an available solution.
>>     
>
> Well if it's considered ok to break Japanese words between lines, then
> I suppose you can insert breaks between any Japanese characters.
> However: Unicode does not distinguish between Chinese/Japanese/Korean
> characters, so is the same permissible in all three?? Anyway it's
> definitely not acceptable to do the splitting in _any_ asian language.
> Like I said in Tibetan multiple characters make up a syllable and may
> not be split, and naturally you can never split a combining character
> (present in Tibetan, Thai, and probably many south asian languages)
> from the character(s) it combines with!
>
> Perhaps the original proposal would be ok if you s/asian/CJK/. I just
> don't know. But it's definitely not acceptable doing this for all
> chars > 0x800!
>
>   
This is a function I wrote for another library that uses CJK code blocks 
to check for permissible line breaks.  Perhaps it can be used as a 
reference:

// Returns non-zero value if character allows line-break after it
int is_linebreak(unsigned int ucs)
{
  /* Space or tab */
  if (ucs == ' ' || ucs == '\t')  return 1;
 
  /* U+2E80..U+2EFF: CJK Radical Supplement */
  /* U+2F00..U+2FDF: Kangxi Radicals */
  /* U+2FF0..U+2FFF: Ideographic Description Characters */
  /* U+3000..U+303F: CJK Symbols and Punctuation */
  /* U+3040..U+309F: Hiragana */
  /* U+30A0..U+30FF: Katakana */
  /* U+3100..U+312F: Bopomofo */
  /* U+3130..U+318F: Hangul Compatibility Jamo */
  /* U+3190..U+319F: Kanbun */
  /* U+31A0..U+31BF: Bopomofo Extended */
  /* U+31C0..U+31EF: CJK Strokes */
  /* U+31F0..U+31FF: Katakana Phonetic Extensions */
  /* U+3200..U+32FF: Enclosed CJK Letters and Months */
  /* U+3300..U+33FF: CJK Compatibility */
  /* U+3400..U+4DB5: CJK Ideographs Extension A */
  if (ucs >= 0x2e80 && ucs <= 0x4db5)  return 1;

  /* U+4E00..U+9FBB: CJK Ideographs */
  if (ucs >= 0x4e00 && ucs <= 0x9fbb)  return 1;

  /* U+A000..U+A48F: Yi Syllables */
  /* U+A490..U+A4CF: Yi Radicals */
  if (ucs >= 0xa000 && ucs <= 0xa4cf)  return 1;

  /* U+F900..U+FAFF: CJK Compatibility Ideographs */
  if (ucs >= 0xf900 && ucs <= 0xfaff)  return 1;

  /* U+FE30..U+FE4F: CJK Compatibility Forms */
  /* U+FE50..U+FE6F: Small Form Variants */
  if (ucs >= 0xfe30 && ucs <= 0xfe6f)  return 1;

  /* U+FF00..U+FFEF: Halfwidth and Fullwidth Forms (subset only) */
  if (ucs == 0xff0c || ucs == 0xff0e || ucs == 0xff1a || ucs == 0xff1b ||
    (ucs >= 0xff60 && ucs <= 0xffdf))  return 1;

  return 0;
}
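
For illustration, here is a rough sketch of how a wrapping routine might
call it. This is not taken from the patch or from MPlayer: the helper name
first_line_length() is made up, and it counts characters where a real
subtitle renderer would measure glyph widths in pixels. The copy of
is_linebreak() is condensed from the function above so the snippet
compiles on its own.

```c
#include <stddef.h>

/* Condensed copy of is_linebreak() above, with the same ranges. */
static int is_linebreak(unsigned int ucs)
{
  if (ucs == ' ' || ucs == '\t')  return 1;
  if (ucs >= 0x2e80 && ucs <= 0x4db5)  return 1;
  if (ucs >= 0x4e00 && ucs <= 0x9fbb)  return 1;
  if (ucs >= 0xa000 && ucs <= 0xa4cf)  return 1;
  if (ucs >= 0xf900 && ucs <= 0xfaff)  return 1;
  if (ucs >= 0xfe30 && ucs <= 0xfe6f)  return 1;
  if (ucs == 0xff0c || ucs == 0xff0e || ucs == 0xff1a || ucs == 0xff1b ||
    (ucs >= 0xff60 && ucs <= 0xffdf))  return 1;
  return 0;
}

/* Hypothetical helper (not part of the patch): returns how many
 * characters of text[0..len) go on the first line when at most
 * max_chars fit.  Callers should pass max_chars >= 1. */
static size_t first_line_length(const unsigned int *text, size_t len,
                                size_t max_chars)
{
  size_t n;

  /* Everything fits, no break needed */
  if (len <= max_chars)  return len;

  /* Walk back to the last character that allows a break after it */
  for (n = max_chars; n > 0; n--)
    if (is_linebreak(text[n - 1]))  return n;

  /* No permissible break found: fall back to a hard cut */
  return max_chars;
}
```

With this, a CJK sentence is cut at the width limit, while Latin text
still backs up to the previous space. It only looks at the character
before the break, so it does nothing about a combining character that
follows; that case would need a separate check.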



