> Given encoding issues and the complications of implementing
> hyphenation rules for multiple languages, this is something that's
> better left to an add-on module.
True, but should we think about building in a few simple word-splitting
cases? I would really like to split words after hyphens, but beyond
that, it could get messy with non-ASCII characters. You don't want
arbitrary (non-language-sensitive) word splitting between accented Latin
characters and ASCII letters without being fully aware of the encoding
in use. You also don't want to end up accidentally splitting within a
UTF-8 multibyte character. How em and en dashes, non-breaking spaces,
soft hyphens, and spaces of various widths are represented will depend
on the encoding.

ASCII characters and text are easy enough, but what to do about anything
not ASCII? Perhaps allow splitting only between ASCII characters
(0xxxxxxx bytes) for now? That should be safe for multibyte UTF-8
characters, as every byte of a non-ASCII character starts with a 1 bit
(1xxxxxxx). I think we could safely break between ASCII characters at a
transition between a letter and a hyphen or other non-letter in the
range 0x21..0x7E (letter to non-letter, or non-letter to letter), as
well as at lower-to-upper and upper-to-lower camelCase transitions.
Would that be useful?

A dummy hook might be put in for future calling of user-supplied
hyphenation routines for various encodings and languages, or we could
just mark the spot in the code for now.

For English, at least, a minimum of two characters must be left on each
line, and we need to be careful not to split something like O'Mallory
into O'-Mallory, or to treat "Ma" as camelCase and split it as
O'M-allory.
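
To make the proposal concrete, here is a rough sketch of the ASCII-only rule, written in Python purely for illustration. The names (find_break_points, char_count, and so on) are made up for this example and are not from any existing module. It assumes the text is UTF-8 when counting characters. Two choices are my own guesses at reasonable behavior rather than anything settled above: it never breaks before a hyphen (only after), and it allows an upper-to-lower break only at the end of a run of two or more capitals, which is what keeps "Ma" in O'Mallory from being treated as camelCase.

```python
from typing import List

APOSTROPHE = ord("'")
HYPHEN = ord("-")


def is_printable_ascii(b: int) -> bool:
    # 0x21..0x7E: printable ASCII, excluding space and control characters
    return 0x21 <= b <= 0x7E


def is_upper(b: int) -> bool:
    return 0x41 <= b <= 0x5A


def is_lower(b: int) -> bool:
    return 0x61 <= b <= 0x7A


def is_letter(b: int) -> bool:
    return is_upper(b) or is_lower(b)


def char_count(data: bytes) -> int:
    # Assumes UTF-8: count every byte that is not a continuation byte (10xxxxxx)
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def find_break_points(word: bytes, min_chars: int = 2) -> List[int]:
    """Return byte offsets i where a break between word[i-1] and word[i] is allowed."""
    points = []
    for i in range(1, len(word)):
        prev, nxt = word[i - 1], word[i]
        # Only break between two ASCII bytes, so never inside a multibyte character
        if not (is_printable_ascii(prev) and is_printable_ascii(nxt)):
            continue
        # Leave apostrophes alone (O'Mallory), and break after a hyphen, not before it
        if APOSTROPHE in (prev, nxt) or nxt == HYPHEN:
            continue
        # Keep at least min_chars characters on each side of the break
        if char_count(word[:i]) < min_chars or char_count(word[i:]) < min_chars:
            continue
        if is_letter(prev) != is_letter(nxt):
            points.append(i)  # letter <-> non-letter transition
        elif is_lower(prev) and is_upper(nxt):
            points.append(i)  # camel|Case
        elif is_upper(prev) and is_lower(nxt) and is_upper(word[i - 2]):
            points.append(i)  # HTML|parser: end of a run of capitals
    return points
```

With these rules, find_break_points(b"well-known") allows a single break after the hyphen, b"camelCase" breaks before the capital C, and b"O'Mallory" gets no break points at all. A word such as "café-bar" encoded as UTF-8 still breaks only after the hyphen, because the two bytes of the accented character are skipped.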
I'd sure like to get some other people participating in this discussion,
to get some more viewpoints and algorithm experience. Perhaps we should
just go ahead with starting the add-on module with the above simple
algorithm, and flesh it out over time?
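
As a starting point for that add-on, the dummy hook could be as small as a lookup table of user-supplied routines with a built-in fallback. This is only a sketch, in Python for illustration. register_hyphenator, break_points, and split_after_hyphens are hypothetical names, and it assumes words arrive as already-decoded strings, which sidesteps the encoding question rather than answering it.

```python
from typing import Callable, Dict, List

# A hyphenator takes a word and returns the character offsets where a break is allowed.
Hyphenator = Callable[[str], List[int]]

# Hypothetical registry of user-supplied routines, keyed by language tag
_hyphenators: Dict[str, Hyphenator] = {}


def register_hyphenator(language: str, func: Hyphenator) -> None:
    """Install a user-supplied hyphenation routine for one language."""
    _hyphenators[language.lower()] = func


def split_after_hyphens(word: str) -> List[int]:
    """Built-in fallback: break after an ASCII hyphen, keeping two characters on each side."""
    return [
        i + 1
        for i, ch in enumerate(word)
        if ch == "-" and i + 1 >= 2 and len(word) - (i + 1) >= 2
    ]


def break_points(word: str, language: str = "") -> List[int]:
    """Call the registered routine for the language, or fall back to the simple rule."""
    func = _hyphenators.get(language.lower(), split_after_hyphens)
    return func(word)
```

Until someone registers a routine for a given language, break_points("well-known", "en") just uses the fallback and allows a break after the hyphen. A language-aware routine can be dropped in later with register_hyphenator without touching the calling code.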