On Tue, 16 Apr 2013 11:35:17 -0400
"Mark Overmeer via RT" <bug-Unicode-LineBreak@rt.cpan.org> wrote:
> * Hatuka*nezumi - IKEDA Soji via RT (bug-Unicode-LineBreak@rt.cpan.org) [130416 14:51]:
> > > Please make your new() internally this smart. "man perlunicode" says
> > > that non-utf8 strings
> > > are to be interpreted as in latin1.
> >
> > IMO It is the feature.
> >
> > Unicode::GCString upgrades a Unicode string, not byte-string, to
> > a grapheme cluster string.
> >
> > In fact, why is it natural that code point 230 is the small AE?
> > > It may be the small C with circumflex for Central European users.
> > Moreover, it may be the first byte of Chinese characters encoded
> > by UTF-8 (not utf8-flagged string).
>
> Not according to the perlunicode manual-page:
>
> ... implicit upgrading from byte strings to Unicode strings
> assumes that they were encoded in ISO 8859-1 (Latin-1), ...
>
> Proof:
>
> perl -we 'binmode STDOUT,":encoding(utf8)";print chr(230)."\N{U+00E6}\n"'
> # the first character gets converted into utf8, because the second is
> # utf8. It uses the latin1 interpretation for the conversion.
>
>
> A PV can contain three things:
> 1) bytes
> 2) latin1 string
> 3) utf8 string
>
> Scalars with other than latin1/utf8 content are bytes. You can easily
> distinguish (3). The program cannot distinguish between (1) and (2),
> which is a problem. I still like this idea:
>
>    http://search.cpan.org/~juerd/BLOB
>
> Of course, many people confuse bytes with text.
>
> > Programmers, not users, must decode byte-strings into character-strings
> > (Unicode strings) properly, and then feed them to the new() method.
>
> The official advice is: decode the data when it arrives into your program,
> encode it when it leaves your program. When you do that correctly,
> inside the program your texts are either latin1 or utf8, nothing else.
I got it.
I had thought that, if that advice were followed, every text inside the
program could be utf8 (a Unicode string). However, your example above
shows that this is not possible: Perl 5 does not mark strings built from
code points lower than 256, such as chr(230), as Unicode. So byte-strings
given by code have to be treated as character-strings.
# IMHO, this "implicit upgrade" feature is not a benefit but a source
# of trouble for non-West-European users.
Anyway, we have to accept this feature (or lack of a feature).
> I do *not* want to know for each call into modules whether I have to
> apply encoding or decoding to get it to work. Your program processes
> text (hence not bytes), so I expect it to handle both (2) and (3).
> In my opinion, this does not qualify as 'leaving the program'.
>
> When you decide to adapt your definition of text to Perl's definition,
> then it is one extra line of code, and your documentation gets a little
> shorter. It will not break existing applications of your module: it
> extends functionality.
I now understand that all there is to do is to convert a byte-string
to a utf8-flagged string, using something such as bytes_to_utf8().
I will be able to add it to the next development release.
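
For reference, a rough Perl-level sketch of that conversion. It uses the
core utf8::upgrade() function, which reads non-flagged bytes as Latin-1;
the helper name to_character_string() is only an illustration and is not
part of Unicode::LineBreak. How the module will actually do this (in XS
or in Perl) is not settled by this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: return a copy of the
# argument that is guaranteed to be utf8-flagged.
sub to_character_string {
    my ($str) = @_;          # copy, so the caller's scalar is untouched
    # utf8::upgrade() converts in place and interprets non-flagged bytes
    # as Latin-1; it does nothing to a string that is already flagged.
    utf8::upgrade($str);
    return $str;
}

my $bytes = chr(230);        # not utf8-flagged
my $text  = to_character_string($bytes);

printf "flagged: %s, length: %d, code point: U+%04X\n",
    (utf8::is_utf8($text) ? 'yes' : 'no'), length($text), ord($text);
# prints "flagged: yes, length: 1, code point: U+00E6"
```

The string keeps the same length and the same code point; only the
internal representation changes.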
> --
> Sorry for complaining,
>
> MarkOv
No problem. Thank you for the good suggestion!
Regards.
--
--- nezumi