Bug #84661 for Unicode-LineBreak: Object only accepts unicode strings

Tue Apr 16 06:25:36 2013 MARKOV [...] cpan.org - Ticket created

Subject:

Object only accepts unicode strings

Handling strings correctly is difficult in Perl. Sometimes we need to know the internals. The official rule is: users should not need to know whether a string has the utf8 flag on or not.

When I do this:

  my $latin1 = chr 230;
  my $gc = Unicode::GCString->new($latin1);

I get the error

  _new: Unicode string must be given. at /home/perl/5.16.2/lib/site_perl/5.16.2/x86_64-linux/Unicode/GCString.pm line 46.

So: this does not follow Perl's rule of "it just works". I can do

  my $latin1 = chr 230;
  my $gc = Unicode::GCString->new(decode latin1 => $latin1);

Now, I do not get the error. Great. So, for correctly calling your object in any circumstance, I need to call:

  my $gc = Unicode::GCString->new(is_utf8($s) ? $s : decode(latin1 => $s));

Which is inconvenient.

Please make your new() internally this smart. "man perlunicode" says that non-utf8 strings are to be interpreted as in latin1.
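The workaround described above can be tried without Unicode::GCString installed. The following is a minimal sketch using only the core Encode module; the helper name as_unicode is hypothetical and is not part of any module mentioned in this ticket.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical helper (not part of Unicode::GCString): return a string that
# carries the utf8 flag, treating unflagged input as Latin-1, exactly as the
# expression in the report does.
sub as_unicode {
    my ($s) = @_;
    return is_utf8($s) ? $s : decode(latin1 => $s);
}

my $latin1  = chr 230;              # unflagged, one character
my $unicode = as_unicode($latin1);  # flagged, still one character

printf "flag before: %d, flag after: %d, same text: %d\n",
    is_utf8($latin1)  ? 1 : 0,
    is_utf8($unicode) ? 1 : 0,
    $unicode eq $latin1 ? 1 : 0;
```

The two strings compare equal with eq because only the internal representation differs, which is the point of the request: a caller should not have to care which representation it holds.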

Tue Apr 16 10:51:24 2013 hatuka [...] nezumi.nu - Correspondence added

Mark,

On 2013-Apr-16 Tue 06:25:36, MARKOV wrote:

> Handling strings correctly is difficult in Perl. Sometimes we need
> to know the internals. The official rule is: users should not need to
> know whether a string has the utf8 flag on or not.
>
> When I do this:
>
>   my $latin1 = chr 230;
>   my $gc = Unicode::GCString->new($latin1);
>
> I get the error
>
>   _new: Unicode string must be given. at
>   /home/perl/5.16.2/lib/site_perl/5.16.2/x86_64-linux/Unicode/GCString.pm line 46.
>
> So: this does not follow Perl's rule of "it just works". I can do
>
>   my $latin1 = chr 230;
>   my $gc = Unicode::GCString->new(decode latin1 => $latin1);
>
> Now, I do not get the error. Great. So, for correctly calling your
> object in any circumstance, I need to call:
>
>   my $gc = Unicode::GCString->new(is_utf8($s) ? $s : decode(latin1 => $s));
>
> Which is inconvenient.
>
> Please make your new() internally this smart. "man perlunicode" says
> that non-utf8 strings are to be interpreted as in latin1.

IMO it is a feature.

Unicode::GCString upgrades a Unicode string, not a byte-string, to a grapheme cluster string.

In fact, why is it natural that code point 230 is the small AE? It may be the small C with circumflex for Central European users. Moreover, it may be the first byte of a Chinese character encoded in UTF-8 (not a utf8-flagged string).

Programmers, not users, must decode a byte-string to a character-string (Unicode string) properly, and then feed it to the new() method.

Regards,
--
--- nezumi
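The ambiguity raised in this reply can be illustrated with the core Encode module. This is only a sketch: the choice of ISO-8859-2 as a second single-byte encoding and of the three-octet UTF-8 sequence are illustrative assumptions, not taken from the ticket.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same octet 0xE6 (decimal 230), read under two single-byte encodings.
my $octet = "\xE6";
for my $enc (qw(iso-8859-1 iso-8859-2)) {
    printf "%-10s U+%04X\n", $enc, ord(decode($enc, $octet));
}

# As the lead byte of a three-octet UTF-8 sequence, it is only part of
# one character.
my $utf8_octets = "\xE6\x97\xA5";
my $char = decode('UTF-8', $utf8_octets);
printf "%-10s U+%04X (%d character)\n", 'UTF-8', ord($char), length($char);
```

Which interpretation is right depends on where the octets came from, which is why the maintainer wants the caller to decode them.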

Tue Apr 16 10:51:24 2013 The RT System itself - Status changed from 'new' to 'open'

Tue Apr 16 11:35:17 2013 Mark [...] Overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #84661] Object only accepts unicode strings
Date:	Tue, 16 Apr 2013 17:34:42 +0200
To:	Hatuka*nezumi - IKEDA Soji via RT <bug-Unicode-LineBreak [...] rt.cpan.org>
From:	Mark Overmeer <mark [...] overmeer.net>

* Hatuka*nezumi - IKEDA Soji via RT (bug-Unicode-LineBreak@rt.cpan.org) [130416 14:51]:

> > Please make your new() internally this smart. "man perlunicode" says
> > that non-utf8 strings are to be interpreted as in latin1.
>
> IMO It is the feature.
>
> Unicode::GCString upgrades a Unicode string, not byte-string, to
> a grapheme cluster string.
>
> In fact, why is it natural that code point 230 is the small AE?
> It may be the small C with circumflex for Central European users.
> Moreover, it may be the first byte of Chinese characters encoded
> by UTF-8 (not utf8-flagged string).

Not according to the perlunicode manual-page:

  ... implicit upgrading from byte strings to Unicode strings
  assumes that they were encoded in ISO 8859-1 (Latin-1), ...

Proof:

  perl -we 'binmode STDOUT,":encoding(utf8)";print chr(230)."\N{U+00E6}\n"'
  # the first character gets converted into utf8, because the second is
  # utf8. It uses the latin1 interpretation for the conversion.

A PV can contain three things:
  1) bytes
  2) latin1 string
  3) utf8 string

Scalars with other than latin1/utf8 content are bytes. You can easily distinguish (3). The program cannot distinguish between (1) and (2), which is a problem. I still like this idea:
  http://search.cpan.org/~juerd/BLOB

Of course, many people confuse bytes with text.
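Both points made here, the Latin-1 assumption during implicit upgrading and the impossibility of telling cases (1) and (2) apart, can be checked with core Perl alone. A small sketch follows; the variable names and the choice of U+263A as the wide character are illustrative assumptions.

```perl
use strict;
use warnings;

my $narrow = chr 230;       # stored without the utf8 flag
my $wide   = "\x{263A}";    # code point above 255, stored with the flag

# Concatenation upgrades the narrow operand, reading it as Latin-1.
my $joined = $narrow . $wide;
printf "joined: %d characters, first is U+%04X\n",
    length($joined), ord($joined);

# Cases (1) and (2): raw octets and Latin-1 text look identical to Perl.
my $octets = "\xE6";        # imagine this was read from a binary file
my $text   = chr 230;       # meant as the letter ae
printf "flags: %d %d, equal: %d\n",
    utf8::is_utf8($octets) ? 1 : 0,
    utf8::is_utf8($text)   ? 1 : 0,
    $octets eq $text ? 1 : 0;
```

Neither scalar carries the flag and the two compare equal, so no run-time test can recover the author's intent.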

> Programmers, not users, must decode byte-string to character-string
> (Unicode string) properly, and then feed it to new() method.

The official advice is: decode the data when it arrives into your program, encode it when it leaves your program. When you do that correctly, inside the program your texts are either latin1 or utf8, nothing else.

I do *not* want to know for each call into modules whether I have to apply encoding or decoding to get it to work. Your program processes text (hence not bytes), so I expect it to handle both (2) and (3). In my opinion, this does not qualify as 'leaving the program'.

When you decide to adapt your definition of text to Perl's definition, then it is one extra line of code, and your documentation gets a little shorter. It will not break existing applications of your module: it extends functionality.
--
Sorry for complaining,

               MarkOv

------------------------------------------------------------------------
Mark Overmeer MSc                          MARKOV Solutions
Mark@Overmeer.net                          solutions@overmeer.net
http://Mark.Overmeer.net                   http://solutions.overmeer.net

Wed Apr 17 10:39:46 2013 hatuka [...] nezumi.nu - Correspondence added

On Tue, 16 Apr 2013 11:35:17 -0400 "Mark Overmeer via RT" <bug-Unicode-LineBreak@rt.cpan.org> wrote:

> * Hatuka*nezumi - IKEDA Soji via RT (bug-Unicode-LineBreak@rt.cpan.org) [130416 14:51]:

> > > Please make your new() internally this smart. "man perlunicode" says
> > > that non-utf8 strings are to be interpreted as in latin1.
> >
> > IMO It is the feature.
> >
> > Unicode::GCString upgrades a Unicode string, not byte-string, to
> > a grapheme cluster string.
> >
> > In fact, why is it natural that code point 230 is the small AE?
> > It may be the small C with circumflex for Central European users.
> > Moreover, it may be the first byte of Chinese characters encoded
> > by UTF-8 (not utf8-flagged string).
>
> Not according to the perlunicode manual-page:
>
>   ... implicit upgrading from byte strings to Unicode strings
>   assumes that they were encoded in ISO 8859-1 (Latin-1), ...
>
> Proof:
>
>   perl -we 'binmode STDOUT,":encoding(utf8)";print chr(230)."\N{U+00E6}\n"'
>   # the first character gets converted into utf8, because the second is
>   # utf8. It uses the latin1 interpretation for the conversion.
>
> A PV can contain three things:
>   1) bytes
>   2) latin1 string
>   3) utf8 string
>
> Scalars with other than latin1/utf8 content are bytes. You can easily
> distinguish (3). The program cannot distinguish between (1) and (2),
> which is a problem. I still like this idea:
>   http://search.cpan.org/~juerd/BLOB
>
> Of course, many people confuse bytes with text.
>
> > Programmers, not users, must decode byte-string to character-string
> > (Unicode string) properly, and then feed it to new() method.
>
> The official advice is: decode the data when it arrives into your program,
> encode it when it leaves your program. When you do that correctly,
> inside the program your texts are either latin1 or utf8, nothing else.

I get it now. I had thought that, if this advice were followed, all text inside a program could be utf8 (Unicode strings). However, your example above shows that this is impossible: Perl 5 does not have a notation for Unicode literals with code points lower than 256. So byte-strings produced by code have to be treated as character-strings.

# IMHO, this "implicit upgrade" feature is not a benefit but a
# source of trouble for non-West-European users.

Anyway, we have to accept this feature (or lack of feature).

> I do *not* want to know for each call into modules whether I have to
> apply encoding or decoding to get it to work. Your program processes
> text (hence not bytes), so I expect it to handle both (2) and (3).
> In my opinion, this does not qualify as 'leaving the program'.
>
> When you decide to adapt your definition of text to Perl's definition,
> then it is one extra line of code, and your documentation gets a little
> shorter. It will not break existing applications of your module: it
> extends functionality.

I now understand that all that needs to be done is to convert the byte-string to a utf8-flagged string, using something such as bytes_to_utf8(). I will be able to add it to the next development release.
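The ticket does not show the change that went into the release, so the following is only a Perl-level sketch of the idea. It uses the core utf8::upgrade function in place of the C-level call named above, and the helper name normalise_arg is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical entry-point normalisation: make sure the argument carries the
# utf8 flag before it is processed any further.
sub normalise_arg {
    my ($s) = @_;
    utf8::upgrade($s);   # no-op if already flagged; otherwise read as Latin-1
    return $s;
}

my $in  = chr 230;             # unflagged input, as in the report
my $out = normalise_arg($in);  # flagged copy with the same characters

printf "flag: %d -> %d, text unchanged: %d, length: %d\n",
    utf8::is_utf8($in)  ? 1 : 0,
    utf8::is_utf8($out) ? 1 : 0,
    $in eq $out ? 1 : 0,
    length($out);
```

Because the upgrade changes only the internal representation, existing callers that already pass flagged strings see no difference, which matches the reporter's argument that the change extends functionality without breaking anything.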

> --
> Sorry for complaining,
>
>                MarkOv

No problem. Thank you for the good suggestion!

Regards.
--
--- nezumi

Tue May 14 12:52:58 2013 hatuka [...] nezumi.nu - Status changed from 'open' to 'resolved'

Tue May 14 12:52:58 2013 hatuka [...] nezumi.nu - Taken

Tue May 14 12:52:58 2013 hatuka [...] nezumi.nu - Fixed in 2013.004_26 added