Bug #34259 for Encode: Utf8 flag on after decoding 100% ASCII data

Wed Mar 19 17:29:36 2008 MSCHILLI [...] cpan.org - Ticket created

Subject:

Utf8 flag on after decoding 100% ASCII data

Hi Dan,

thanks for Encode, it's a great module! My colleague Richard Russo has found a case where Encode decodes 100% ASCII data and subsequently sets the utf8 flag:

    my $string = "191501885";

    my $id = decode_utf8( $string );
    print "$id " , Encode::is_utf8($id), "\n";

    $id = decode ( "utf8", $string );
    print "$id " , Encode::is_utf8($id), "\n";

yields

    191501885 1
    191501885 1

while according to the documentation, strings that are 100% ascii shouldn't have the utf8 flag on after they're utf8-decoded. Note that the string contains a 100% ASCII string and not a number.

Would be great if you could take a look -- thanks!

-- Mike

Wed May 07 16:24:19 2008 DANKOGAI [...] cpan.org - Correspondence added

I consider the behavior natural. Consider the case below.

    while(<>){
        my $utf8 = decode_utf8($_);
        # ....
    }

The subsequent code must be written conditionally if decode_utf8 conditionally sets the flag.

Dan the Encode Maintainer

Wed May 07 16:24:22 2008 The RT System itself - Status changed from 'new' to 'open'

Wed May 07 16:24:23 2008 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Wed May 14 23:31:57 2008 cpan [...] robm.fastmail.fm - Correspondence added

> I consider the behavior natural. Consider the case below.
>
>     while(<>){
>         my $utf8 = decode_utf8($_);
>         # ....
>     }
>
> The subsequent code must be written conditionally if decode_utf8
> conditionally sets the flag.

Sorry to be a pain, but this is complete garbage! The Encode documentation even goes into great detail to explain this. Look at the "The UTF8 flag" section:

http://search.cpan.org/~dankogai/Encode-2.25/Encode.pm

---
Goal #1: Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.

Goal #2: Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.

Goal #3: Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.

...

# When you decode, the resulting UTF8 flag is on unless you can unambiguously represent data. Here is the definition of dis-ambiguity.

After $utf8 = decode('foo', $octet);,

    When $octet is...                 The UTF8 flag in $utf8 is
    ---------------------------------------------
    In ASCII only (or EBCDIC only)    OFF
    ..

As you see, there is one exception, In ASCII. That way you can assume Goal #1. And with Encode Goal #2 is assumed but you still have to be careful in such cases mentioned in CAVEAT paragraphs.
---

So this bug is actually a bug. The data is ASCII only, so the UTF8 should be OFF after the decode.

And fixing this bug should NOT be changing the documentation. This slows down the case where there is ASCII only data.

Wed May 14 23:31:58 2008 The RT System itself - Status changed from 'resolved' to 'open'

Wed May 14 23:45:53 2008 cpan [...] robm.fastmail.fm - Correspondence added

> I consider the behavior natural. Consider the case below.
>
>     while(<>){
>         my $utf8 = decode_utf8($_);
>         # ....
>     }
>
> The subsequent code must be written conditionally if decode_utf8
> conditionally sets the flag.

I might add that this is not true either, there should be no need for conditional code at all, the whole point is that programs don't need to look at the UTF8 flag.

If you have a string that's pure ASCII and has the UTF8 flag OFF, then you can at any time join it with a string with the UTF8 flag that is ON, and it will be promoted just fine.

    my $asciistr = "hello";                    # UTF8 flag OFF
    my $utf8octets1 = "hello";                 # UTF8 flag OFF
    my $utf8octets2 = "\342\230\272";          # UTF8 flag OFF

    my $perlstr1 = decode_utf8($utf8octets1);  # UTF8 flag OFF
    my $perlstr2 = decode_utf8($utf8octets2);  # UTF8 flag ON

    my $perlstr3 = "\x{263a}";                 # UTF8 flag ON

    my $result1 = $perlstr1 . $perlstr2;       # UTF8 flag ON
    my $result2 = $perlstr1 . $asciistr;       # UTF8 flag OFF

All works just fine.

The point is that if you work with data, even if it's incoming utf-8 data that you decode_utf8() to create a "perl string", then if that data was only ASCII data, it's a perl string with the UTF8 flag OFF and you get all the "fast" performance of octets.

Only if you use non-ASCII chars do you actually pay the performance cost of perl strings with the UTF8 flag being on.
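The claim that code never needs to branch on the flag can be checked with core functions alone. The following is a minimal sketch, not taken from the ticket: it assumes that `utf8::upgrade` is an acceptable way to simulate what a decode step does to the internal representation, and all variable names are illustrative.

```perl
use strict;
use warnings;

# Two copies of the same ASCII text; only the internal representation differs.
my $plain   = "hello";
my $flagged = "hello";
utf8::upgrade($flagged);      # core function: switches the internal UTF8 flag on

my $smiley = "\x{263a}";      # a character above 0xFF, so its flag is on

# Concatenation promotes the flag-off copy as needed.
my $r1 = $plain   . $smiley;
my $r2 = $flagged . $smiley;

printf "flags: plain=%d flagged=%d\n",
    utf8::is_utf8($plain)   ? 1 : 0,
    utf8::is_utf8($flagged) ? 1 : 0;
print "source strings compare equal\n" if $plain eq $flagged;
print "concatenations compare equal\n" if $r1 eq $r2;
```

Both copies behave identically under `eq`, `length`, and concatenation, which is why the rest of the thread disagrees only about performance and documentation, not about string semantics.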

Tue Jul 01 16:09:09 2008 DANKOGAI [...] cpan.org - Correspondence added

That one is tough to cope with because in encode (whatever -> utf8), the transcoder is set so that it only complains about the first byte that is malformed, while decode (utf8 -> whatever) complains about the whole unicode. The problem is that the transcoder is shared with other encodings so fixing this may break other encodings.

I'll leave this ticket open till I come up with something better.

Dan the Encode Maintainer

Thu Aug 07 19:54:53 2008 cpan [...] robm.fastmail.fm - Correspondence added

On Tue Jul 01 16:09:09 2008, DANKOGAI wrote:

> That one is tough to cope with because in encode (whatever -> utf8),
> transcoder is set so that it only complains the first byte that is
> malformed while decode (utf8 -> whatever) complains the whole unicode.
> The problem is that the transcoder is shared with other encodings so
> fixing this may break other encodings.

I would have thought the solution is to keep some "found_non_ascii" (default 0) kind of flag in the transcoder when converting whatever -> utf8. If during the conversion you find a non-ascii output char (eg codepoint >= 0x80), you set the flag. At the end of the conversion, you set the perl utf-8 flag on the string to on/off based on the "found_non_ascii" flag?

Of course, I don't know the code, so I might be speaking rubbish...

Rob
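The "found_non_ascii" idea can be sketched in pure Perl. This is only an illustration under stated assumptions: the real transcoder is C code shared between encodings, `decode_latin1_tracking` is a hypothetical name, and ISO-8859-1 was picked because its bytes map one-to-one to code points.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a transcoder: decodes ISO-8859-1 octets and
# remembers whether any output code point was >= 0x80.
sub decode_latin1_tracking {
    my ($octets) = @_;
    my $found_non_ascii = 0;              # default 0, as in the proposal
    my $out = '';
    for my $byte (unpack 'C*', $octets) {
        $found_non_ascii = 1 if $byte >= 0x80;
        $out .= chr $byte;                # Latin-1 bytes map 1:1 to code points
    }
    # At the end of the conversion, set the internal flag from what was seen.
    if ($found_non_ascii) { utf8::upgrade($out) }
    else                  { utf8::downgrade($out) }
    return $out;
}

for my $input ("blah", "caf\xe9") {
    my $str = decode_latin1_tracking($input);
    printf "%d chars, flag %s\n", length $str, utf8::is_utf8($str) ? 'on' : 'off';
}
```

The tracking costs one comparison per byte during a pass that happens anyway, which is the appeal of doing it inside the conversion rather than in a separate scan afterwards.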

Wed Jan 21 17:22:03 2009 DANKOGAI [...] cpan.org - Correspondence added

Document added in 2.27. See also #41163.

Dan the Encode Maintainer

Wed Jan 21 17:22:31 2009 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Jan 21 20:06:04 2009 cpan [...] robm.fastmail.fm - Correspondence added

On Wed Jan 21 17:22:03 2009, DANKOGAI wrote:

> Document added in 2.27. See also #41163.
>
> Dan the Encode Maintainer

This hasn't been resolved at all.

Doing a diff -ru between 2.26 and 2.27 shows nothing changed in the documentation about this problem. In fact there's no mention of bug 34259 anywhere.

Worse, the documentation for Encode still clearly states that decoding ASCII only data will return a perl string with the utf-8 flag OFF. Read this section:

http://search.cpan.org/~dankogai/Encode/Encode.pm#The_UTF8_flag

But when you test it, it clearly still doesn't do what it says:

    $ perl -le 'use Encode; print $Encode::VERSION; print Encode::is_utf8(decode_utf8("blah"));'
    2.27
    1

Wed Jan 21 20:06:05 2009 The RT System itself - Status changed from 'resolved' to 'open'

Mon Mar 29 03:39:19 2010 bryce2 [...] obviously.com - Correspondence added


I just spent hours on this. As of Perl 5.10.1, this bug is still present:

<code>
perl -le 'use Encode; print $Encode::VERSION; print Encode::is_utf8(decode_utf8("blah"));'
2.23
1
</code>

Thu Jul 07 01:15:48 2011 DANKOGAI [...] cpan.org - Correspondence added

Looks like you just forgot to "use utf8". For compatibility's sake, Perl takes all scripts written in ISO-8859-1 unless you say "use utf8".

perldoc perluniintro for details

Dan the Encode Maintainer

Thu Jul 07 01:15:52 2011 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Thu Jul 07 10:41:55 2011 MSCHILLI [...] cpan.org - Correspondence added

On Thu Jul 07 01:15:48 2011, DANKOGAI wrote:

> Looks like you just forgot to "use utf8". For compatibility's sake,
> Perl takes all scripts written in ISO-8859-1 unless you say "use utf8".

Sorry, this doesn't make any sense in this context. "use utf8" is irrelevant if your program uses plain ASCII strings, as the snippets of code presented in this bug all do. The problem is that Encode isn't behaving according to its documentation.

Thu Jul 07 10:41:56 2011 The RT System itself - Status changed from 'resolved' to 'open'

Sat Nov 12 17:32:33 2011 chansen [...] cpan.org - Correspondence added

There are three possible solutions to this:

1) Change all encodings to keep track of whether or not a code point above U+007F has been decoded and SvUTF8_(?on|off) accordingly
2) Change Encode::decode() to scan decoded strings for code points above U+007F and SvUTF8_off if no code points are above U+007F
3) Change documentation

1 or 2 isn't beneficial to me since most of my data contain Basic Latin and Latin-1 Supplement characters (Swedish), with occasional Miscellaneous Symbols and General Punctuation.

The question is if it's worth the overhead, even English texts make more and more use of General Punctuation.

--
chansen
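Option 2 can be approximated in user code today, which also makes its cost visible. The wrapper below is a sketch rather than a proposed patch: `decode_ascii_off` is a hypothetical name, and it relies only on `Encode::decode` and the core `utf8::downgrade` function.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper illustrating option 2; not part of Encode.
sub decode_ascii_off {
    my ($encoding, $octets) = @_;
    my $str = decode($encoding, $octets);
    # Scan for code points above U+007F; if there are none, clear the flag.
    utf8::downgrade($str) unless $str =~ /[^\x00-\x7F]/;
    return $str;
}

for my $octets ("blah", "\342\230\272") {
    my $str = decode_ascii_off('UTF-8', $octets);
    printf "%d chars, flag %s\n", length $str, utf8::is_utf8($str) ? 'on' : 'off';
}
```

The regex match is a second pass over every decoded string, including the ones that turn out to contain non-ASCII characters, which is the overhead the question above is weighing.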

Sat Nov 12 21:32:47 2011 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

RT-Send-CC:

chansen [...] cpan.org

On Sat Nov 12 17:32:33 2011, CHANSEN wrote:

> There are three possible solutions to this:
>
> 1) Change all encodings to keep track of whether or not a code point
>    above U+007F has been decoded and SvUTF8_(?on|off) accordingly
> 2) Change Encode::decode() to scan decoded strings for code points
>    above U+007F and SvUTF8_off if no code points are above U+007F
> 3) Change documentation

I would recommend changing the documentation and possibly removing the whole section on the UTF8 flag altogether.

The UTF8 flag is internal to Perl, or at least it is meant to be. All this discussion about it has led to much misunderstanding and chagrin over the years. The way it’s supposed to be is: A string is a string is a string. Previously the max char was 255. Now it’s higher.

> 1 or 2 isn't beneficial to me since most of my data contain Basic
> Latin and Latin-1 Supplement characters (Swedish), with occasional
> Miscellaneous Symbols and General Punctuation.
>
> The question is if it's worth the overhead, even English texts makes
> more and more use of General Punctuation.
>
> --
> chansen

Thu Jan 03 10:28:20 2013 victor [...] vsespb.ru - Correspondence added

Hm. Not sure here.

Modules like MIME::Base64 and Digest::SHA (newer versions) die with an error if they see a string with the utf8 bit set (that looks correct, as those functions are defined only for bytes, not for characters).

People obviously need to control the utf8 bit.

Wed Aug 14 11:43:52 2013 victor [...] vsespb.ru - Correspondence added


> Modules like MIME::Base64 and Digest::SHA (newer versions) die with error if see a string with utf8 bit set.

Ignore this, this is just wrong.
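The retraction above can be checked against MIME::Base64 directly. This is a small sketch with illustrative variable names; it assumes, as reported throughout this ticket, that `decode_utf8` leaves the flag on for ASCII input.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use MIME::Base64 qw(encode_base64);

# ASCII text that has been through decode_utf8 (flag on, per this ticket).
my $flagged = decode_utf8("blah");
my $encoded = encode_base64($flagged);   # accepted: every character fits in a byte
print $encoded;

# A character above 0xFF cannot be represented as a byte, so this is rejected.
my $wide = "\x{263a}";
my $ok = eval { encode_base64($wide); 1 };
print "wide string rejected: $@" unless $ok;
```

What the module objects to is a character that does not fit in a byte, not the internal flag, so a flagged but pure-ASCII string passes through unchanged.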

Sun Aug 18 07:37:17 2013 victor [...] vsespb.ru - Correspondence added


On Thu May 15 07:31:57 2008, ROBM wrote:

> ---
> Goal #1:
>
> Old byte-oriented programs should not spontaneously break on the old
> byte-oriented data they used to work on.

> As you see, there is one exception, In ASCII. That way you can assume
> Goal #1. And with Encode Goal #2 is assumed but you still have to be
> careful in such cases mentioned in CAVEAT paragraphs.

> So this bug is actually a bug. The data is ASCII only, so the UTF8
> should be OFF after the decode.
>
> And fixing this bug should NOT be changing the documentation. This slows
> down the case where there is ASCII only data.

I think the point about Goal #1 is invalid here. ASCII data can get the utf-8 flag, for example, when splitting a non-ASCII string (with the flag on) into ASCII and non-ASCII parts: the ASCII part will have the utf8 bit on.

Also, "old byte-oriented" programs never deal with decode() or with any Unicode data, so they are not affected.

So, IMHO the documentation about the ASCII flag behaviour should be dropped (however, it would be better to add a notice to CAVEATS).
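The splitting case is easy to reproduce without Encode at all. A minimal sketch using only core functions (variable names are illustrative):

```perl
use strict;
use warnings;

my $mixed = "abc\x{263a}";               # contains a wide character, so flag on
my $ascii = substr($mixed, 0, 3);        # "abc", pure ASCII
my ($head) = split /\x{263a}/, $mixed;   # also "abc"

printf "substr: '%s' flag=%d\n", $ascii, utf8::is_utf8($ascii) ? 1 : 0;
printf "split:  '%s' flag=%d\n", $head,  utf8::is_utf8($head)  ? 1 : 0;
```

Pieces cut from a flagged string inherit its internal representation, so pure-ASCII strings with the flag on arise in ordinary programs whatever decode() does.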

Sat Mar 04 15:06:43 2017 pali [...] cpan.org - Cc PALI added

Sat Mar 04 15:13:51 2017 pali [...] cpan.org - Correspondence added

On Sat Nov 12 21:32:47 2011, SPROUT wrote:

> On Sat Nov 12 17:32:33 2011, CHANSEN wrote:
> > There are three possible solutions to this:
> >
> > 1) Change all encodings to keep track of whether or not a code point
> >    above U+007F has been decoded and SvUTF8_(?on|off) accordingly
> > 2) Change Encode::decode() to scan decoded strings for code points
> >    above U+007F and SvUTF8_off if no code points are above U+007F
> > 3) Change documentation
>
> I would recommend changing the documentation and possibly removing the
> whole section on the UTF8 flag altogether.

I'm for removing documentation for whole section "The UTF8 flag". Just "This UTF8 flag is not visible in Perl scripts..." does not have to be removed and could be moved to section "Messing with Perl's Internals".

> The UTF8 flag is internal to Perl, or at least it is meant to be. All
> this discussion about it has led to much misunderstanding and chagrin
> over the years. The way it’s supposed to be is: A string is a string
> is a string. Previously the max char was 255. Now it’s higher.

+1

Sat Sep 23 15:18:17 2017 will [...] summercat.com - Correspondence added

Subject:	[rt.cpan.org #34259] +1 on retaining current behaviour
Date:	Sat, 23 Sep 2017 12:17:51 -0700
To:	bug-Encode [...] rt.cpan.org
From:	Will Storey <will [...] summercat.com>

I think the current behaviour makes sense. For one thing, it can help check whether you've decoded. It is just generally simpler to understand as well. I'm in favour of updating the documentation. We could remove everything in the section after "When you decode, the resulting UTF8 flag is on".

Fri Sep 29 08:41:16 2017 pali [...] cpan.org - Correspondence added

See proposed change in the attachment.

Subject:

0001-Remove-misleading-documentation-about-UTF8-flag.patch

From 1116e0fc2c195e2135e9353d90bdfcfa228e9fbf Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Fri, 29 Sep 2017 14:39:36 +0200
Subject: [PATCH] Remove misleading documentation about UTF8 flag

---
 Encode.pm | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/Encode.pm b/Encode.pm
index 6ed4a77..faf1f58 100644
--- a/Encode.pm
+++ b/Encode.pm
@@ -822,38 +822,11 @@ different kinds of strings and string-operations in Perl: one a
 byte-oriented mode for when the internal UTF8 flag is off, and the other a
 character-oriented mode for when the internal UTF8 flag is on.
 
-Here is how C<Encode> handles the UTF8 flag.
-
-=over 2
-
-=item *
-
-When you I<encode>, the resulting UTF8 flag is always B<off>.
-
-=item *
-
-When you I<decode>, the resulting UTF8 flag is B<on>--I<unless> you can
-unambiguously represent data. Here is what we mean by "unambiguously".
-After C<$str = decode("foo", $octet)>,
-
-  When $octet is...    The UTF8 flag in $str is
-  ---------------------------------------------
-  In ASCII only (or EBCDIC only)            OFF
-  In ISO-8859-1                              ON
-  In any other Encoding                      ON
-  ---------------------------------------------
-
-As you see, there is one exception: in ASCII. That way you can assume
-Goal #1. And with C<Encode>, Goal #2 is assumed but you still have to be
-careful in the cases mentioned in the B<CAVEAT> paragraphs above.
-
 This UTF8 flag is not visible in Perl scripts, exactly for the same reason you
 cannot (or rather, you I<don't have to>) see whether a scalar contains
 a string, an integer, or a floating-point number. But you can still peek
 and poke these if you will. See the next section.
 
-=back
-
 =head2 Messing with Perl's Internals
 
 The following API uses parts of Perl's internals in the current
-- 
2.11.0

Mon Oct 23 18:16:54 2017 pali [...] cpan.org - Correspondence added

On Fri Sep 29 08:41:16 2017, PALI wrote:

> See proposed change in the attachment.

Any comments?

Tue Jan 09 14:45:37 2018 pali [...] cpan.org - Fixed in 2.94 added

Tue Jan 09 14:46:14 2018 pali [...] cpan.org - Correspondence added

Patch was included in Encode 2.94.

Mon Jan 15 23:33:57 2018 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'