
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 34259
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: MSCHILLI [...] cpan.org
Cc: pali [...] cpan.org
AdminCc:

Bug Information
Severity: Important
Broken in: 2.24
Fixed in: 2.94



Subject: Utf8 flag on after decoding 100% ASCII data
Hi Dan,

thanks for Encode, it's a great module! My colleague Richard Russo has found a case where Encode decodes 100% ASCII data and subsequently sets the utf8 flag:

    my $string = "191501885";

    my $id = decode_utf8( $string );
    print "$id " , Encode::is_utf8($id), "\n";

    $id = decode ( "utf8", $string );
    print "$id " , Encode::is_utf8($id), "\n";

yields

    191501885 1
    191501885 1

while according to the documentation, strings that are 100% ascii shouldn't have the utf8 flag on after they're utf8-decoded. Note that the string contains a 100% ASCII string and not a number.

Would be great if you could take a look -- thanks!

-- Mike
I consider the behavior natural. Consider the case below.

    while(<>){
        my $utf8 = decode_utf8($_);
        # ....
    }

The subsequent code must be written conditionally if decode_utf8 conditionally sets the flag.

Dan the Encode Maintainer

(In reply to MSCHILLI, Wed Mar 19 17:29:36 2008.)
> I consider the behavior natural. Consider the case below.
>
> while(<>){
>     my $utf8 = decode_utf8($_);
>     # ....
> }
>
> The subsequent code must be written conditionally if decode_utf8
> conditionally sets the flag.
Sorry to be a pain, but this is complete garbage! The Encode documentation even goes into great detail to explain this. Look at the "The UTF8 flag" section:

http://search.cpan.org/~dankogai/Encode-2.25/Encode.pm

---
Goal #1:

Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.

Goal #2:

Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.

Goal #3:

Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.

...

# When you decode, the resulting UTF8 flag is on unless you can unambiguously represent data. Here is the definition of dis-ambiguity.

After $utf8 = decode('foo', $octet);,

    When $octet is...    The UTF8 flag in $utf8 is
    ---------------------------------------------
    In ASCII only (or EBCDIC only)            OFF
    ..

As you see, there is one exception, In ASCII. That way you can assume Goal #1. And with Encode Goal #2 is assumed but you still have to be careful in such cases mentioned in CAVEAT paragraphs.
---

So this bug is actually a bug. The data is ASCII only, so the UTF8 flag should be OFF after the decode.

And fixing this bug should NOT be changing the documentation. The current behaviour slows down the case where there is ASCII only data.
> I consider the behavior natural. Consider the case below.
>
> while(<>){
>     my $utf8 = decode_utf8($_);
>     # ....
> }
>
> The subsequent code must be written conditionally if decode_utf8
> conditionally sets the flag.
I might add that this is not true either: there should be no need for conditional code at all. The whole point is that programs don't need to look at the UTF8 flag.

If you have a string that's pure ASCII and has the UTF8 flag OFF, then you can at any time join it with a string that has the UTF8 flag ON, and it will be promoted just fine.

    my $asciistr = "hello"; # UTF8 flag OFF
    my $utf8octets1 = "hello"; # UTF8 flag OFF
    my $utf8octets2 = "\342\230\272"; # UTF8 flag OFF

    my $perlstr1 = decode_utf8($utf8octets1); # UTF8 flag OFF (per the documentation)
    my $perlstr2 = decode_utf8($utf8octets2); # UTF8 flag ON

    my $perlstr3 = "\x{263a}"; # UTF8 flag ON

    my $result1 = $perlstr1 . $perlstr2; # UTF8 flag ON
    my $result2 = $perlstr1 . $asciistr; # UTF8 flag OFF (per the documentation)

All works just fine.

The point is that if you work with data, even if it's incoming utf-8 data that you decode_utf8() to create a "perl string", then if that data was only ASCII data, it's a perl string with the UTF8 flag OFF and you get all the "fast" performance of octets.

Only if you use non-ASCII chars do you actually pay the performance cost of perl strings with the UTF8 flag being on.
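The claim that string operations give the same results whatever the flag says can be checked with a small sketch. It does not call Encode at all: utf8::upgrade is used here as a stand-in for "a decode that returned flagged ASCII", which is an assumption made for illustration, and all variable names are invented.

```perl
use strict;
use warnings;

# Two copies of the same ASCII text that differ only in internal representation.
my $plain   = "hello";           # UTF8 flag off
my $flagged = "hello";
utf8::upgrade($flagged);         # UTF8 flag on, same characters

my $smiley = "\x{263a}";         # U+263A needs the flag on

# Comparison and length look at characters, not at the flag.
my $same_text   = ($plain eq $flagged)                 ? 1 : 0;
my $same_length = (length($plain) == length($flagged)) ? 1 : 0;

# Concatenating with a wide-character string gives equal results either way.
my $joined_plain   = $plain   . $smiley;
my $joined_flagged = $flagged . $smiley;
my $same_joined    = ($joined_plain eq $joined_flagged) ? 1 : 0;

printf "flags: plain=%d flagged=%d\n",
    utf8::is_utf8($plain) ? 1 : 0, utf8::is_utf8($flagged) ? 1 : 0;
print "same text=$same_text length=$same_length joined=$same_joined\n";
```

The sketch only shows that results are equal; it says nothing about the relative speed of the two representations, which is the other half of the argument in this comment.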
That one is tough to cope with because in encode (whatever -> utf8), the transcoder is set so that it only complains about the first byte that is malformed, while decode (utf8 -> whatever) complains about the whole unicode. The problem is that the transcoder is shared with other encodings, so fixing this may break other encodings.

I'll leave this ticket open till I come up with something better.

Dan the Encode Maintainer

(In reply to ROBM, Wed May 14 23:45:53 2008.)
On Tue Jul 01 16:09:09 2008, DANKOGAI wrote:
> That one is tough to cope with because in encode (whatever -> utf8),
> transcoder is set so that it only complains the first byte that is
> malformed while decode (utf8 -> whatever) complains the whole unicode.
> The problem is that the transcoder is shared with other encodings so
> fixing this may break other encodings.
I would have thought the solution is to keep some "found_non_ascii" (default 0) kind of flag in the transcoder when converting whatever -> utf8.

If during the conversion you find a non-ascii output char (eg codepoint >=0x80), you set the flag. At the end of the conversion, you set the perl utf-8 flag on the string to on/off based on the "found_non_ascii" flag?

Of course, I don't know the code, so I might be speaking rubbish...

Rob
Document added in 2.27. See also #41163.

Dan the Encode Maintainer

(In reply to MSCHILLI, Wed Mar 19 17:29:36 2008.)
On Wed Jan 21 17:22:03 2009, DANKOGAI wrote:
> Document added in 2.27. See also #41163.
>
> Dan the Encode Maintainer
This hasn't been resolved at all. Doing a diff -ru between 2.26 and 2.27 shows nothing changed in the documentation about this problem. In fact there's no mention of bug 34259 anywhere.

Worse, the documentation for Encode still clearly states that decoding ASCII only data will return a perl string with the utf-8 flag OFF. Read this section:

http://search.cpan.org/~dankogai/Encode/Encode.pm#The_UTF8_flag

But when you test it, it clearly still doesn't do what it says:

    $ perl -le 'use Encode; print $Encode::VERSION; print Encode::is_utf8(decode_utf8("blah"));'
    2.27
    1
From: bryce2 [...] obviously.com
I just spent hours on this. As of Perl 5.10.1, this bug is still present:

<code>
perl -le 'use Encode; print $Encode::VERSION; print Encode::is_utf8(decode_utf8("blah"));'
2.23
1
</code>
Looks like you just forgot to "use utf8". For compatibility's sake, Perl treats all scripts as written in ISO-8859-1 unless you say "use utf8".

See perldoc perluniintro for details.

Dan the Encode Maintainer

(In reply to MSCHILLI, Wed Mar 19 17:29:36 2008.)
On Thu Jul 07 01:15:48 2011, DANKOGAI wrote:
> Looks like you just forgot to "use utf8". For compatibility's sake,
> Perl takes all scripts written in ISO-8859-1 unless you say "use utf8".
Sorry, this doesn't make any sense in this context. "use utf8" is irrelevant if your program uses plain ASCII strings, as the snippets of code presented in this bug all do. The problem is that Encode isn't behaving according to its documentation.
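A quick way to see why the pragma cannot matter for these snippets is to build the same ASCII literal with and without it and decode both. This is only a sketch; the variable names are invented, and it makes no assumption about which flag value a given Encode release returns, only that the two cases agree.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my ($with_pragma, $without_pragma);
{
    use utf8;                    # literals in this block are read as UTF-8
    $with_pragma = "blah";
}
{
    no utf8;                     # literals in this block are read byte by byte
    $without_pragma = "blah";
}

# For an ASCII-only literal both readings produce the same string.
my $literals_equal = ($with_pragma eq $without_pragma) ? 1 : 0;

# Whatever decode_utf8 does to the flag, it does the same in both cases.
my $flag_with    = utf8::is_utf8(decode_utf8($with_pragma))    ? 1 : 0;
my $flag_without = utf8::is_utf8(decode_utf8($without_pragma)) ? 1 : 0;

print "literals equal: $literals_equal\n";
print "flag after decode: with=$flag_with without=$flag_without\n";
```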
There are three possible solutions to this:

1) Change all encodings to keep track of whether or not a code point above U+007F has been decoded and SvUTF8_(?on|off) accordingly
2) Change Encode::decode() to scan decoded strings for code points above U+007F and SvUTF8_off if no code points are above U+007F
3) Change documentation

1 or 2 isn't beneficial to me since most of my data contain Basic Latin and Latin-1 Supplement characters (Swedish), with occasional Miscellaneous Symbols and General Punctuation.

The question is if it's worth the overhead; even English texts make more and more use of General Punctuation.

--
chansen
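Option 2 can be approximated outside Encode with a small wrapper, which also makes the overhead in question visible. The following is a minimal sketch, not a proposed patch: decode_ascii_flag_off is a hypothetical helper name, and it uses the built-in utf8::downgrade rather than touching SvUTF8_off directly.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of Encode: decode, then turn the UTF8 flag
# off when every decoded character is in the ASCII range.
sub decode_ascii_flag_off {
    my ($encoding, $octets) = @_;
    my $str = Encode::decode($encoding, $octets);

    # This scan is the extra cost that option 2 adds to every decode.
    if (defined $str && $str !~ /[^\x00-\x7F]/) {
        utf8::downgrade($str);   # same characters, internal flag off
    }
    return $str;
}

my $ascii = decode_ascii_flag_off('UTF-8', "blah");
my $wide  = decode_ascii_flag_off('UTF-8', "\342\230\272");

printf "ascii flag=%d wide flag=%d\n",
    utf8::is_utf8($ascii) ? 1 : 0, utf8::is_utf8($wide) ? 1 : 0;
```

For data that is mostly non-ASCII, as described above, the scan runs on every string and almost never changes anything, which is the cost being weighed here.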
RT-Send-CC: chansen [...] cpan.org
On Sat Nov 12 17:32:33 2011, CHANSEN wrote:
> There are three possible solutions to this:
> 1) Change all encodings to keep track of whether or not a code point
> above U+007F has been decoded and SvUTF8_(?on|off) accordingly
> 2) Change Encode::decode() to scan decoded strings for code points
> above U+007F and SvUTF8_off if no code points are above U+007F
> 3) Change documentation
I would recommend changing the documentation and possibly removing the whole section on the UTF8 flag altogether.

The UTF8 flag is internal to Perl, or at least it is meant to be. All this discussion about it has led to much misunderstanding and chagrin over the years. The way it’s supposed to be is: A string is a string is a string. Previously the max char was 255. Now it’s higher.
Hm. Not sure here.

Modules like MIME::Base64 and Digest::SHA (newer versions) die with an error if they see a string with the utf8 bit set. (That looks correct, as those functions are defined only for bytes, not for characters.)

People obviously need to control the utf8 bit.

(In reply to SPROUT, Sun Nov 13 06:32:47 2011.)
From: victor [...] vsespb.ru
> Modules like MIME::Base64 and Digest::SHA (newer versions) die with error if see a string with utf8 bit set.

ignore this, this is just wrong

(In reply to vsespb, Thu Jan 03 19:28:20 2013.)
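The retraction can be checked for MIME::Base64 with a short sketch (Digest::SHA is left out here). It assumes a reasonably recent MIME::Base64, and uses utf8::upgrade to force the flag on an ASCII string; the variable names are invented. What makes encode_base64 die is a character above 0xFF, not the flag itself.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

my $bytes   = "hello";           # UTF8 flag off
my $flagged = "hello";
utf8::upgrade($flagged);         # UTF8 flag on, same characters

# Second argument "" suppresses the trailing newline.
my $b64_bytes   = encode_base64($bytes,   "");
my $b64_flagged = encode_base64($flagged, "");

# A character above 0xFF cannot be represented as a single byte,
# so this call dies instead of returning a result.
my $wide_ok = eval { encode_base64("\x{263a}", ""); 1 } ? 1 : 0;

print "bytes=$b64_bytes flagged=$b64_flagged wide_ok=$wide_ok\n";
```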
From: victor [...] vsespb.ru
On Thu May 15 07:31:57 2008, ROBM wrote:
> ---
> Goal #1:
>
> Old byte-oriented programs should not spontaneously break on the old
> byte-oriented data they used to work on.

> As you see, there is one exception, In ASCII. That way you can assume
> Goal #1. And with Encode Goal #2 is assumed but you still have to be
> careful in such cases mentioned in CAVEAT paragraphs.

> So this bug is actually a bug. The data is ASCII only, so the UTF8
> should be OFF after the decode.
>
> And fixing this bug should NOT be changing the documentation. This slows
> down the case where there is ASCII only data.
I think the point about Goal #1 is invalid here. ASCII data can get the utf-8 flag, for example, when splitting a non-ASCII string (with the flag on) into ASCII and non-ASCII parts: the ASCII part will have the utf8 bit on.

Also, "old byte-oriented" programs never deal with decode() or with any Unicode data, so they are not affected.

So, IMHO, the documentation about the flag behaviour for ASCII should be dropped (however, it would be better to add a notice to CAVEATS).
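The splitting case is easy to reproduce without Encode. In this sketch (variable names invented) a string containing U+263A is cut into pieces with split and substr, and the ASCII-only pieces still carry the flag:

```perl
use strict;
use warnings;

my $mixed = "abc\x{263a}def";    # contains U+263A, so the UTF8 flag is on

# Both pieces are pure ASCII but inherit the flag from $mixed.
my ($left, $right) = split /\x{263a}/, $mixed;
my $tail = substr($mixed, 4);    # "def"

printf "left=%s flag=%d\n", $left, utf8::is_utf8($left) ? 1 : 0;
printf "tail=%s flag=%d\n", $tail, utf8::is_utf8($tail) ? 1 : 0;
```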
On Sat Nov 12 21:32:47 2011, SPROUT wrote:
> On Sat Nov 12 17:32:33 2011, CHANSEN wrote:
> > There are three possible solutions to this:
> > 1) Change all encodings to keep track of whether or not a code point
> > above U+007F has been decoded and SvUTF8_(?on|off) accordingly
> > 2) Change Encode::decode() to scan decoded strings for code points
> > above U+007F and SvUTF8_off if no code points are above U+007F
> > 3) Change documentation
>
> I would recommend changing the documentation and possibly removing the
> whole section on the UTF8 flag altogether.
I'm in favour of removing the whole "The UTF8 flag" section from the documentation. Only the paragraph "This UTF8 flag is not visible in Perl scripts..." does not have to be removed; it could be moved to the section "Messing with Perl's Internals".

> The UTF8 flag is internal to Perl, or at least it is meant to be. All
> this discussion about it has led to much misunderstanding and chagrin
> over the years. The way it’s supposed to be is: A string is a string
> is a string. Previously the max char was 255. Now it’s higher.
+1
Subject: [rt.cpan.org #34259] +1 on retaining current behaviour
Date: Sat, 23 Sep 2017 12:17:51 -0700
To: bug-Encode [...] rt.cpan.org
From: Will Storey <will [...] summercat.com>
I think the current behaviour makes sense. For one thing, it can help check whether you've decoded. It is just generally simpler to understand as well. I'm in favour of updating the documentation. We could remove everything in the section after "When you decode, the resulting UTF8 flag is on".
See proposed change in the attachment.
Subject: 0001-Remove-misleading-documentation-about-UTF8-flag.patch
From 1116e0fc2c195e2135e9353d90bdfcfa228e9fbf Mon Sep 17 00:00:00 2001
From: Pali <pali@cpan.org>
Date: Fri, 29 Sep 2017 14:39:36 +0200
Subject: [PATCH] Remove misleading documentation about UTF8 flag

---
 Encode.pm | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/Encode.pm b/Encode.pm
index 6ed4a77..faf1f58 100644
--- a/Encode.pm
+++ b/Encode.pm
@@ -822,38 +822,11 @@ different kinds of strings and string-operations in Perl: one a
 byte-oriented mode for when the internal UTF8 flag is off, and the other a
 character-oriented mode for when the internal UTF8 flag is on.
 
-Here is how C<Encode> handles the UTF8 flag.
-
-=over 2
-
-=item *
-
-When you I<encode>, the resulting UTF8 flag is always B<off>.
-
-=item *
-
-When you I<decode>, the resulting UTF8 flag is B<on>--I<unless> you can
-unambiguously represent data. Here is what we mean by "unambiguously".
-After C<$str = decode("foo", $octet)>,
-
-  When $octet is...    The UTF8 flag in $str is
-  ---------------------------------------------
-  In ASCII only (or EBCDIC only)            OFF
-  In ISO-8859-1                              ON
-  In any other Encoding                      ON
-  ---------------------------------------------
-
-As you see, there is one exception: in ASCII. That way you can assume
-Goal #1. And with C<Encode>, Goal #2 is assumed but you still have to be
-careful in the cases mentioned in the B<CAVEAT> paragraphs above.
-
 This UTF8 flag is not visible in Perl scripts, exactly for the same reason you
 cannot (or rather, you I<don't have to>) see whether a scalar contains
 a string, an integer, or a floating-point number. But you can still peek
 and poke these if you will. See the next section.
 
-=back
-
 =head2 Messing with Perl's Internals
 
 The following API uses parts of Perl's internals in the current
-- 
2.11.0
On Fri Sep 29 08:41:16 2017, PALI wrote:
> See proposed change in the attachment.
Any comments?
Patch was included in Encode 2.94.