
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 87267
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: MARKF [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in:
  • 2.51
  • 2.40
  • 2.41
  • 2.42
  • 2.43
  • 2.44
  • 2.45
  • 2.46
  • 2.47
  • 2.48
  • 2.49
  • 2.50
Fixed in: 2.54



Subject: decode_utf8 doesn't do the same as decode("utf8")
decode_utf8 doesn't do the same as decode("utf8", ...) for all inputs, despite the documentation explicitly saying:

    $string = decode_utf8($octets [, CHECK]);
    Equivalent to "$string = decode("utf8", $octets [, CHECK])".

It acts differently when $octets has the UTF-8 flag turned on. decode("utf8", ...) treats each character in the string as a byte. decode_utf8 simply returns the string unaltered.

Failing test suite attached.
Subject: decode_utf_bug.t
#!/usr/bin/env perl

use strict;
use warnings;

use Encode;
use Test::More tests => 4;

# decode_utf8(...) and decode('utf8',...) are MEANT TO BE THE SAME
# from the perldoc for Encode:
#
#   $string = decode_utf8($octets [, CHECK]);
#   Equivalent to "$string = decode("utf8", $octets [, CHECK])".

#######
# decode_utf8($bytes)
#######

{
    my $bytes = "test:\x{ee}\x{80}\x{80}";
    my $chars = Encode::decode_utf8($bytes);
    is($chars, "test:\x{e000}", "decode_utf8 without utf-8 flag");
}

{
    my $bytes = "test:\x{ee}\x{80}\x{80}";

    # do something that makes the utf-8 flag turn on without
    # altering the contents of the string
    $bytes .= "\x{2603}";
    chop $bytes;

    my $chars = Encode::decode_utf8($bytes);
    is($chars, "test:\x{e000}", "decode_utf8 with utf-8 flag");
}

#######
# decode("utf8",$bytes)
#######

{
    my $bytes = "test:\x{ee}\x{80}\x{80}";
    my $chars = Encode::decode("utf-8",$bytes);
    is($chars, "test:\x{e000}", "decode('utf8',...) without utf-8 flag");
}

{
    my $bytes = "test:\x{ee}\x{80}\x{80}";

    # do something that makes the utf-8 flag turn on without
    # altering the contents of the string
    $bytes .= "\x{2603}";
    chop $bytes;

    my $chars = Encode::decode("utf-8",$bytes);
    is($chars, "test:\x{e000}", "decode('utf8',...) with utf-8 flag");
}
It's because decode_utf8($bytes) does nothing if $bytes has utf8 flag turned on. And while the document says "equivalent", it does not say "identical". Encode.pm defines decode_utf8 as follows:

sub decode_utf8($;$) {
    my ( $octets, $check ) = @_;
    return $octets if is_utf8($octets);
    return undef unless defined $octets;
    $octets .= '' if ref $octets;
    $check   ||= 0;
    $utf8enc ||= find_encoding('utf8');
    my $string = $utf8enc->decode( $octets, $check );
    $_[0] = $octets if $check and !ref $check and !( $check & LEAVE_SRC() );
    return $string;
}

Dan the Encode Maintainer

(In reply to MARKF, Wed Jul 24 15:03:37 2013)
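The early return on is_utf8($octets) quoted above is what makes the two calls diverge. As a minimal sketch of a flag-independent alternative, the helper below always goes through decode('utf8', ...). The name decode_utf8_always is hypothetical and not part of Encode; the two sample strings are built the same way as in the attached test.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, NOT part of Encode: always run the 'utf8' decoder,
# whatever the internal UTF8 flag of the argument happens to be.
sub decode_utf8_always {
    my ( $octets, $check ) = @_;
    return undef unless defined $octets;

    # Pass a copy ("$octets") so the caller's variable is never modified,
    # even when a CHECK value without LEAVE_SRC is supplied.
    return Encode::decode( 'utf8', "$octets", $check || 0 );
}

# Same characters, different internal representation.
my $plain   = "test:\x{ee}\x{80}\x{80}";    # UTF8 flag off
my $flagged = "test:\x{ee}\x{80}\x{80}";
$flagged .= "\x{2603}";                     # forces the UTF8 flag on
chop $flagged;                              # removes the snowman again

my $from_plain   = decode_utf8_always($plain);
my $from_flagged = decode_utf8_always($flagged);
```

Both results should be the single-character decoding "test:\x{e000}", because the decoder sees the same three octets in each case.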
From: victor [...] vsespb.ru
IMHO it's neither "equivalent" nor "identical". Maybe "similar", but the difference should be described in the documentation.

Also, encode_utf8 actually acts like encode("utf-8"), while it is described as "Equivalent" too.

(In reply to DANKOGAI, Thu Jul 25 07:37:24 2013)
From: victor [...] vsespb.ru
Btw, the following example prints different results depending on $ARGV[0]:

===============
use Encode;
use Devel::Peek;
use utf8;

my ($x, undef) = split(' ', decode("UTF-8", "X \xc2\xc6"));

my $s = "\xc2\xb5";

die unless $x eq 'X';
if (1 == $ARGV[0]) {
    $s .= $x;
} else {
    $s .= 'X';
}

Dump decode_utf8("$s");
Dump decode("UTF-8", "$s");
__END__

With ARGV[0] == 1

SV = PV(0x20f87f8) at 0x2013aa0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x201b140 "\303\202\302\265X"\0 [UTF8 "\x{c2}\x{b5}X"]
  CUR = 5
  LEN = 8
SV = PV(0x20f87d8) at 0x2013aa0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x2008400 "\302\265X"\0 [UTF8 "\x{b5}X"]
  CUR = 3
  LEN = 8

With ARGV[0] == 2

SV = PV(0x11e67f8) at 0x1101aa0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x10f7380 "\302\265X"\0 [UTF8 "\x{b5}X"]
  CUR = 3
  LEN = 8
SV = PV(0x1283c68) at 0x1101aa0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x10f6400 "\302\265X"\0 [UTF8 "\x{b5}X"]
  CUR = 3
  LEN = 8

So the missing documentation can cause hidden errors in some circumstances.

(In reply to vsespb, Fri Aug 16 21:24:06 2013)
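The difference between the two runs above comes from the UTF8 flag spreading through concatenation: appending a flagged 'X' upgrades the whole string, while appending a literal 'X' does not. The sketch below isolates that step using only the core utf8:: functions. The variable names are illustrative, and the flagged 'X' is built with the append-and-chop trick from the attached test rather than via split/decode.

```perl
use strict;
use warnings;

# Octets for U+00B5 (MICRO SIGN) encoded as UTF-8; the UTF8 flag is off.
my $octets = "\xc2\xb5";

# A one-character string that carries the UTF8 flag: append a wide
# character, then chop it off again.
my $flagged_x = "X\x{2603}";
chop $flagged_x;

# Concatenation upgrades the result, so the flag ends up on $joined even
# though its characters are still \xc2, \xb5 and X.
my $joined = $octets . $flagged_x;

# utf8::downgrade() switches a copy back to the byte representation.
# With a true second argument it returns false instead of dying when a
# character above 0xFF is present.
my $normalised = $joined;
my $ok = utf8::downgrade( $normalised, 1 );
```

After the downgrade, $normalised compares equal to $joined but no longer carries the flag, so a flag-sensitive function would treat the two differently even though they hold the same characters.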
From: victor [...] vsespb.ru
Equivalent (and Identical) ticket: https://rt.cpan.org/Public/Bug/Display.html?id=61671

(In reply to vsespb, Fri Aug 16 21:37:05 2013)
+1 to get this check eliminated.

Pull request open here: https://github.com/dankogai/p5-encode/pull/11

(In reply to vsespb, Fri Aug 16 13:24:06 2013)
From: victor [...] vsespb.ru
Or, an alternative pull request that just documents the current behaviour: https://github.com/dankogai/p5-encode/pull/10

(In reply to MIYAGAWA, Mon Aug 26 06:34:45 2013)
I have merged

https://github.com/dankogai/p5-encode/pull/11
https://github.com/dankogai/p5-encode/pull/10

Dan the Maintainer Thereof

(In reply to vsespb, Tue Aug 27 06:05:45 2013)
From: victor [...] vsespb.ru
Why did you merge both??? They contradict each other!!

(In reply to DANKOGAI, Thu Aug 29 18:52:11 2013)
Fixed with release of 2.54 (which reverted some of the documentation changes)
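For code that may still run against one of the releases listed as broken, one option is to check the installed Encode version at run time and fall back to decode('utf8', ...) on older installs. This is only a sketch: the helper names encode_has_fix and decode_utf8_compat are hypothetical, and the 2.54 threshold is taken from the "Fixed in" field of this ticket.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true when the installed Encode is at least the
# release this ticket lists as fixed. Encode->VERSION($min) dies when
# the installed version is lower than $min.
sub encode_has_fix {
    return eval { Encode->VERSION('2.54'); 1 } ? 1 : 0;
}

# Hypothetical helper: use decode_utf8 where it is expected to behave
# like decode('utf8', ...), and call decode('utf8', ...) directly otherwise.
sub decode_utf8_compat {
    my ($octets) = @_;
    return undef unless defined $octets;
    return encode_has_fix()
        ? Encode::decode_utf8("$octets")
        : Encode::decode( 'utf8', "$octets" );
}

# Input with the UTF8 flag on, built as in the attached test.
my $input = "test:\x{ee}\x{80}\x{80}";
$input .= "\x{2603}";
chop $input;

my $decoded = decode_utf8_compat($input);
```

Declaring a minimum Encode version in the distribution's prerequisites would avoid the run-time branch; the helper is mainly useful where the installed version cannot be controlled.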