Bug #34528 for CGI: utf-8 related difference between 5.8.8 and 5.10.0

Sat Mar 29 11:09:27 2008 SREZIC [...] cpan.org - Ticket created

Subject:

utf-8 related difference between 5.8.8 and 5.10.0

Creating a query string on the fly with highbit characters creates a different result with perl 5.8.8 and 5.10.0, both with the newest CGI.pm:

    eserte@biokovo (build/CGI.pm-3.35-YItPA8): perl5.8.8 -Mblib -MCGI -e 'warn CGI->new({a=>"\xfc"})->query_string'
    a=%FC at -e line 1.

    eserte@biokovo (build/CGI.pm-3.35-oneeIe): perl5.10.0 -Mblib -MCGI -e 'warn CGI->new({a=>"\xfc"})->query_string'
    a=%C3%BC at -e line 1.

Regards,
    Slaven

Mon Mar 31 11:05:03 2008 LDS [...] cpan.org - Status changed from 'new' to 'open'

Mon Mar 31 11:05:05 2008 LDS [...] cpan.org - Given to LDS

Mon Mar 31 11:05:28 2008 LDS [...] cpan.org - Correspondence added

Any idea which is the "correct" behavior?

Thu Dec 04 06:49:26 2008 mmaslano [...] redhat.com - Correspondence added

From:

mmaslano [...] redhat.com

On Mon Mar 31 11:05:28 2008, LDS wrote:

> Any idea which is the "correct" behavior?

I used a different reproducer with "readable" output:

    perl -MCGI -e '$t="peříčko"; my $te=CGI::escape($t);my $cgi_params=new CGI("text=$te");my $p=$cgi_params->param('text'); print{*STDERR} "P=$p LEN=",length($p),"\n";'

Output with perl-5.8.8:

    P=peříčko LEN=10

Output with perl-5.10.0:

    P=peÅÃÄko LEN=16

Could it be a result of this change?
http://perldoc.perl.org/perldelta.html#Packing-and-UTF-8-strings

I suppose this could be fixed in the escape function in Util.pm, but I haven't found a solution yet.

Thu Dec 04 09:31:18 2008 LDS [...] cpan.org - Correspondence added

I don't have a perl 5.10 handy to debug with. Could someone try replacing escape() and unescape() with the following simple versions?

    sub escape {
      shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
      my $toencode = shift;
      return undef unless defined($toencode);

    }

Thu Dec 04 09:34:10 2008 LDS [...] cpan.org - Correspondence added

Browser submitted too fast.... try again

Here are suggested replacements:

    sub escape {
      shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
      my $toencode = shift;
      return undef unless defined($toencode);
      $toencode=~s/([^a-zA-Z0-9_.~-])/uc sprintf("%%%02x",ord($1))/eg;
      return $toencode;
    }

    sub unescape {
      shift() if @_ > 0 and (ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
      my $todecode = shift;
      return undef unless defined($todecode);
      $todecode =~ tr/+/ /;       # pluses become spaces
      $todecode =~ s/%(?:([0-9a-fA-F]{2})/chr hex($1)/ge;
    }

Basically, this rips out all the UTF handling logic. I'd like to see what happens.

Fri Dec 05 02:00:18 2008 mmaslano [...] redhat.com - Correspondence added

From:

mmaslano [...] redhat.com

On Thu Dec 04 09:34:10 2008, LDS wrote:

> $todecode =~ tr/+/ /; # pluses become spaces
> $todecode =~ s/%(?:([0-9a-fA-F]{2})/chr hex($1)/ge;

One parenthesis '(' is missing here.

> }

But anyway, even with the parenthesis added it returns:

    P= LEN=0

Fri Dec 05 09:47:29 2008 LDS [...] cpan.org - Correspondence added

On Fri Dec 05 02:00:18 2008, mmaslano@redhat.com wrote:

> P= LEN=0

Sorry, it's also missing "return $todecode" before the last parenthesis.

Tue Dec 09 08:39:00 2008 mmaslano [...] redhat.com - Correspondence added

From:

mmaslano [...] redhat.com

On Fri Dec 05 09:47:29 2008, LDS wrote:

> Sorry, it's also missing "return $todecode" before the last parenthesis.

Hm, I wasn't looking for the next error after the first one ;-) Here's the output:

    P=pe%C5%99%C3%AD%C4%8Dko LEN=22

Wed Mar 25 09:51:30 2009 skasal [...] redhat.com - Correspondence added

From:

kasal [...] ucw.cz

Hello LDS,

I returned to this problem today. First, I can reproduce it as Marcela (mmaslano) has mentioned earlier, and also as it was reported in https://bugzilla.redhat.com/show_bug.cgi?id=472571

> > > $todecode =~ tr/+/ /; # pluses become spaces
> > > $todecode =~ s/%(?:([0-9a-fA-F]{2})/chr hex($1)/ge;

I made it:

    $todecode =~ s/%([0-9a-fA-F]{2})/chr hex($1)/ge;
    return $todecode;

With these modified versions of encode and decode, the bug disappears.

Indeed, the problem seems to be in this line:

    $toencode = eval { pack("C*", unpack("U0C*", $toencode))} || pack("C*", unpack("C*", $toencode));

The problem seems to be in the eval: it succeeds, replacing each two-byte UTF-8 encoded sequence with four bytes. And that's something that won't disappear in the later processing. Is the eval supposed to succeed at all?

Actually, if I comment out this one eval from escape() in the original CGI/Util.pm, things start working. But is this the right fix?
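For reference, the simplified pair discussed above can be tried without CGI.pm at all. The following is only a sketch under stated assumptions: the names simple_escape and simple_unescape are hypothetical, the method-call handling and the EBCDIC branch are left out, and the input is assumed to be a byte string (no code points above 0xFF).

```perl
use strict;
use warnings;

# Hypothetical stand-alone helpers, simplified from the versions quoted in
# this ticket. They assume byte-string input and are not CGI::Util code.
sub simple_escape {
    my ($s) = @_;
    return undef unless defined $s;
    # Percent-encode every byte outside the unreserved set.
    $s =~ s/([^a-zA-Z0-9_.~-])/sprintf("%%%02X", ord($1))/eg;
    return $s;
}

sub simple_unescape {
    my ($s) = @_;
    return undef unless defined $s;
    $s =~ tr/+/ /;                              # pluses become spaces
    $s =~ s/%([0-9a-fA-F]{2})/chr hex($1)/ge;   # %XX becomes one byte
    return $s;                                  # the return the first draft lacked
}

# UTF-8 bytes of "peříčko", written as an unflagged byte string.
my $bytes   = "pe\xc5\x99\xc3\xad\xc4\x8dko";
my $escaped = simple_escape($bytes);
my $back    = simple_unescape($escaped);

printf "E=%s LEN=%d same=%s\n",
    $escaped, length($back), ($back eq $bytes ? "yes" : "no");
# prints: E=pe%C5%99%C3%AD%C4%8Dko LEN=10 same=yes
```

With byte-string input the round trip is lossless, which matches the LEN=10 result reported for perl-5.8.8 earlier in the ticket.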

Mon Mar 30 12:37:08 2009 skasal [...] redhat.com - Correspondence added

From:

kasal [...] ucw.cz

I performed more experiments, and it seems that although

    pack("C*", unpack("U0C*", $toencode))

did return the original UTF-8 encoded string in perl-5.8.8, it no longer does so in perl-5.10. This change is why UTF-8 strings get garbled by CGI::escape.

The original intent was to have utf8::encode here; the problem is the same: when utf8::encode is called on a UTF-8 encoded string, the result is an invalid sequence of bytes.

OTOH, I found out that not only

    pack("C*", unpack("C*", $toencode))

but also

    pack("U0C*", unpack("U0C*", $toencode))

is safe in this situation.

So I'm going to put the latter in the Fedora perl and hope for the best.
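The utf8::encode behaviour described above can be observed in isolation. This is a minimal sketch and not code from CGI.pm; the variable names are made up for illustration. It assumes a perl where the utf8:: functions are built in (5.8.1 or later), and it deliberately avoids the pack/unpack templates, whose results differ between perl versions as noted in this ticket.

```perl
use strict;
use warnings;

# UTF-8 bytes for U+0159 ("ř"), held as a plain byte string (no UTF-8 flag).
my $bytes = "\xc5\x99";

# The same character as a Perl character string (UTF-8 flag set).
my $chars = "\x{159}";

# Unguarded: each byte is treated as a Latin-1 character and encoded again.
my $double = $bytes;
utf8::encode($double);

# Guarded: unflagged byte strings are left alone.
my $safe = $bytes;
utf8::encode($safe) if utf8::is_utf8($safe);

# Guarded: flagged character strings are converted to their UTF-8 bytes.
my $encoded = $chars;
utf8::encode($encoded) if utf8::is_utf8($encoded);

printf "unguarded: %d bytes\n", length $double;    # 4
printf "guarded:   %d bytes\n", length $safe;      # 2
printf "chars:     %d bytes\n", length $encoded;   # 2
```

The two-bytes-become-four growth in the unguarded case is the same ratio seen in the LEN=16 output earlier in the ticket, where each of the three two-byte sequences in the ten-byte input grew to four bytes.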

Mon Mar 30 15:57:00 2009 LDS [...] cpan.org - Correspondence added

Could you send me the complete code for encode and decode? I will incorporate it into the CGI.pm release.

Lincoln

Mon Apr 06 12:44:38 2009 skasal [...] redhat.com - Correspondence added

From:

skasal [...] redhat.com

On Mon Mar 30 15:57:00 2009, LDS wrote:

> Could you send me the complete code for encode and decode? I will
> incorporate it into the CGI.pm release.

I'm attaching a patch against CGI.pm-3.42. The code of escape is now:

    # URL-encode data
    #
    # We cannot use the %u escapes, they were rejected by W3C, so the official
    # way is %XX-escaped utf-8 encoding.
    # Naturally, Unicode strings have to be converted to their utf-8 byte
    # representation. (No action is required on 5.6.)
    # Byte strings were traditionally used directly as a sequence of octets.
    # This worked if they actually represented binary data (i.e. in CGI::Compress).
    # This also worked if these byte strings were actually utf-8 encoded; e.g.,
    # when the source file used utf-8 without the apropriate "use utf8;".
    # This fails if the byte string is actually a Latin 1 encoded string, but it
    # was always so and cannot be fixed without breaking the binary data case.
    # -- Stepan Kasal <skasal@redhat.com>
    #
    sub escape {
      shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
      my $toencode = shift;
      return undef unless defined($toencode);
      utf8::encode($toencode) if ($] > 5.007 && utf8::is_utf8($toencode));
      if ($EBCDIC) {
        $toencode=~s/([^a-zA-Z0-9_.~-])/uc sprintf("%%%02x",$E2A[ord($1)])/eg;
      } else {
        $toencode=~s/([^a-zA-Z0-9_.~-])/uc sprintf("%%%02x",ord($1))/eg;
      }
      return $toencode;
    }

2009-04-06  Stepan Kasal  <skasal@redhat.com>

	* t/util-58.t: Add tests reflecting common usage.
	* CGI/Util.pm (encode): State what conversions are needed, in
	accordance to the common usage mentioned above; and code it.

diff -ur CGI.pm-3.42/CGI/Util.pm CGI.pm-3.42/CGI/Util.pm
--- CGI.pm-3.42/CGI/Util.pm 2008-09-08 15:58:52.000000000 +0200
+++ CGI.pm-3.42/CGI/Util.pm 2009-04-04 16:30:29.000000000 +0200
@@ -210,7 +210,6 @@
   my $todecode = shift;
   return undef unless defined($todecode);
   $todecode =~ tr/+/ /; # pluses become spaces
-  $EBCDIC = "\t" ne "\011";
   if ($EBCDIC) {
     $todecode =~ s/%([0-9a-fA-F]{2})/chr $A2E[hex($1)]/ge;
   } else {
@@ -232,16 +231,24 @@
 }
 
 # URL-encode data
+#
+# We cannot use the %u escapes, they were rejected by W3C, so the official
+# way is %XX-escaped utf-8 encoding.
+# Naturally, Unicode strings have to be converted to their utf-8 byte
+# representation. (No action is required on 5.6.)
+# Byte strings were traditionally used directly as a sequence of octets.
+# This worked if they actually represented binary data (i.e. in CGI::Compress).
+# This also worked if these byte strings were actually utf-8 encoded; e.g.,
+# when the source file used utf-8 without the apropriate "use utf8;".
+# This fails if the byte string is actually a Latin 1 encoded string, but it
+# was always so and cannot be fixed without breaking the binary data case.
+# -- Stepan Kasal <skasal@redhat.com>
+#
 sub escape {
   shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
   my $toencode = shift;
   return undef unless defined($toencode);
-  $toencode = eval { pack("C*", unpack("U0C*", $toencode))} || pack("C*", unpack("C*", $toencode));
-
-  # force bytes while preserving backward compatibility -- dankogai
-  # but commented out because it was breaking CGI::Compress -- lstein
-  # $toencode = eval { pack("U*", unpack("U0C*", $toencode))} || pack("C*", unpack("C*", $toencode));
-
+  utf8::encode($toencode) if ($] > 5.007 && utf8::is_utf8($toencode));
   if ($EBCDIC) {
     $toencode=~s/([^a-zA-Z0-9_.~-])/uc sprintf("%%%02x",$E2A[ord($1)])/eg;
   } else {
diff -ur CGI.pm-3.42/t/util-58.t CGI.pm-3.42/t/util-58.t
--- CGI.pm-3.42/t/util-58.t 2003-04-14 20:32:22.000000000 +0200
+++ CGI.pm-3.42/t/util-58.t 2009-04-06 16:49:42.000000000 +0200
@@ -1,16 +1,29 @@
+# test CGI::Util::escape
+use Test::More tests => 4;
+use_ok("CGI::Util");
+
+# Byte strings should be escaped byte by byte:
+# 1) not a valid utf-8 sequence:
+my $uri = "pe\x{f8}\x{ed}\x{e8}ko.ogg";
+is(CGI::Util::escape($uri), "pe%F8%ED%E8ko.ogg", "Escape a Latin-2 string");
+
+# 2) is a valid utf-8 sequence, but not an UTF-8-flagged string
+# This happens often: people write utf-8 strings to source, but forget
+# to tell perl about it by "use utf8;"--this is obviously wrong, but we
+# have to handle it gracefully, for compatibility with GCI.pm under
+# perl-5.8.x
 #
-# This tests CGI::Util::escape() when fed with UTF-8-flagged string
-# -- dankogai
-BEGIN {
-  if ($] < 5.008) {
-    print "1..0 # \$] == $] < 5.008\n";
-    exit(0);
-  }
-}
+$uri = "pe\x{c5}\x{99}\x{c3}\x{ad}\x{c4}\x{8d}ko.ogg";
+is(CGI::Util::escape($uri), "pe%C5%99%C3%AD%C4%8Dko.ogg",
+  "Escape an utf-8 byte string");
 
-use Test::More tests => 2;
-use_ok("CGI::Util");
-my $uri = "\x{5c0f}\x{98fc} \x{5f3e}.txt"; # KOGAI, Dan, in Kanji
-is(CGI::Util::escape($uri), "%E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt",
-  "# Escape string with UTF-8 flag");
+SKIP:
+{
+  # This tests CGI::Util::escape() when fed with UTF-8-flagged string
+  # -- dankogai
+  skip("Unicode strings not available in $]", 1) if ($] < 5.008);
+  $uri = "\x{5c0f}\x{98fc} \x{5f3e}.txt"; # KOGAI, Dan, in Kanji
+  is(CGI::Util::escape($uri), "%E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt",
+    "Escape string with UTF-8 flag");
+}
 __END__
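The three cases exercised by the new t/util-58.t can also be checked against a stand-alone approximation of the patched logic. The sketch below is an assumption-laden illustration, not the code that shipped: escape_sketch is a hypothetical name, the method-call handling, the EBCDIC branch and the $] version check are omitted, and a perl with built-in utf8:: functions (5.8.1 or later) is assumed.

```perl
use strict;
use warnings;

# Hypothetical stand-alone approximation of the patched escape();
# not part of CGI::Util.
sub escape_sketch {
    my ($s) = @_;
    return undef unless defined $s;
    # Convert character strings to UTF-8 bytes; leave byte strings untouched.
    utf8::encode($s) if utf8::is_utf8($s);
    # Percent-encode every byte outside the unreserved set.
    $s =~ s/([^a-zA-Z0-9_.~-])/sprintf("%%%02X", ord($1))/eg;
    return $s;
}

# 1) Byte string that is not valid UTF-8 (Latin-2 bytes).
print escape_sketch("pe\x{f8}\x{ed}\x{e8}ko.ogg"), "\n";
# prints: pe%F8%ED%E8ko.ogg

# 2) Byte string that happens to be valid UTF-8, without the UTF-8 flag.
print escape_sketch("pe\x{c5}\x{99}\x{c3}\x{ad}\x{c4}\x{8d}ko.ogg"), "\n";
# prints: pe%C5%99%C3%AD%C4%8Dko.ogg

# 3) Character string with the UTF-8 flag set.
print escape_sketch("\x{5c0f}\x{98fc} \x{5f3e}.txt"), "\n";
# prints: %E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt
```

The is_utf8 guard is what keeps case 2 from being encoded a second time, while case 3 is still converted to bytes before percent-encoding.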

Mon Apr 06 14:32:54 2009 LDS [...] cpan.org - Correspondence added

Thanks. The patch will be going into version 3.43

Mon Apr 06 14:32:56 2009 LDS [...] cpan.org - Status changed from 'open' to 'resolved'

Fri May 23 14:28:50 2014 The RT System itself - Queue changed from CGI.pm to CGI
