This queue is for tickets about the CGI CPAN distribution.

Report information
The Basics
Id: 34528
Status: resolved
Priority: 0/
Queue: CGI

People
Owner: LDS [...] cpan.org
Requestors: SREZIC [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: (no value)
Fixed in: (no value)



Subject: utf-8 related difference between 5.8.8 and 5.10.0
Creating a query string on the fly with highbit characters creates a different result with perl 5.8.8 and 5.10.0, both with the newest CGI.pm:

    eserte@biokovo (build/CGI.pm-3.35-YItPA8): perl5.8.8 -Mblib -MCGI -e 'warn CGI->new({a=>"\xfc"})->query_string'
    a=%FC at -e line 1.
    eserte@biokovo (build/CGI.pm-3.35-oneeIe): perl5.10.0 -Mblib -MCGI -e 'warn CGI->new({a=>"\xfc"})->query_string'
    a=%C3%BC at -e line 1.

Regards,
    Slaven
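For readers following along, the two results above are the two possible percent-encodings of the same character: `%FC` is the single Latin-1 octet 0xFC, while `%C3%BC` is the UTF-8 encoding of U+00FC. A minimal sketch, assuming a hypothetical helper `percent_escape_bytes` (not part of CGI.pm) that escapes whatever octets it is handed:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CGI.pm: percent-escape every octet
# outside the unreserved set. Assumes the input is already a byte string.
sub percent_escape_bytes {
    my ($bytes) = @_;
    $bytes =~ s/([^a-zA-Z0-9_.~-])/sprintf("%%%02X", ord($1))/eg;
    return $bytes;
}

my $latin1 = "\xfc";     # one octet: 0xFC
my $utf8   = "\xfc";
utf8::encode($utf8);     # now two octets: 0xC3 0xBC

print "a=", percent_escape_bytes($latin1), "\n";   # a=%FC
print "a=", percent_escape_bytes($utf8),   "\n";   # a=%C3%BC
```

Which of the two is "correct" depends on whether the input is meant as raw octets or as a character string, which is the question the rest of the ticket works through.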
Any idea which is the "correct" behavior?
From: mmaslano [...] redhat.com
On Mon Mar 31 11:05:28 2008, LDS wrote:
> Any idea which is the "correct" behavior?
I used a different reproducer, with "readable" output:

    perl -MCGI -e '$t="peříčko"; my $te=CGI::escape($t);my $cgi_params=new CGI("text=$te");my $p=$cgi_params->param('text'); print{*STDERR} "P=$p LEN=",length($p),"\n";'

    output with perl-5.8.8
    P=peříčko LEN=10
    output with perl-5.10.0
    P=peÅíÄko LEN=16

Could it be a result of this change?
http://perldoc.perl.org/perldelta.html#Packing-and-UTF-8-strings
I suppose this could be fixed in the escape function in Util.pm, but I haven't found a solution yet.
I don't have a perl 5.10 handy to debug with. Could someone try replacing escape() and unescape() with the following simple versions?

    sub escape {
      shift() if @_ > 1 and ( ref($_[0]) ||
        (defined $_[1] && $_[0] eq $CGI::DefaultClass));
      my $toencode = shift;
      return undef unless defined($toencode);

    }
Browser submitted too fast.... try again

Here are suggested replacements:

    sub escape {
      shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
      my $toencode = shift;
      return undef unless defined($toencode);
      $toencode=~s/([^a-zA-Z0-9_.~-])/uc sprintf("%%%02x",ord($1))/eg;
      return $toencode;
    }

    sub unescape {
      shift() if @_ > 0 and (ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
      my $todecode = shift;
      return undef unless defined($todecode);
      $todecode =~ tr/+/ /;       # pluses become spaces
      $todecode =~ s/%(?:([0-9a-fA-F]{2})/chr hex($1)/ge;
    }

Basically, this rips out all the UTF handling logic. I'd like to see what happens.
From: mmaslano [...] redhat.com
On Thu Dec 04 09:34:10 2008, LDS wrote:
> Browser submitted too fast.... try again
>
> Here are suggested replacements:
>
> sub escape {
>   shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq
> $CGI::DefaultClass));
>   my $toencode = shift;
>   return undef unless defined($toencode);
>   $toencode=~s/([^a-zA-Z0-9_.~-])/uc sprintf("%%%02x",ord($1))/eg;
>   return $toencode;
> }
>
> sub unescape {
>   shift() if @_ > 0 and (ref($_[0]) || (defined $_[1] && $_[0] eq
> $CGI::DefaultClass));
>   my $todecode = shift;
>   return undef unless defined($todecode);
>   $todecode =~ tr/+/ /;       # pluses become spaces
>   $todecode =~ s/%(?:([0-9a-fA-F]{2})/chr hex($1)/ge;
There is an unbalanced parenthesis here.
> }
>
But anyway, even with the parenthesis fixed, it returns:

    P= LEN=0
On Fri Dec 05 02:00:18 2008, mmaslano@redhat.com wrote:
> But anyway, even with the parenthesis fixed, it returns: P= LEN=0
Sorry, it's also missing "return $todecode" before the closing brace.
From: mmaslano [...] redhat.com
On Fri Dec 05 09:47:29 2008, LDS wrote:
> Sorry, it's also missing "return $todecode" before the closing brace.
Hm, I wasn't looking for the next error after the first one ;-) Here's the output:

    P=pe%C5%99%C3%AD%C4%8Dko LEN=22
From: kasal [...] ucw.cz
Hello LDS,

I returned to this problem today. First, I can reproduce it as Marcela (mmaslano) has mentioned earlier, and also as it was reported in https://bugzilla.redhat.com/show_bug.cgi?id=472571

> > > $todecode =~ tr/+/ /;       # pluses become spaces
> > > $todecode =~ s/%(?:([0-9a-fA-F]{2})/chr hex($1)/ge;

I made it:

    $todecode =~ s/%([0-9a-fA-F]{2})/chr hex($1)/ge;
    return $todecode;

With these modified versions of escape() and unescape(), the bug disappears. Indeed, the problem seems to be in the line

    $toencode = eval { pack("C*", unpack("U0C*", $toencode))} || pack("C*", unpack("C*", $toencode));

The problem seems to be in the eval: it succeeds, replacing each two-byte UTF-8 sequence with four bytes. And that's something that won't disappear in the later processing.

Is the eval supposed to succeed at all? Actually, if I comment out this one eval in escape() in the original CGI/Util.pm, things start working. But is this the right fix?
From: kasal [...] ucw.cz
I performed more experiments, and it seems that although

    pack("C*", unpack("U0C*", $toencode))

did return the original UTF-8 encoded string in perl-5.8.8, this is no longer so in perl-5.10. This change is why UTF-8 strings get garbled by CGI::escape.

The original intent was to have utf8::encode here; the problem is the same: when utf8::encode is called on a string that is already UTF-8 encoded, the result is a double-encoded, garbled sequence of bytes.

OTOH, I found out that not only

    pack("C*", unpack("C*", $toencode))

but also

    pack("U0C*", unpack("U0C*", $toencode))

is safe in this situation.

So I'm going to put the latter in the Fedora perl and hope for the best.
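The double-encoding effect described above can be reproduced without CGI.pm at all. This is only a sketch, using the byte sequence from the earlier "peříčko" reproducer; it calls utf8::encode directly rather than the pack/unpack expression, whose behaviour differs between perl versions. It also shows one possible guard, checking utf8::is_utf8 first, which is an assumption here and not necessarily what any released CGI.pm does at this point in the thread.

```perl
use strict;
use warnings;

# "peříčko" as UTF-8 octets, without the UTF-8 flag (as if the source
# file had no "use utf8;").
my $octets = "pe\xc5\x99\xc3\xad\xc4\x8dko";
print length($octets), "\n";     # 10

# Encoding again treats every octet as a Latin-1 character, so each of
# the six high-bit octets turns into two octets.
my $twice = $octets;
utf8::encode($twice);
print length($twice), "\n";      # 16

# Guarded variant: only encode strings that perl has flagged as
# character strings; plain byte strings pass through untouched.
my $guarded = $octets;
utf8::encode($guarded) if utf8::is_utf8($guarded);
print length($guarded), "\n";    # 10
```

The lengths 10 and 16 match the LEN values reported for perl-5.8.8 and perl-5.10.0 earlier in the ticket.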
Could you send me the complete code for encode and decode? I will incorporate it into the CGI.pm release.

Lincoln
From: skasal [...] redhat.com
On Mon Mar 30 15:57:00 2009, LDS wrote:
> Could you send me the complete code for encode and decode? I will
> incorporate it into the CGI.pm release.
I'm attaching a patch against CGI.pm-3.42. The code of escape is now:

    # URL-encode data
    #
    # We cannot use the %u escapes, they were rejected by W3C, so the official
    # way is %XX-escaped utf-8 encoding.
    # Naturally, Unicode strings have to be converted to their utf-8 byte
    # representation.  (No action is required on 5.6.)
    # Byte strings were traditionally used directly as a sequence of octets.
    # This worked if they actually represented binary data (i.e. in CGI::Compress).
    # This also worked if these byte strings were actually utf-8 encoded; e.g.,
    # when the source file used utf-8 without the apropriate "use utf8;".
    # This fails if the byte string is actually a Latin 1 encoded string, but it
    # was always so and cannot be fixed without breaking the binary data case.
    # -- Stepan Kasal <skasal@redhat.com>
    #
    sub escape {
      shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
      my $toencode = shift;
      return undef unless defined($toencode);
      utf8::encode($toencode) if ($] > 5.007 && utf8::is_utf8($toencode));
        if ($EBCDIC) {
          $toencode=~s/([^a-zA-Z0-9_.~-])/uc sprintf("%%%02x",$E2A[ord($1)])/eg;
        } else {
          $toencode=~s/([^a-zA-Z0-9_.~-])/uc sprintf("%%%02x",ord($1))/eg;
        }
      return $toencode;
    }
2009-04-06  Stepan Kasal  <skasal@redhat.com>

	* t/util-58.t: Add tests reflecting common usage.
	* CGI/Util.pm (encode): State what conversions are needed, in
	accordance to the common usage mentioned above; and code it.

diff -ur CGI.pm-3.42/CGI/Util.pm CGI.pm-3.42/CGI/Util.pm
--- CGI.pm-3.42/CGI/Util.pm	2008-09-08 15:58:52.000000000 +0200
+++ CGI.pm-3.42/CGI/Util.pm	2009-04-04 16:30:29.000000000 +0200
@@ -210,7 +210,6 @@
   my $todecode = shift;
   return undef unless defined($todecode);
   $todecode =~ tr/+/ /;       # pluses become spaces
-  $EBCDIC = "\t" ne "\011";
     if ($EBCDIC) {
       $todecode =~ s/%([0-9a-fA-F]{2})/chr $A2E[hex($1)]/ge;
     } else {
@@ -232,16 +231,24 @@
 }
 
 # URL-encode data
+#
+# We cannot use the %u escapes, they were rejected by W3C, so the official
+# way is %XX-escaped utf-8 encoding.
+# Naturally, Unicode strings have to be converted to their utf-8 byte
+# representation.  (No action is required on 5.6.)
+# Byte strings were traditionally used directly as a sequence of octets.
+# This worked if they actually represented binary data (i.e. in CGI::Compress).
+# This also worked if these byte strings were actually utf-8 encoded; e.g.,
+# when the source file used utf-8 without the apropriate "use utf8;".
+# This fails if the byte string is actually a Latin 1 encoded string, but it
+# was always so and cannot be fixed without breaking the binary data case.
+# -- Stepan Kasal <skasal@redhat.com>
+#
 sub escape {
   shift() if @_ > 1 and ( ref($_[0]) || (defined $_[1] && $_[0] eq $CGI::DefaultClass));
   my $toencode = shift;
   return undef unless defined($toencode);
-  $toencode = eval { pack("C*", unpack("U0C*", $toencode))} || pack("C*", unpack("C*", $toencode));
-
-  # force bytes while preserving backward compatibility -- dankogai
-  # but commented out because it was breaking CGI::Compress -- lstein
-  # $toencode = eval { pack("U*", unpack("U0C*", $toencode))} || pack("C*", unpack("C*", $toencode));
-
+  utf8::encode($toencode) if ($] > 5.007 && utf8::is_utf8($toencode));
     if ($EBCDIC) {
       $toencode=~s/([^a-zA-Z0-9_.~-])/uc sprintf("%%%02x",$E2A[ord($1)])/eg;
     } else {
diff -ur CGI.pm-3.42/t/util-58.t CGI.pm-3.42/t/util-58.t
--- CGI.pm-3.42/t/util-58.t	2003-04-14 20:32:22.000000000 +0200
+++ CGI.pm-3.42/t/util-58.t	2009-04-06 16:49:42.000000000 +0200
@@ -1,16 +1,29 @@
+# test CGI::Util::escape
+use Test::More tests => 4;
+use_ok("CGI::Util");
+
+# Byte strings should be escaped byte by byte:
+# 1) not a valid utf-8 sequence:
+my $uri = "pe\x{f8}\x{ed}\x{e8}ko.ogg";
+is(CGI::Util::escape($uri), "pe%F8%ED%E8ko.ogg", "Escape a Latin-2 string");
+
+# 2) is a valid utf-8 sequence, but not an UTF-8-flagged string
+#    This happens often: people write utf-8 strings to source, but forget
+#    to tell perl about it by "use utf8;"--this is obviously wrong, but we
+#    have to handle it gracefully, for compatibility with GCI.pm under
+#    perl-5.8.x
 #
-# This tests CGI::Util::escape() when fed with UTF-8-flagged string
-# -- dankogai
-BEGIN {
-    if ($] < 5.008) {
-       print "1..0 # \$] == $] < 5.008\n";
-       exit(0);
-    }
-}
+$uri = "pe\x{c5}\x{99}\x{c3}\x{ad}\x{c4}\x{8d}ko.ogg";
+is(CGI::Util::escape($uri), "pe%C5%99%C3%AD%C4%8Dko.ogg",
+   "Escape an utf-8 byte string");
 
-use Test::More tests => 2;
-use_ok("CGI::Util");
-my $uri = "\x{5c0f}\x{98fc} \x{5f3e}.txt"; # KOGAI, Dan, in Kanji
-is(CGI::Util::escape($uri), "%E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt",
-   "# Escape string with UTF-8 flag");
+SKIP:
+{
+    # This tests CGI::Util::escape() when fed with UTF-8-flagged string
+    # -- dankogai
+    skip("Unicode strings not available in $]", 1) if ($] < 5.008);
+    $uri = "\x{5c0f}\x{98fc} \x{5f3e}.txt"; # KOGAI, Dan, in Kanji
+    is(CGI::Util::escape($uri), "%E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt",
+       "Escape string with UTF-8 flag");
+}
 __END__
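To try the patched logic without installing a patched CGI.pm, the core of escape() can be sketched as a standalone function. The name `sketch_escape` is hypothetical, the EBCDIC branch and the class/object argument handling are left out, and the version check on `$]` is dropped on the assumption of a perl newer than 5.8; the three inputs and expected strings are the ones from the t/util-58.t tests in the patch.

```perl
use strict;
use warnings;

# Standalone sketch of the patched escape(); sketch_escape is a
# hypothetical name and this is not the CGI::Util code itself.
sub sketch_escape {
    my ($toencode) = @_;
    return undef unless defined $toencode;
    # Character strings are converted to UTF-8 octets; byte strings are
    # used as they are.
    utf8::encode($toencode) if utf8::is_utf8($toencode);
    $toencode =~ s/([^a-zA-Z0-9_.~-])/uc sprintf("%%%02x", ord($1))/eg;
    return $toencode;
}

# 1) byte string that is not valid UTF-8
print sketch_escape("pe\x{f8}\x{ed}\x{e8}ko.ogg"), "\n";
# pe%F8%ED%E8ko.ogg

# 2) byte string that happens to be valid UTF-8
print sketch_escape("pe\x{c5}\x{99}\x{c3}\x{ad}\x{c4}\x{8d}ko.ogg"), "\n";
# pe%C5%99%C3%AD%C4%8Dko.ogg

# 3) UTF-8-flagged character string
print sketch_escape("\x{5c0f}\x{98fc} \x{5f3e}.txt"), "\n";
# %E5%B0%8F%E9%A3%BC%20%E5%BC%BE.txt
```

As the comment in the patch notes, case 1 means a Latin-1 byte string is escaped octet by octet rather than converted to UTF-8 first; that trade-off keeps binary data intact.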
Thanks. The patch will be going into version 3.43