This queue is for tickets about the CGI CPAN distribution.

Report information
The Basics
Id: 19913
Status: resolved
Priority: 0/
Queue: CGI

People
Owner: MARKSTOS [...] cpan.org
Requestors: yves.lejeune [...] kodak.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: (no value)



Subject: param() returns bytes rather than a perl utf-8 string
- Create a form, in either GET or POST mode, in a page whose charset is UTF-8, using the CGI::charset('utf-8') function and "binmode STDOUT, ':utf8'".
- Use Internet Explorer (6.0) or Firefox to answer the form.
- The browser sends the form parameters encoded in the charset of the form, i.e. utf-8. For a form in GET mode, one can see the utf-8 bytes encoded one by one, e.g. %C3%A9 for é.
- The param() function returns the right string value, but it is not marked as utf-8.

This is rather unexpected for me, because I have written the form using perl utf-8 strings, not raw utf-8 bytes.

I am not sure this would be considered as a bug by everybody, but it would be nice to have at least a note in the perldoc.
You have to do that manually:

    use Encode;
    my $foo = decode utf8 => $cgi->param('foo');
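The difference between the raw bytes and a decoded Perl string can be seen without CGI.pm at all. The following is a minimal sketch that assumes the parameter value arrives as the two octets that %C3%A9 unescapes to; the variable names are illustrative only and do not come from the ticket.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Octets as they would look after URL-unescaping %C3%A9 (UTF-8 for "é").
my $raw = "\xC3\xA9";

# Decode the UTF-8 octets into a Perl character string.
my $text = decode('UTF-8', $raw);

# The raw value is two bytes and unflagged; the decoded value is one
# character and carries Perl's internal UTF-8 flag.
printf "raw:  length=%d flagged=%d\n", length($raw),  is_utf8($raw)  ? 1 : 0;
printf "text: length=%d flagged=%d\n", length($text), is_utf8($text) ? 1 : 0;
```

Both values print the same glyph on a UTF-8 terminal once output layers are set up, which is probably why the undecoded value looks "right" at first glance.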
Subject: Re: [rt.cpan.org #19913] param() returns bytes rather than a perl utf-8 string
Date: Fri, 16 Jun 2006 10:46:50 +0200
To: bug-CGI.pm [...] rt.cpan.org
From: yves.lejeune [...] kodak.com
Thanks, I have started using the decode_utf8() function. Something strange in the module CGI/Util.pm is that when a character of a parameter is encoded as a numeric Unicode sequence, the returned perl string will be marked as utf-8 (in the latest version of the CGI module). Yves.
On Thu Jun 15 14:13:05 2006, BURAK wrote:
> You have to do that manually:
>
> use Encode;
> my $foo = decode utf8 => $cgi->param('foo');

I think this is bad behavior for the module. Every time I get a parameter I have to call 'decode()' manually. A better solution (IMHO) would be for me to tell the CGI object the encoding of my GET parameters, so that '$cgi->param()' decodes them for me transparently (and returns undef if the value is not valid UTF-8).

That way I can change the encoding of my website (from ISO-8859-1 to UTF-8, ...) by fixing a single line in my source rather than dozens of '$cgi->param()' calls.

    my $cgi = CGI->new();
    $cgi->param_encoding('utf-8');
    my $value = $cgi->param('name');
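Until something like that exists in the module, the same idea can be approximated in application code. The sketch below is hypothetical: neither $PARAM_ENCODING nor decode_param_value() is part of CGI.pm; they only illustrate keeping the encoding in one place and returning undef for invalid input.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical site-wide setting: the single line to change when the
# site's encoding changes. Not a CGI.pm feature.
my $PARAM_ENCODING = 'UTF-8';

# Hypothetical helper: decode one raw parameter value, or return undef
# when the octets are not valid in the configured encoding.
sub decode_param_value {
    my ($octets) = @_;
    return undef unless defined $octets;

    # decode() with a CHECK argument may modify its input, so work on a copy.
    my $copy = $octets;
    my $text = eval { Encode::decode($PARAM_ENCODING, $copy, Encode::FB_CROAK) };
    return $text;    # undef if decode() croaked on malformed input
}

my $good = decode_param_value("\xC3\xA9");   # valid UTF-8 for "é"
my $bad  = decode_param_value("\xC3\x28");   # invalid UTF-8 sequence
```

In a real script the argument would be the return value of $cgi->param('name'), called in scalar context.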
From: LDS [...] cpan.org
I'm always scared about touching the unicode stuff because I don't fully understand it. However, give this version of CGI a try -- when it detects that charset is set to "utf-8" (all lowercase with the dash), it will automatically call decode() on the CGI values when it first parses them. Lincoln
Attachment: CGI.pm-3.21.tar.gz (application/x-gzip, 224.2k)

From: LDS [...] cpan.org
Sorry, that version didn't work at all. Please try this version out. It runs decode() every time you call param(), so there is some overhead. Lincoln
Attachment: CGI.pm-3.21.tar.gz (application/x-tgz, 224.3k)

Subject: Re: [rt.cpan.org #19913]
Date: Wed, 22 Aug 2007 06:52:59 +0200
To: bug-CGI.pm [...] rt.cpan.org
From: "Marian Ďurkovič" <md [...] bts.sk>
I'm afraid this change needs to be reverted - it causes serious problems. Please see #27104 and #24804
CC: yves.lejeune [...] kodak.com
Subject: Re: [rt.cpan.org #19913]
Date: Wed, 22 Aug 2007 11:01:59 +0200
To: bug-CGI.pm [...] rt.cpan.org
From: yves.lejeune [...] carestreamhealth.com
You are probably right. Note that in my original bug report, my conclusion was: "I am not sure this would be considered as a bug by everybody, but it would be nice to have at least a note in the perldoc".

I think I now understand what was inconsistent in my analysis: CGI cannot guess the relationship between the form using UTF-8 parameters and the parameters of the resulting POST or GET request. CGI can only process the request parameters in the general case, i.e. as a string of bytes. The knowledge that the request parameters are UTF-8 strings can only reside in the application code.

At best there could be an optional feature, something like an option "all_request_parameters_are_utf8_strings", that would do the job in this specific context. Setting this option would, for instance, prevent handling the upload of binary files.

Best regards, Yves Lejeune.
From: LDS [...] cpan.org
I have backed out the utf-8 decoding code and added a -utf8 argument that you can pass to "use CGI" at run time. Please try this version.
Attachment: CGI.pm-3.30.tar.gz (application/binary, 232.6k)

From: mistbeul [...] web.de
On Wed Aug 22 12:03:02 2007, LDS wrote:
> I have backed out the utf-8 decoding code and added a -utf8 argument
> that you can pass to "use CGI" at run time. Please try this version.

Decoding in a mutator does not seem like a good idea. Client code might use param() to fetch a parameter (which utf8-decodes it), process the returned value, and store it back using the same param() method. Elsewhere in the program, it might call param() again, thereby utf8-decoding once more what has already been decoded, and boom.

As a quick workaround, I'd suggest modifying line 455 in v3.30 so that decoding only happens if is_utf8($_) is false.

Best regards, Bodo
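The double-decode problem and the proposed guard can be reproduced with Encode alone. This is a minimal sketch, not the code from CGI.pm: decode_once() is a hypothetical stand-in for the guarded decoding step, and the sample value is assumed for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical guard: decode only values that are not already flagged
# as character strings, so a second pass leaves them untouched.
sub decode_once {
    my ($val) = @_;
    return $val if Encode::is_utf8($val);
    return Encode::decode('UTF-8', $val);
}

my $octets = "\xC3\xA9";               # UTF-8 bytes for "é"
my $first  = decode_once($octets);     # decoded to a single character
my $second = decode_once($first);      # guarded: returned unchanged

# Without the guard, the already-decoded string is treated as bytes again
# and no longer comes back as "é".
my $naive = Encode::decode('UTF-8', $first);

print "guarded second pass is unchanged\n" if $second eq $first;
print "unguarded second pass is corrupted\n" if $naive ne $first;
```

The guard relies on Perl's internal UTF-8 flag, so it is a heuristic: a decoded string that happens to be stored unflagged would still be decoded a second time.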
This ticket hasn't been commented on in a long time. Does the UTF-8 handling in 3.43 look OK to you? Mark
The attached patch against 3.45 includes a test case and a fix for the "double UTF-8 decode()" scenario described in this ticket. Basically, it skips the Encode::decode() call if Encode::is_utf8() is true for the value (it's already flagged as UTF-8). Let me know if you have any questions.
diff --git a/lib/CGI.pm b/lib/CGI.pm
index cacb03a..1b90a6a 100644
--- a/lib/CGI.pm
+++ b/lib/CGI.pm
@@ -455,12 +455,23 @@ sub param {
 
   if ($PARAM_UTF8) {
     eval "require Encode; 1;" unless Encode->can('decode'); # bring in these functions
-    @result = map {ref $_ ? $_ : Encode::decode(utf8=>$_) } @result;
+    @result = map {ref $_ ? $_ : $self->_decode_utf8($_) } @result;
   }
 
   return wantarray ? @result : $result[0];
 }
 
+sub _decode_utf8 {
+    my ($self, $val) = @_;
+
+    if (Encode::is_utf8($val)) {
+        return $val;
+    }
+    else {
+        return Encode::decode(utf8 => $val);
+    }
+}
+
 sub self_or_default {
     return @_ if defined($_[0]) && (!ref($_[0])) &&($_[0] eq 'CGI');
     unless (defined($_[0]) &&
diff --git a/t/utf8.t b/t/utf8.t
new file mode 100644
index 0000000..8b5ad23
--- /dev/null
+++ b/t/utf8.t
@@ -0,0 +1,34 @@
+##!./perl -wT
+
+use strict;
+use utf8;
+use lib qw(t/lib);
+
+# Due to a bug in older versions of MakeMaker & Test::Harness, we must
+# ensure the blib's are in @INC, else we might use the core CGI.pm
+use lib qw(blib/lib blib/arch);
+
+use Test::More tests => 7;
+use Encode;
+
+use_ok( 'CGI' );
+
+ok( my $q = CGI->new(), 'create a new CGI object' );
+
+$CGI::PARAM_UTF8 = 1;
+
+my $data = 'áéíóúµ';
+ok Encode::is_utf8($data), "created UTF-8 encoded data string";
+
+# now set the param.
+$q->param(data => $data);
+
+# if param() runs the data through Encode::decode(), this will fail.
+is $q->param('data'), $data;
+
+# make sure setting bytes decodes properly
+my $bytes = Encode::encode(utf8 => $data);
+ok !Encode::is_utf8($bytes), "converted UTF-8 to bytes";
+$q->param(data => $bytes);
+is $q->param('data'), $data;
+ok Encode::is_utf8($q->param('data')), 'param() decoded UTF-8';
Thanks, this is patched in my github repo now.
Subject: Thanks, released
The patch for this ticket has now been released in CGI.pm 3.47, and this ticket is considered resolved. Thanks again for your help to improve CGI.pm! Mark