Bug #19913 for CGI: param() returns bytes rather than a perl utf-8 string

Thu Jun 15 05:08:17 2006 Guest - Ticket created

Subject:	param() returns bytes rather than a perl utf-8 string

- Create a form, in either GET or POST mode, in a page whose charset is UTF-8, using the CGI::charset('utf-8') function and "binmode STDOUT, ':utf8'".
- Use Internet Explorer (6.0) or Firefox to answer the form.
- The browser sends the form parameters encoded in the charset of the form, i.e. UTF-8. For a form in GET mode, one can see the UTF-8 bytes encoded one by one, e.g. %C3%A9 for é.
- The param() function returns the right string value, but it is not marked as UTF-8.

This is rather unexpected for me, because I have written the form using perl utf-8 strings, not raw utf-8 bytes.

I am not sure this would be considered as a bug by everybody, but it would be nice to have at least a note in the perldoc.
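The behaviour described in this report can be reproduced without a browser. The following is a minimal sketch, assuming that a CGI library percent-decodes values into raw bytes and stops there. The helper name percent_decode is hypothetical and is not part of CGI.pm; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn %XX escapes into raw bytes, roughly what a
# CGI library does before handing a value back to the caller.
sub percent_decode {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $bytes;
}

my $raw   = percent_decode('%C3%A9');    # two bytes: 0xC3 0xA9
my $chars = decode('UTF-8', $raw);       # one character: U+00E9

# Report the length and the internal UTF-8 flag of each string.
printf "raw:   length=%d flagged=%d\n", length($raw),   utf8::is_utf8($raw)   ? 1 : 0;
printf "chars: length=%d flagged=%d\n", length($chars), utf8::is_utf8($chars) ? 1 : 0;
```

The raw value holds the right bytes, but Perl treats it as two separate characters until it is explicitly decoded.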

Thu Jun 15 14:13:05 2006 burak [...] cpan.org - Correspondence added

You have to do that manually:

    use Encode;
    my $foo = decode utf8 => $cgi->param('foo');

Thu Jun 15 14:13:06 2006 The RT System itself - Status changed from 'new' to 'open'

Fri Jun 16 04:47:27 2006 yves.lejeune [...] kodak.com - Correspondence added

Subject:	Re: [rt.cpan.org #19913] param() returns bytes rather than a perl utf-8 string
Date:	Fri, 16 Jun 2006 10:46:50 +0200
To:	bug-CGI.pm [...] rt.cpan.org
From:	yves.lejeune [...] kodak.com

Thanks, I have started using the decode_utf8() function.

Something strange in the module CGI/Util.pm is that when a character of a parameter is encoded as a numeric unicode sequence, the returned perl string is marked as utf-8 (in the latest version of the CGI module).

Yves.

Thu Jul 20 05:23:42 2006 fuzz [...] namm.de - Correspondence added

On Thu Jun 15 14:13:05 2006, BURAK wrote:

> You have to do that manually:
>
> use Encode;
> my $foo = decode utf8 => $cgi->param('foo');

I think this is a bad behavior/feature of the module. Every time I get a parameter I have to call 'decode()' manually. A better solution (IMHO) would be to tell the CGI object the encoding of my GET parameters and have '$cgi->param()' decode them for me transparently. (And if a value is not valid UTF-8, return undef.)

That way I can fix/change the encoding of my website (from ISO-8859-1 to UTF-8, ...) by changing a single line in my source rather than dozens of '$cgi->param()' calls.

    my $cgi = CGI->new();
    $cgi->param_encoding('utf-8');
    my $value = $cgi->param('name');
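The proposal above (one place that names the encoding, with undef for invalid input) can be sketched with the core Encode module alone. This is only an illustration under stated assumptions: decode_param is a hypothetical helper, and param_encoding() as written in the proposal is a suggested method, not an existing CGI.pm API.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of CGI.pm: decode a raw parameter value
# with the given encoding and return undef if the bytes are not valid.
sub decode_param {
    my ($raw, $encoding) = @_;
    return undef unless defined $raw;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $raw in place.
    my $decoded = eval {
        Encode::decode($encoding, $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $decoded;    # undef when decode() died
}

my $ok  = decode_param("\xC3\xA9", 'UTF-8');    # valid UTF-8 for U+00E9
my $bad = decode_param("\xE9",     'UTF-8');    # lone Latin-1 byte, invalid UTF-8

print defined $ok  ? "valid\n" : "invalid\n";
print defined $bad ? "valid\n" : "invalid\n";
```

With such a helper, moving a site from ISO-8859-1 to UTF-8 would mean changing the encoding argument in one place, which is the convenience the proposal asks for.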

Thu Jul 20 15:52:29 2006 LDS [...] cpan.org - Correspondence added

From:	LDS [...] cpan.org

I'm always scared about touching the unicode stuff because I don't fully understand it. However, give this version of CGI a try: when it detects that charset is set to "utf-8" (all lowercase, with the dash), it will automatically call decode() on the CGI values when it first parses them.

Lincoln

Download CGI.pm-3.21.tar.gz
application/x-gzip 224.2k


Thu Jul 20 15:53:44 2006 LDS [...] cpan.org - Given to LDS

Thu Jul 20 16:48:42 2006 LDS [...] cpan.org - Correspondence added

From:	LDS [...] cpan.org

Sorry, that version didn't work at all. Please try this version out. It runs decode() every time you call param(), so there is some overhead.

Lincoln

Download CGI.pm-3.21.tar.gz
application/x-tgz 224.3k


Wed Aug 22 03:45:34 2007 md [...] bts.sk - Correspondence added

Subject:	Re: [rt.cpan.org #19913]
Date:	Wed, 22 Aug 2007 06:52:59 +0200
To:	bug-CGI.pm [...] rt.cpan.org
From:	"Marian Ďurkovič" <md [...] bts.sk>

I'm afraid this change needs to be reverted - it causes serious problems. Please see #27104 and #24804

Wed Aug 22 05:12:47 2007 yves.lejeune [...] carestreamhealth.com - Correspondence added

CC:	yves.lejeune [...] kodak.com
Subject:	Re: [rt.cpan.org #19913]
Date:	Wed, 22 Aug 2007 11:01:59 +0200
To:	bug-CGI.pm [...] rt.cpan.org
From:	yves.lejeune [...] carestreamhealth.com

You are probably right.

Note that in my original bug report, my conclusion was: "I am not sure this would be considered as a bug by everybody, but it would be nice to have at least a note in the perldoc".

I think I now understand what was inconsistent in my analysis: CGI cannot guess the relationship between the form using UTF-8 parameters and the parameters of the resulting POST or GET request. CGI can only process the request parameters in the general case, i.e. as strings of bytes. The knowledge that the request parameters are UTF-8 strings can only reside in the application code.

At best there could be an optional feature, something like an option "all_request_parameters_are_utf8_strings", that would do the job in this specific context. Setting this option would, for instance, prevent handling the upload of binary files.

Best regards,
Yves Lejeune.
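The concern about binary data can be illustrated in isolation. The sketch below is an assumption-laden example, not a reproduction of the linked tickets: it takes the first eight bytes of a PNG file as a stand-in for an uploaded value and shows that a blanket UTF-8 decode does not round-trip.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# First eight bytes of a PNG file: binary data, not valid UTF-8.
my $upload = "\x89PNG\x0D\x0A\x1A\x0A";

# With the default CHECK value, decode() substitutes U+FFFD for
# malformed bytes instead of failing.
my $decoded   = decode('UTF-8', $upload);
my $roundtrip = encode('UTF-8', $decoded);

print $roundtrip eq $upload ? "upload intact\n" : "upload corrupted\n";
```

Because the substitution is silent, an application that decodes every parameter has no way to recover the original bytes afterwards.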

Wed Aug 22 12:03:02 2007 LDS [...] cpan.org - Correspondence added

From:	LDS [...] cpan.org

I have backed out the utf-8 decoding code and added a -utf8 argument that you can pass to "use CGI" at run time. Please try this version.
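A minimal sketch of the opt-in described above, assuming a CGI.pm release that supports the -utf8 import flag (3.30 or later, per this message) is installed. The query string passed to CGI->new() stands in for a real request and the parameter name is made up for illustration.

```perl
use strict;
use warnings;
use CGI qw(-utf8);    # opt in: param() returns decoded character strings

# A query string stands in for a real request; %C3%A9 is UTF-8 for U+00E9.
my $q    = CGI->new('name=%C3%A9');
my $name = $q->param('name');

printf "length=%d flagged=%d\n", length($name), utf8::is_utf8($name) ? 1 : 0;
```

Without the flag, the same call returns the two raw bytes and the application has to decode them itself.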

Download CGI.pm-3.30.tar.gz
application/binary 232.6k


Tue Sep 11 18:47:02 2007 mistbeul [...] web.de - Correspondence added

From:	mistbeul [...] web.de

On Wed Aug 22 12:03:02 2007, LDS wrote:

> I have backed out the utf-8 decoding code and added a -utf8 argument
> that you can pass to "use CGI" at run time. Please try this version.

Decoding in a mutator does not seem like a good idea. Client code might use param() to get at a param, thereby utf8-decoding it, process the returned value, and store it back using this same param() method. Then, elsewhere in the program, it might use param() again, thereby utf8-decoding once more what has already been decoded - and boom.

As a quick workaround, I'd suggest modifying line 455 in v3.30 so that decoding only happens if is_utf8($_) is false.

Best regards,
Bodo
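The double-decode scenario and the suggested guard can be sketched outside CGI.pm. The helper name decode_once is hypothetical; the example uses only Encode::decode() and Encode::is_utf8(), and the guard relies on Perl's internal UTF-8 flag, so it is a workaround rather than a general test for "already decoded".

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xC3\xA9";                          # UTF-8 bytes for U+00E9
my $once  = Encode::decode('UTF-8', $bytes);     # one character
my $twice = Encode::decode('UTF-8', $once);      # decoding the decoded string again

# Guarded variant: leave strings alone once they are flagged as characters.
sub decode_once {
    my ($val) = @_;
    return Encode::is_utf8($val) ? $val : Encode::decode('UTF-8', $val);
}
my $guarded = decode_once(decode_once($bytes));

print "double decode changed the value\n" if $twice ne $once;
print "guarded decode is stable\n"        if $guarded eq $once;
```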

Fri Jul 24 20:37:08 2009 MARKSTOS [...] cpan.org - Stolen from LDS

Fri Jul 24 20:40:54 2009 MARKSTOS [...] cpan.org - Correspondence added

On Tue Sep 11 18:47:02 2007, bobesch wrote:

> On Wed Aug 22 12:03:02 2007, LDS wrote:
>
> > I have backed out the utf-8 decoding code and added a -utf8 argument
> > that you can pass to "use CGI" at run time. Please try this version.
>
> Decoding in a mutator does not seem like a good idea. Client code might
> use param() to get at a param, thereby utf8-decoding, process the
> returned value and restore it using this same param() method. Then
> elsewhere in the program, it might use param() again, thereby once more
> utf8 decoding what already has been decoded - and boom.
>
> As a quick workaround, I'd suggest to modify line 455 in v3.30, so that
> decoding only happens if is_utf8($_) is false.

This ticket hasn't been commented on in a long time. Does the UTF-8 handling in 3.43 look OK to you?

Mark

Fri Jul 24 20:40:59 2009 MARKSTOS [...] cpan.org - Status changed from 'open' to 'stalled'

Thu Aug 27 17:49:04 2009 MSCHOUT [...] cpan.org - Correspondence added

Attached patch against 3.45 includes test case and fix for the "double UTF-8 decode()" scenario described in this ticket. Basically it skips the Encode::decode() call if Encode::is_utf8() for the value is true (it's already flagged UTF-8). Let me know if you have any questions.

diff --git a/lib/CGI.pm b/lib/CGI.pm
index cacb03a..1b90a6a 100644
--- a/lib/CGI.pm
+++ b/lib/CGI.pm
@@ -455,12 +455,23 @@ sub param {
 
     if ($PARAM_UTF8) {
       eval "require Encode; 1;" unless Encode->can('decode'); # bring in these functions
-      @result = map {ref $_ ? $_ : Encode::decode(utf8=>$_) } @result;
+      @result = map {ref $_ ? $_ : $self->_decode_utf8($_) } @result;
     }
 
     return wantarray ? @result : $result[0];
 }
 
+sub _decode_utf8 {
+    my ($self, $val) = @_;
+
+    if (Encode::is_utf8($val)) {
+        return $val;
+    }
+    else {
+        return Encode::decode(utf8 => $val);
+    }
+}
+
 sub self_or_default {
     return @_ if defined($_[0]) && (!ref($_[0])) &&($_[0] eq 'CGI');
     unless (defined($_[0]) &&
diff --git a/t/utf8.t b/t/utf8.t
new file mode 100644
index 0000000..8b5ad23
--- /dev/null
+++ b/t/utf8.t
@@ -0,0 +1,34 @@
+##!./perl -wT
+
+use strict;
+use utf8;
+use lib qw(t/lib);
+
+# Due to a bug in older versions of MakeMaker & Test::Harness, we must
+# ensure the blib's are in @INC, else we might use the core CGI.pm
+use lib qw(blib/lib blib/arch);
+
+use Test::More tests => 7;
+use Encode;
+
+use_ok( 'CGI' );
+
+ok( my $q = CGI->new(), 'create a new CGI object' );
+
+$CGI::PARAM_UTF8 = 1;
+
+my $data = 'áéíóúµ';
+ok Encode::is_utf8($data), "created UTF-8 encoded data string";
+
+# now set the param.
+$q->param(data => $data);
+
+# if param() runs the data through Encode::decode(), this will fail.
+is $q->param('data'), $data;
+
+# make sure setting bytes decodes properly
+my $bytes = Encode::encode(utf8 => $data);
+ok !Encode::is_utf8($bytes), "converted UTF-8 to bytes";
+$q->param(data => $bytes);
+is $q->param('data'), $data;
+ok Encode::is_utf8($q->param('data')), 'param() decoded UTF-8';

Thu Aug 27 17:49:05 2009 The RT System itself - Status changed from 'stalled' to 'open'

Thu Aug 27 22:02:06 2009 MARKSTOS [...] cpan.org - Correspondence added

Thanks, this is patched in my github repo now.

Thu Aug 27 22:02:08 2009 MARKSTOS [...] cpan.org - Status changed from 'open' to 'patched'

Wed Sep 09 22:06:13 2009 MARKSTOS [...] cpan.org - Correspondence added

Subject:	Thanks, released

The patch for this ticket has now been released in CGI.pm 3.47, and this ticket is considered resolved. Thanks again for your help to improve CGI.pm!

Mark

Wed Sep 09 22:06:14 2009 The RT System itself - Status changed from 'patched' to 'open'

Wed Sep 09 22:06:17 2009 MARKSTOS [...] cpan.org - Status changed from 'open' to 'resolved'

Fri May 23 14:28:40 2014 The RT System itself - Queue changed from CGI.pm to CGI
