Bug #46596 for libwww-perl: Encoding problems in POST requests

Mon Jun 01 13:30:26 2009 polettix [...] cpan.org - Ticket created

Subject:

Encoding problems in POST requests

While interacting with a page that had a form with UTF-8 encoded chars outside ISO-8859- 1, I bumped across an error in which HTTP::Message complained about not receiving bytes in content(). In the particular case, the problem stems from the fact that POST allows us to pass values that are character strings, not octet strings; this is the correct behaviour IMHO, but POST should Do The Right Thing with the stuff it receives. The patch I'm proposing addresses this problem, taking the charset from the content-type and defaulting to ISO-8859-1 like the standard says. In this way, any string for which is_utf8 returns true is transformed to a sequence of bytes according to the charset, which makes HTTP::Message happy. The patch does not break the test suite. I'm also adding another test for this added feature. Hope this helps, Flavio.

Subject:

libwww-perl-5.826.patch

diff -ru libwww-perl-5.826/lib/HTTP/Request/Common.pm patched/libwww-perl-5.826/lib/HTTP/Request/Common.pm --- libwww-perl-5.826/lib/HTTP/Request/Common.pm 2009-02-13 14:57:52.000000000 +0100 +++ patched/libwww-perl-5.826/lib/HTTP/Request/Common.pm 2009-06-01 18:36:22.000000000 +0200 @@ -118,6 +118,18 @@ $req; } +sub _get_charset { + my ($req) = @_; + + # Taken from HTTP::Message + require HTTP::Headers::Util; + if (my @ct = HTTP::Headers::Util::split_header_words($req->header("Content-Type"))) { + my ($ct, undef, %ct_param) = @{$ct[-1]}; + return $ct_param{charset}; + } + + return; +} sub form_data # RFC1867 { @@ -125,10 +137,17 @@ my @data = ref($data) eq "HASH" ? %$data : @$data; # copy my $fhparts; my @parts; + + my $charset = lc(_get_charset($req) || 'ISO-8859-1'); + my($k,$v); while (($k,$v) = splice(@data, 0, 2)) { if (!ref($v)) { $k =~ s/([\\\"])/\\$1/g; # escape quotes and backslashes + if (utf8::is_utf8($v)) { + require Encode; + $v = Encode::encode($charset, $v, Encode::FB_CROAK()); + } push(@parts, qq(Content-Disposition: form-data; name="$k"$CRLF$CRLF$v)); }

Subject:

form-charset.t

#!perl -w use strict; use Test qw(plan ok); plan tests => 4; use HTML::Form; use Encode qw( is_utf8 ); my $form = HTML::Form->parse(<<"EOT", base => "http://example.com", strict => 1); <form method="POST" enctype="multipart/form-data; charset=utf-8"> <input type="hidden" name="encoded" value="ä™®"> </form> EOT # We expect the form's parameter to be a string of characters ok(is_utf8($form->value('encoded'))); # This should survive in this case, because UTF-8 encoding applies my $request = eval { $form->click() }; ok($request); # This form has wrong settings $form = HTML::Form->parse(<<"EOT", base => "http://example.com", strict => 1); <form method="POST" enctype="multipart/form-data; charset="ISO-8859-1"> <input type="hidden" name="encoded" value="ä™®"> </form> EOT # We expect the form's parameter to be a string of characters ok(is_utf8($form->value('encoded'))); # This should die in this case, because ISO-8859-1 encoding DOES NOT APPLY $request = eval { $form->click() }; ok(!$request);

Mon Jun 01 17:10:14 2009 GAAS [...] cpan.org - Correspondence added

You are right that something ought to be done :) I think your approach has problems. First off I don't think the "multipart/form-data" content type takes a "charset" parameter. I think that param is only used for the "text/*" types. Secondly I would hate to make the code switch behaviour based on the utf8::is_utf8() setting of the values passed. Third; the approach taken should also work for the "application/x-www- form-urlencoded" case so we end up with consistent behaviour. Unfortunately I don't have any coherent interface suggestion to make at this point, except perhaps telling users to pass in bytes (encoded strings). I'm open for suggestions; at this point more in the form of doc patches to explain the user interface we want.

Mon Jun 01 17:10:20 2009 The RT System itself - Status changed from 'new' to 'open'

Wed Jun 03 03:43:24 2009 polettix [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #46596] Encoding problems in POST requests
Date:	Wed, 3 Jun 2009 09:42:42 +0200
To:	bug-libwww-perl [...] rt.cpan.org
From:	Flavio Poletti <polettix [...] gmail.com>

I see all your points, except maybe the second about switching behaviour based on the utf8::is_utf8() (something similar already happens behind the scenes when you call utf8::downgrade() in HTTP::Message). I also agree upon the fact that the library could be a little less "helpful" and document clearly that bytes should be passed in. At this point, anyway, I think that something should be done somewhere in the parsing process to avoid stuff being promoted to character strings instead of octet strings. To shut up and show the code, I think that the following test should apply: #!perl -w use strict; use Test qw(plan ok); plan tests => 2; use HTML::Form; use Encode qw( is_utf8 ); my $form = HTML::Form->parse(<<"EOT", base => "http://example.com", strict => 1); <form method="POST" enctype="multipart/form-data"> <input type="hidden" name="encoded" value="ä™®"> </form> EOT # We expect the form's parameter to be a string of octets ok(! is_utf8($form->value('encoded'))); # This should always survive because we didn't add anything to $form my $request = eval { $form->click() }; ok($request); # ------ cut here -------- Otherwise, the end user would end up with an "automatic" behaviour when receiving data, but would be forced to do something (i.e. ensure that all fields are octets) when sending them. This leads to confusion for parameters that come from the parsing process and are not directly set by the user. Well... this is exactly the problem I had! Cheers, Flavio. On Mon, Jun 1, 2009 at 11:10 PM, Gisle_Aas via RT < bug-libwww-perl@rt.cpan.org> wrote: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=46596 > > > You are right that something ought to be done :) > > I think your approach has problems. First off I don't think the > "multipart/form-data" content > type takes a "charset" parameter. I think that param is only used for the > "text/*" types. > Secondly I would hate to make the code switch behaviour based on the > utf8::is_utf8() setting of > the values passed. Third; the approach taken should also work for the > "application/x-www- > form-urlencoded" case so we end up with consistent behaviour. > > Unfortunately I don't have any coherent interface suggestion to make at > this point, except > perhaps telling users to pass in bytes (encoded strings). I'm open for > suggestions; at this point > more in the form of doc patches to explain the user interface we want. >

Wed Jun 03 17:40:26 2009 GAAS [...] cpan.org - Correspondence added

I would think that this patch would help <http://github.com/gisle/libwww- perl/commit/ff583c4b194eb7437c71f6fb659ae03b9bffce70>

Thu Jun 04 02:20:35 2009 polettix [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #46596] Encoding problems in POST requests
Date:	Thu, 4 Jun 2009 08:19:58 +0200
To:	bug-libwww-perl [...] rt.cpan.org
From:	Flavio Poletti <polettix [...] gmail.com>

Hi, the accept-charset seems the way to go (surely better than my home-baked "charset" solution!) but it should be enhanced IMHO. The "UNKNOWN" case defaults to UTF-8, and the standard<http://www.w3.org/TR/html401/interact/forms.html>says: The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element. Hence, the best default should come from the document's encoding. If this is missing as well, it should probably default to ISO-8859-1 (like the document). Anyway yes... I know what "may" means, but if the UTF-8 path is chosen it would probably be wise to patch the docs as well to warn about it. Thanks, Flavio. On Wed, Jun 3, 2009 at 11:40 PM, Gisle_Aas via RT < bug-libwww-perl@rt.cpan.org> wrote: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=46596 > > > I would think that this patch would help <http://github.com/gisle/libwww- > perl/commit/ff583c4b194eb7437c71f6fb659ae03b9bffce70> >

Mon Jun 15 08:51:32 2009 GAAS [...] cpan.org - Correspondence added

I think this problem is basically fixed in the content_charset branch. Still need a bit more testing and pondering before I merge this into the next release. [1] http://github.com/gisle/libwww-perl/commits/content_charset

Mon Jun 15 15:38:42 2009 GAAS [...] cpan.org - Correspondence added

Just uploaded libwww-perl-5.827 with these fixes.

Mon Jun 15 15:38:43 2009 GAAS [...] cpan.org - Status changed from 'open' to 'resolved'