Subject: should be _utf8_off -ed raw data before URI encoding
Date: Thu, 5 Mar 2009 10:17:58 +0800
To: bug-URI [...] rt.cpan.org
From: msmouse <msmouse [...] gmail.com>
Hi,
I'm reporting a problem with the latest URI / URI::Escape suite
(URI-1.37, URI::Escape-3.29, Windows Vista SP1 x86, Strawberry Perl 5.10.0.3
built for MSWin32-x86-multi-thread).
I was trying to POST a GBK-encoded string via WWW::Mechanize and got a wrong
result. I discovered that it was caused by joining utf8-flagged and
non-utf8-flagged strings: /(\C)/g matching returns the wrong result on such a
joined string. I don't know what (Mech or LWP?) turned on the utf8 flag of the
form key, which is plain ASCII, but afterwards the key was joined with
my non-utf8 value string. URI::Escape::escape_char uses /(\C)/g to get the raw
bytes and so got the wrong result.
To make the problem clear:
use strict;
use warnings;
use Encode;
use utf8;

my $s1 = decode('utf8', 's1');
my $s2 = encode('gbk','公司');
my $s3 = "$s1+$s2";

print_is_utf8($s1);
print_is_utf8($s2);
print_is_utf8($s3);


print_str($s1);
print_str($s2);
print_str($s3);

sub print_str {
    my $str = shift;
    print "$str: ";
    print unpack('H*', $str) . '=' . join('+', map {unpack('H*', $_)} ($str=~/(\C)/g)) . "\n";
}

sub print_is_utf8 {
    my $str = shift;
    print +(Encode::is_utf8($str)?"y":"n"), "\n";
}
Test result:
y
n
y
s1: 7331=73+31
公司: b9abcbbe=b9+ab+cb+be
s1+公司: 73312bb9abcbbe=73+31+2b+c2+b9+c2+ab+c3+8b+c2+be <-- wrong result
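The doubled bytes in the last line appear because joining a utf8-flagged string with an unflagged one upgrades the unflagged bytes as if they were Latin-1 characters, and \C then walks the upgraded internal buffer. Here is a minimal sketch of that effect. The helper names logical_hex and internal_hex are hypothetical and exist only for this illustration; the sketch avoids \C so that it also runs on newer perls where \C has been removed.

```perl
use strict;
use warnings;
use Encode ();

# Hex of the string as Perl sees it, one character at a time.
sub logical_hex {
    my ($str) = @_;
    return join('', map { sprintf('%02x', ord($_)) } split(//, $str));
}

# Hex of the internal buffer: clear the utf8 flag on a copy, then dump its bytes.
sub internal_hex {
    my ($copy) = @_;
    Encode::_utf8_off($copy);
    return unpack('H*', $copy);
}

my $key    = Encode::decode('utf8', 's1');  # ASCII text, utf8 flag on
my $value  = "\xb9\xab\xcb\xbe";            # the GBK bytes from the report, flag off
my $joined = "$key+$value";                 # the join upgrades $value as Latin-1

my $flag = Encode::is_utf8($joined) ? 'y' : 'n';
print "flag:     $flag\n";
print "logical:  ", logical_hex($joined), "\n";   # 73312bb9abcbbe
print "internal: ", internal_hex($joined), "\n";  # 73312bc2b9c2abc38bc2be
```

The logical content is still the seven original bytes; only the internal representation has grown, which is what /(\C)/g exposes.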
To solve the problem, URI could force the utf8 flag off for all the keys and
values before escaping
(in sub query_form() in _query.pm, around line 40):
$delim = pop if @_ % 2;
use Encode;                        # <-- my fix
map {Encode::_utf8_off($_)} @_;    # <-- my fix
my @query;
while (my($key,$vals) = splice(@_, 0, 2)) {
    $key = '' unless defined $key;
    $key =~ s/([;\/?:@&=+,\$\[\]%])/URI::Escape::escape_char($1)/eg;
    $key =~ s/ /+/g;
    $vals = [ref($vals) eq "ARRAY" ? @$vals : $vals];
I don't know whether this is a bug in core Perl or intended behavior. Either
way, I think URI should treat the data to be escaped as raw bytes, so maybe you
can accept my fix.
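As a rough sketch of what "treat the data as raw" could look like in isolation: this is a variation for illustration, not the patch above and not URI's actual code. It uses utf8::downgrade instead of _utf8_off, which also recovers the original bytes from a string that has already been joined and upgraded. The helper names as_raw_bytes and escape_bytes are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return a byte-string copy of the argument.
sub as_raw_bytes {
    my ($str) = @_;
    # With a true second argument, utf8::downgrade returns false instead of
    # dying when the string holds a character above 0xFF.
    utf8::downgrade($str, 1)
        or die "wide character in data to be escaped\n";
    return $str;
}

# Hypothetical helper: percent-encode every byte outside the unreserved set.
sub escape_bytes {
    my $bytes = as_raw_bytes(shift);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $key    = Encode::decode('utf8', 's1');  # ASCII text, utf8 flag on
my $value  = "\xb9\xab\xcb\xbe";            # GBK bytes, flag off
my $joined = "$key=$value";                 # upgraded by the join

print escape_bytes($joined), "\n";          # s1%3D%B9%AB%CB%BE
```

Dying on characters above 0xFF is a design choice made for this sketch: such data has no single raw-byte form, so the caller has to pick an encoding first.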
Thank you!
msmouse
----------------------------------
msmouse@ir.hit.edu.cn
msmouse@gmail.com