Bug #43859 for URI: should be _utf8_off -ed raw data before URI encoding

Wed Mar 04 21:18:16 2009 msmouse [...] gmail.com - Ticket created

Subject:	should be _utf8_off -ed raw data before URI encoding
Date:	Thu, 5 Mar 2009 10:17:58 +0800
To:	bug-URI [...] rt.cpan.org
From:	msmouse <msmouse [...] gmail.com>

Hi,

I'm here to report a problem with the latest URI / URI::Escape suite (URI-1.37, URI::Escape-3.29, Windows Vista SP1 x86, Strawberry Perl 5.10.0.3 built for MSWin32-x86-multi-thread).

I was trying to POST a string in GBK encoding via WWW::Mechanize and got the wrong result. I discovered that it was caused by joining UTF8-flagged and non-UTF8-flagged strings: matching /(\C)/g against such a joined string returns the wrong bytes. I don't know what (Mech or LWP?) turned on the UTF8 flag of the form key, which is plain ASCII, but afterwards the key was joined with my non-UTF8 value string. URI::Escape::escape_char uses /(\C)/g to get the raw bytes, so it gets the wrong result.

To make the problem clear:

    use strict;
    use warnings;
    use Encode;
    use utf8;

    my $s1 = decode('utf8', 's1');     # ASCII text, UTF8 flag on
    my $s2 = encode('gbk', '公司');    # GBK bytes, UTF8 flag off
    my $s3 = "$s1+$s2";                # joined string, UTF8 flag on

    print_is_utf8($s1);
    print_is_utf8($s2);
    print_is_utf8($s3);

    print_str($s1);
    print_str($s2);
    print_str($s3);

    # Print the string, its hex dump, and the bytes seen by /(\C)/g.
    sub print_str {
        my $str = shift;
        print "$str: ";
        print unpack('H*', $str) . '=' . join('+', map { unpack('H*', $_) } ($str =~ /(\C)/g)) . "\n";
    }

    # Print "y" if the UTF8 flag is on, "n" otherwise.
    sub print_is_utf8 {
        my $str = shift;
        print +(Encode::is_utf8($str) ? "y" : "n"), "\n";
    }

Test result:

    y
    n
    y
    s1: 7331=73+31
    公司: b9abcbbe=b9+ab+cb+be
    s1+公司: 73312bb9abcbbe=73+31+2b+c2+b9+c2+ab+c3+8b+c2+be    <-- wrong result

To solve the problem, URI could force the UTF8 flag off for all the keys and values before escaping (in sub query_form(), _query.pm, around line 40):

    $delim = pop if @_ % 2;
    use Encode;                        # <-- my fix
    map {Encode::_utf8_off($_)} @_;    # <-- my fix
    my @query;
    while (my($key,$vals) = splice(@_, 0, 2)) {
        $key = '' unless defined $key;
        $key =~ s/([;\/?:@&=+,\$\[\]%])/ URI::Escape::escape_char($1)/eg;
        $key =~ s/ /+/g;
        $vals = [ref($vals) eq "ARRAY" ? @$vals : $vals];

I don't know whether this is a bug in the Perl core or intended behavior. However, I think URI should treat the data to be escaped as raw bytes, so maybe you can accept my fix.

Thank you!
msmouse
----------------------------------
msmouse@ir.hit.edu.cn
msmouse@gmail.com
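
The report's test output shows the mechanism: joining a UTF8-flagged string with a byte string upgrades the bytes as if they were Latin-1, so every octet above 0x7F becomes two octets internally (b9 becomes c2 b9). A caller-side workaround can be sketched with core modules only. This is a minimal sketch, not the fix applied to URI: escape_octets is a hypothetical helper (it is not part of URI::Escape), and it assumes every code point in its input is below 0x100, which holds for the joined string in the report.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, not part of URI::Escape: percent-encode every octet
# of a string that is meant to be raw bytes.
sub escape_octets {
    my ($str) = @_;
    # Undo any silent Latin-1 -> UTF-8 upgrade of the copy.
    # utf8::downgrade dies if the string holds a code point above 0xFF.
    utf8::downgrade($str);
    return join '', map { sprintf '%%%02X', ord } split //, $str;
}

my $key    = decode('UTF-8', 's1');               # character string
my $value  = encode('gbk', "\x{516C}\x{53F8}");   # GBK octets for the two characters in the report
my $joined = "$key+$value";                       # the join upgrades the GBK octets internally

my $escaped = escape_octets($joined);
print "$escaped\n";    # %73%31%2B%B9%AB%CB%BE
```

Downgrading before the per-octet step means the helper fails loudly on genuinely wide characters instead of escaping Perl's internal UTF-8 representation.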

Mon May 30 04:42:02 2011 NANTO [...] cpan.org - Correspondence added

I agree with this ticket. For that matter, I think a URI (as defined in RFC 3986) should always be octets. A URI itself carries no information about character encodings. Ideally, percent-encoded parts of a URI can be decoded with an arbitrary character encoding. Therefore a percent-decoded value should be octets.

But the URI module accepts internationalized domain names, contrary to RFC 3986, which makes this problem more complex. Consider the URI "http://日本語.jp/?q=%E5%AD%97" ("日本語" is U+65E5 U+672C U+8A9E).

    use URI;
    use URI::QueryParam;
    use Encode qw(encode_utf8);

    my $u = URI->new("http://\x{65E5}\x{672C}\x{8A9E}.jp/?q=%E5%AD%97");
    $u->host;              # "xn--wgv71a119e.jp"
    $u->query_param('q');  # "\x{e5}\x{ad}\x{97}" (UTF8 flag is on)

The query_param method returns a UTF8-flagged string "\x{e5}\x{ad}\x{97}", not octets. I can get octets when I pass octets to URI->new, but then the host method returns a different value from the character-string case.

    my $u = URI->new(encode_utf8("http://\x{65E5}\x{672C}\x{8A9E}.jp/?q=%E5%AD%97"));
    $u->host;              # "xn--xakg0a0al61bcu.jp"
    $u->query_param('q');  # "\xe5\xad\x97" (UTF8 flag is off)

I think in both cases the host method should return "xn--wgv71a119e.jp" and the query_param method should return the octets "\xe5\xad\x97".

I wrote a patch to fix this problem.

    https://github.com/nanto/uri/compare/master...coerce_octets

With this patch:

1. Arguments of all methods (except IRI-related methods) are treated as octets. As a result, return values also become octets. UTF8-flagged strings are treated as octets encoded with UTF-8.

2. In a host part, octets are treated as UTF-8 encoded octets. They are decoded with UTF-8 and then encoded in Punycode. This special treatment is for programs that already use IDN with the URI module.

3. The URI module loses backward compatibility with these changes. In particular, octets in a host part are no longer treated as a Latin-1 string. For users who want compatibility, a $URI::COERCE_OCTETS variable, whose default value is true, is introduced. Setting this variable to false makes the URI module behave the same as version 1.58.

Thank you.

-- 
nanto_vi (TOYAMA Nao)
nanto@moon.email.ne.jp
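
The rule in point 1 of the patch description (a UTF8-flagged argument is treated as the octets of its UTF-8 encoding, while a byte string passes through unchanged) can be sketched as follows. coerce_octets here is a hypothetical helper written for illustration only; it is not taken from the patch, whose implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not the code from the patch: return the argument as octets.
sub coerce_octets {
    my ($str) = @_;
    return $str unless defined $str;
    # A UTF8-flagged (character) string is replaced by its UTF-8 encoding;
    # utf8::encode works in place on the copy and clears the flag.
    utf8::encode($str) if utf8::is_utf8($str);
    return $str;
}

my $from_chars  = coerce_octets("\x{5B57}");       # character string holding U+5B57
my $from_octets = coerce_octets("\xE5\xAD\x97");   # already UTF-8 octets

printf "%s %s\n", unpack('H*', $from_chars), unpack('H*', $from_octets);
# e5ad97 e5ad97
```

Both calls yield the same three octets, matching the "%E5%AD%97" query value used in the example URI.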

Mon May 30 04:42:03 2011 The RT System itself - Status changed from 'new' to 'open'