Bug #34769 for URI: Escape.pm encodes to RFC 2396/2732, but 3986 is most current

Tue Apr 08 12:40:00 2008 cpan [...] 2xlp.com - Ticket created

Subject:

Escape.pm encodes to RFC 2396/2732, but 3986 is most current

2732:
  unreserved [A-Za-z0-9\-\._~!*'()]
  reserved   ;/?:@&=+$,[]

3986:
  unreserved: [A-Za-z0-9\-\._~]
  reserved:
    gen-delims :/?#[]@
    sub-delims !$&'()*+,;=

The fix would be to make *'() reserved.

Since 2732 is still popular, I would suggest using either a package/global var or a keyword to let people choose a default behavior/override, but conform to 3986 otherwise.
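The difference between the two unreserved sets listed above can be sketched in plain Perl. This is only an illustration: the `escape_with` helper and the `%unsafe` table are hypothetical names, not part of URI::Escape, and the helper assumes byte strings (characters above 0xFF are not handled).

```perl
use strict;
use warnings;

# Characters to escape: everything outside each RFC's unreserved set,
# using the sets as listed in this ticket.
my %unsafe = (
    RFC2732 => qr/[^A-Za-z0-9\-._~!*'()]/,
    RFC3986 => qr/[^A-Za-z0-9\-._~]/,
);

# Hypothetical helper: percent-encode every character matching $unsafe.
# Assumes $text holds only characters in the range 0x00-0xFF.
sub escape_with {
    my ($text, $unsafe) = @_;
    $text =~ s/($unsafe)/sprintf("%%%02X", ord $1)/ge;
    return $text;
}

print escape_with("a*b'(c)!", $unsafe{RFC2732}), "\n";   # a*b'(c)!
print escape_with("a*b'(c)!", $unsafe{RFC3986}), "\n";   # a%2Ab%27%28c%29%21
```

Under the 3986 set, `!` is escaped along with `*'()`, since all five are sub-delims there.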

Mon Mar 16 20:43:55 2009 mschwern [...] cpan.org - Correspondence added

URI::Find needs this in order to round trip URLs with Unicode in them such as http://➡.ws/䯡

Even better would be a way to turn escaping off entirely or a method to get an unescaped version.

Mon Mar 16 20:43:56 2009 The RT System itself - Status changed from 'new' to 'open'

Tue Mar 17 01:55:13 2009 GAAS [...] cpan.org - Correspondence added

On Mon Mar 16 20:43:55 2009, MSCHWERN wrote:

> URI::Find needs this in order to round trip URLs with Unicode in them
> such as http://➡.ws/䯡

How does making *'() reserved affect this Unicode stuff?

> Even better would be a way to turn escaping off entirely or a method to
> get an unescaped version.

Please expand. I don't understand what you are suggesting here.

Wed Aug 05 22:01:20 2009 ddascalescu+perl [...] gmail.com - Correspondence added

On Tue Mar 17 01:55:13 2009, GAAS wrote:

> On Mon Mar 16 20:43:55 2009, MSCHWERN wrote:
> > URI::Find needs this in order to round trip URLs with Unicode in them
> > such as http://➡.ws/䯡
>
> How does making *'() reserved affect this Unicode stuff?
>
> > Even better would be a way to turn escaping off entirely or a method to
> > get an unescaped version.
>
> Please expand. I don't understand what you are suggesting here.

Any updates on this?

Thu Jan 28 05:43:12 2010 mschwern [...] cpan.org - Correspondence added

On Tue Mar 17 01:55:13 2009, GAAS wrote:

> On Mon Mar 16 20:43:55 2009, MSCHWERN wrote:
> > URI::Find needs this in order to round trip URLs with Unicode in them
> > such as http://➡.ws/䯡
>
> How does making *'() reserved affect this Unicode stuff?
>
> > Even better would be a way to turn escaping off entirely or a method to
> > get an unescaped version.
>
> Please expand. I don't understand what you are suggesting here.

IIRC I was referring to being able to control the set of escaped characters. In this case, turning them all off so I could get URI to stop escaping more than I'd like.

Thu Jan 28 11:13:33 2010 IKEGAMI [...] cpan.org - Correspondence added

This is a dup of RT#21640

Thu Jan 28 16:56:45 2010 mschwern [...] cpan.org - Correspondence added

The attached patches update URI::Escape to RFC 3986.

The first patch rewrites t/escape.t to use Test::More. This was just some housecleaning to do before updating the test.

The second patch updates the docs to refer to RFC 3986 and updates the default regex to the RFC 3986 standard.

You'll note I left in the 2732 regex. It's not used right now, but I can imagine a scenario where URI::Escape lets the user choose which escape to use. A flexible set of options can be added to uri_escape() in place of the second argument. For example...

    uri_escape( $uri, { rfc => 2732 } );

Would be equivalent to:

    uri_escape( $uri, "^A-Za-z0-9\-_.!~*'()" );

Except the user won't get it wrong. The current manual escape syntax can be deprecated and replaced with:

    uri_escape( $uri, { escape => qr/[^...]/ } );

Or even...

    uri_escape( $uri, { escape => ["A".."Z", "a".."z", ...] } );

uri_escape() would check if the second argument is a hash ref to determine how to interpret it.

I can't see a use case for users preserving RFC 2732 semantics except maybe to support fiddly old tests that they don't want to change, so I'm not going to bother patching it.

It's worth noting that the URI::Escape patch did not affect the URI tests at all.
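The hash-ref dispatch proposed above could look roughly like the following. This is a sketch under stated assumptions, not URI::Escape's API: `escape_opts` and `%UNSAFE` are hypothetical names, only the `rfc` and `escape => qr/.../` forms are covered (the array form is elided in the proposal, so it is left out here), and byte strings are assumed.

```perl
use strict;
use warnings;

# Hypothetical named sets of characters to escape, keyed by RFC number.
my %UNSAFE = (
    2732 => qr/[^A-Za-z0-9\-_.!~*'()]/,
    3986 => qr/[^A-Za-z0-9\-._~]/,
);

# Hypothetical wrapper: interpret the second argument by its type.
sub escape_opts {
    my ($text, $arg) = @_;
    return unless defined $text;

    my $unsafe;
    if (ref $arg eq 'HASH') {
        if (defined $arg->{rfc}) {
            # Named set, e.g. { rfc => 2732 }
            $unsafe = $UNSAFE{ $arg->{rfc} }
                or die "Unknown RFC: $arg->{rfc}\n";
        }
        elsif (ref $arg->{escape} eq 'Regexp') {
            # Caller-supplied pattern matching the characters to escape
            $unsafe = $arg->{escape};
        }
    }
    elsif (defined $arg) {
        # Legacy string form: the body of a character class
        $unsafe = qr/[$arg]/;
    }
    $unsafe ||= $UNSAFE{3986};    # default when nothing usable was given

    # Assumes $text holds only characters in the range 0x00-0xFF.
    $text =~ s/($unsafe)/sprintf("%%%02X", ord $1)/ge;
    return $text;
}

print escape_opts("a*b", { rfc => 2732 }), "\n";           # a*b
print escape_opts("a*b", { rfc => 3986 }), "\n";           # a%2Ab
print escape_opts("a b", { escape => qr/[ ]/ }), "\n";     # a%20b
print escape_opts("abc", "b-d"), "\n";                     # a%62%63
```

Checking `ref` first keeps the existing string form working unchanged, which is what would let the old syntax be deprecated gradually.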

Subject:

0001-Rewrite-the-URI-Escape-tests-with-Test-More.patch

From c17e39a26bb4a2a7ea715ca75470e85111808d54 Mon Sep 17 00:00:00 2001
From: Michael G. Schwern <schwern@pobox.com>
Date: Thu, 28 Jan 2010 13:41:57 -0800
Subject: [PATCH 1/2] Rewrite the URI::Escape tests with Test::More

---
 t/escape.t |   43 +++++++++++++++++--------------------------
 1 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/t/escape.t b/t/escape.t
index daebd9d..46da877 100644
--- a/t/escape.t
+++ b/t/escape.t
@@ -1,48 +1,39 @@
 #!perl -w

-print "1..9\n";
+use strict;
+use warnings;
+
+use Test::More tests => 10;

 use URI::Escape;

-print "not " unless uri_escape("|abcå") eq "%7Cabc%E5";
-print "ok 1\n";
+is uri_escape("|abcå"), "%7Cabc%E5";

-print "not " unless uri_escape("abc", "b-d") eq "a%62%63";
-print "ok 2\n";
+is uri_escape("abc", "b-d"), "a%62%63";

-print "not " if defined(uri_escape(undef));
-print "ok 3\n";
+is uri_escape(undef), undef;

-print "not " unless uri_unescape("%7Cabc%e5") eq "|abcå";
-print "ok 4\n";
+is uri_unescape("%7Cabc%e5"), "|abcå";

-print "not " unless join(":", uri_unescape("%40A%42", "CDE", "F%47H")) eq
-    '@AB:CDE:FGH';
-print "ok 5\n";
+is_deeply [uri_unescape("%40A%42", "CDE", "F%47H")], [qw(@AB CDE FGH)];


 use URI::Escape qw(%escapes);

-print "not" unless $escapes{"%"} eq "%25";
-print "ok 6\n";
+is $escapes{"%"}, "%25";


 use URI::Escape qw(uri_escape_utf8);

-print "not " unless uri_escape_utf8("|abcå") eq "%7Cabc%C3%A5";
-print "ok 7\n";
+is uri_escape_utf8("|abcå"), "%7Cabc%C3%A5";

-if ($] < 5.008) {
-    print "ok 8 # skip perl-5.8 required\n";
-    print "ok 9 # skip perl-5.8 required\n";
-}
-else {
-    eval { print uri_escape("abc" . chr(300)) };
-    print "not " unless $@ && $@ =~ /^Can\'t escape \\x{012C}, try uri_escape_utf8 instead/;
-    print "ok 8\n";
+SKIP: {
+    skip "Perl 5.8.0 or higher required", 3 if $] < 5.008;
+
+    ok !eval { print uri_escape("abc" . chr(300)); 1 };
+    like $@, qr/^Can\'t escape \\x{012C}, try uri_escape_utf8 instead/;

-    print "not " unless uri_escape_utf8(chr(0xFFF)) eq "%E0%BF%BF";
-    print "ok 9\n";
+    is uri_escape_utf8(chr(0xFFF)), "%E0%BF%BF";
 }


-- 
1.6.6.1

Subject:

0002-Update-URI-Escape-for-RFC-3986.patch

From 24f4d462614e303bccb64675bfefe01fbd9ffc40 Mon Sep 17 00:00:00 2001
From: Michael G. Schwern <schwern@pobox.com>
Date: Thu, 28 Jan 2010 13:45:50 -0800
Subject: [PATCH 2/2] Update URI::Escape for RFC 3986

---
 URI/Escape.pm |   43 ++++++++++++++++++++++++-------------------
 t/escape.t    |    7 ++++---
 2 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/URI/Escape.pm b/URI/Escape.pm
index c2da23b..4543212 100644
--- a/URI/Escape.pm
+++ b/URI/Escape.pm
@@ -15,26 +15,27 @@ URI::Escape - Escape and unescape unsafe characters
 =head1 DESCRIPTION

 This module provides functions to escape and unescape URI strings as
-defined by RFC 2396 (and updated by RFC 2732).
-A URI consists of a restricted set of characters,
-denoted as C<uric> in RFC 2396. The restricted set of characters
-consists of digits, letters, and a few graphic symbols chosen from
-those common to most of the character encodings and input facilities
-available to Internet users:
+defined by RFC 3986.

-  "A" .. "Z", "a" .. "z", "0" .. "9",
-  ";", "/", "?", ":", "@", "&", "=", "+", "$", ",", "[", "]",   # reserved
-  "-", "_", ".", "!", "~", "*", "'", "(", ")"
+A URI consists of a restricted set of characters. The restricted set
+of characters consists of digits, letters, and a few graphic symbols
+chosen from those common to most of the character encodings and input
+facilities available to Internet users. They are made up of the
+"unreserved" and "reserved" character sets as defined in RFC 3986.
+
+   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
+   reserved      = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+                   "!" / "$" / "&" / "'" / "(" / ")"
+                 / "*" / "+" / "," / ";" / "="

 In addition, any byte (octet) can be represented in a URI by an escape
 sequence: a triplet consisting of the character "%" followed by two
 hexadecimal digits. A byte can also be represented directly by a
-character, using the US-ASCII character for that octet (iff the
-character is part of C<uric>).
+character, using the US-ASCII character for that octet.

-Some of the C<uric> characters are I<reserved> for use as delimiters
-or as part of certain URI components. These must be escaped if they are
-to be treated as ordinary data. Read RFC 2396 for further details.
+Some of the characters are I<reserved> for use as delimiters or as
+part of certain URI components. These must be escaped if they are to
+be treated as ordinary data. Read RFC 3986 for further details.

 The functions provided (and exported by default) from this module are:

@@ -61,10 +62,10 @@ character class (between [ ]). E.g.:
   "^A-Za-z"     # everything not a letter

 The default set of characters to be escaped is all those which are
-I<not> part of the C<uric> character class shown above as well as the
-reserved characters. I.e. the default is:
+I<not> part of the C<unreserved> character class shown above as well
+as the reserved characters. I.e. the default is:

-  "^A-Za-z0-9\-_.!~*'()"
+    "^A-Za-z0-9\-\._~"

 =item uri_escape_utf8( $string )

@@ -156,6 +157,11 @@ for (0..255) {

 my %subst;  # compiled patternes

+my %Unsafe = (
+    RFC2732 => qr/[^A-Za-z0-9\-_.!~*'()]/,
+    RFC3986 => qr/[^A-Za-z0-9\-\._~"]/,
+);
+
 sub uri_escape
 {
     my($text, $patn) = @_;
@@ -169,8 +175,7 @@ sub uri_escape
         }
         &{$subst{$patn}}($text);
     } else {
-        # Default unsafe characters. RFC 2732 ^(uric - reserved)
-        $text =~ s/([^A-Za-z0-9\-_.!~*'()])/$escapes{$1} || _fail_hi($1)/ge;
+        $text =~ s/($Unsafe{RFC3986})/$escapes{$1} || _fail_hi($1)/ge;
     }
     $text;
 }
diff --git a/t/escape.t b/t/escape.t
index 46da877..7867160 100644
--- a/t/escape.t
+++ b/t/escape.t
@@ -3,7 +3,7 @@
 use strict;
 use warnings;

-use Test::More tests => 10;
+use Test::More tests => 11;

 use URI::Escape;

@@ -11,6 +11,9 @@ is uri_escape("|abc

 is uri_escape("abc", "b-d"), "a%62%63";

+# New escapes in RFC 3986
+is uri_escape("~*'()"), "~%2A%27%28%29";
+
 is uri_escape(undef), undef;

 is uri_unescape("%7Cabc%e5"), "|abcå";
@@ -35,5 +38,3 @@ SKIP: {

     is uri_escape_utf8(chr(0xFFF)), "%E0%BF%BF";
 }
-
-
-- 
1.6.6.1
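One detail worth checking in the patch as quoted here: the RFC3986 entry of %Unsafe has a double quote inside the negated character class, whereas the documented default ("^A-Za-z0-9\-\._~") does not. A minimal sketch of what that difference would mean, assuming the pattern is exactly as shown in this ticket (`escape_bytes` is a hypothetical helper, and byte strings are assumed):

```perl
use strict;
use warnings;

# Character class copied from the %Unsafe entry as quoted in this ticket.
my $as_quoted = qr/[^A-Za-z0-9\-\._~"]/;

# RFC 3986 unreserved set only, matching the documented default.
my $unreserved_only = qr/[^A-Za-z0-9\-\._~]/;

# Hypothetical helper: percent-encode every character matching $unsafe.
sub escape_bytes {
    my ($text, $unsafe) = @_;
    $text =~ s/($unsafe)/sprintf("%%%02X", ord $1)/ge;
    return $text;
}

print escape_bytes(q{say "hi"}, $as_quoted), "\n";         # say%20"hi"
print escape_bytes(q{say "hi"}, $unreserved_only), "\n";   # say%20%22hi%22
```

With the class as quoted, a double quote would pass through unescaped, even though it is not in the RFC 3986 unreserved set.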

Fri Jan 29 18:01:56 2010 GAAS [...] cpan.org - Correspondence added

Thanks Michael! Your patches have been applied as:

a3a2e2c Update URI::Escape for RFC 3986
cc44daf Rewrite the URI::Escape tests with Test::More

Fri Jan 29 18:01:57 2010 GAAS [...] cpan.org - Status changed from 'open' to 'resolved'