The attached patches update URI::Escape to RFC 3986.
The first patch rewrites t/escape.t to use Test::More. This was just
some housecleaning to do before updating the test.
The second patch updates the docs to refer to RFC 3986 and changes the
default escape regex to match the RFC 3986 "unreserved" set.
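To make the change in the default concrete, here is a minimal standalone sketch that compares the two character classes without loading URI::Escape. The names %unsafe and escape_with() are made up for illustration and are not part of the module, and the sketch assumes byte input (characters below 256).

```perl
use strict;
use warnings;

# Each regex matches the characters that get escaped by default under
# that RFC. Hash and helper names are hypothetical, for illustration only.
my %unsafe = (
    RFC2732 => qr/[^A-Za-z0-9\-_.!~*'()]/,
    RFC3986 => qr/[^A-Za-z0-9\-._~]/,
);

# Percent-encode every character matching the chosen class.
# Assumes all input characters are below 256.
sub escape_with {
    my ($text, $rfc) = @_;
    my $re = $unsafe{$rfc} or die "Unknown set: $rfc\n";
    $text =~ s/($re)/sprintf "%%%02X", ord $1/ge;
    return $text;
}

print escape_with("~*'()", "RFC2732"), "\n";   # ~*'()
print escape_with("~*'()", "RFC3986"), "\n";   # ~%2A%27%28%29
```

Under the RFC 3986 set only "~" survives from that string, which is what the new test in the second patch checks.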
You'll note I left in the 2732 regex. It's not used right now, but I can
imagine a scenario where URI::Escape lets the user choose which escape
set to use. A flexible set of options could be added to uri_escape() in
place of the second argument. For example...
uri_escape( $uri, { rfc => 2732 } );
Would be equivalent to:
uri_escape( $uri, "^A-Za-z0-9\-_.!~*'()" );
Except the user won't get it wrong.
The current manual escape syntax can be deprecated and replaced with:
uri_escape( $uri, { escape => qr/[^...]/ } );
Or even...
uri_escape( $uri, { escape => ["A".."Z", "a".."z", ...] } );
uri_escape() would check if the second argument is a hash ref to
determine how to interpret it.
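A rough sketch of how that dispatch might look, written as a standalone function rather than a patch to the module. Everything here is an assumption for illustration: the name uri_escape_sketch(), the %unsafe_for table, and the reading of the array form as "a list of characters to escape" (taken from the key name). It also assumes byte input and skips the _fail_hi() handling the real function does.

```perl
use strict;
use warnings;

# Hypothetical named character classes, keyed by RFC number.
my %unsafe_for = (
    2732 => qr/[^A-Za-z0-9\-_.!~*'()]/,
    3986 => qr/[^A-Za-z0-9\-._~]/,
);

# Sketch of the proposed interface; NOT the real URI::Escape::uri_escape.
sub uri_escape_sketch {
    my ($text, $arg) = @_;
    return undef unless defined $text;

    my $unsafe;
    if (ref($arg) eq 'HASH') {
        if (defined $arg->{rfc}) {
            # Named set: { rfc => 2732 }
            $unsafe = $unsafe_for{ $arg->{rfc} }
                or die "Unknown RFC: $arg->{rfc}\n";
        }
        elsif (ref($arg->{escape}) eq 'ARRAY') {
            # Assumed meaning: a list of single characters to escape.
            my $class = join '', map { quotemeta($_) } @{ $arg->{escape} };
            $unsafe = qr/[$class]/;
        }
        elsif (defined $arg->{escape}) {
            # A precompiled regex matching one character to escape.
            $unsafe = $arg->{escape};
        }
    }
    elsif (defined $arg) {
        # Current syntax: a string holding the body of a character class.
        $unsafe = qr/[$arg]/;
    }
    $unsafe ||= $unsafe_for{3986};    # default when nothing was selected

    $text =~ s/($unsafe)/sprintf "%%%02X", ord $1/ge;
    return $text;
}

print uri_escape_sketch("~*'()", { rfc => 2732 }), "\n";
print uri_escape_sketch("a b", { escape => qr/[^a-z]/ }), "\n";
print uri_escape_sketch("abc", "b-d"), "\n";
```

Because a plain string is never a reference, the existing second-argument syntax keeps working unchanged while the hash ref form is phased in.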
I can't see a use case for users preserving RFC 2732 semantics except
maybe to support fiddly old tests that they don't want to change, so I'm
not going to bother patching that in. It's worth noting that the
URI::Escape patch did not affect the URI tests at all.
From c17e39a26bb4a2a7ea715ca75470e85111808d54 Mon Sep 17 00:00:00 2001
From: Michael G. Schwern <schwern@pobox.com>
Date: Thu, 28 Jan 2010 13:41:57 -0800
Subject: [PATCH 1/2] Rewrite the URI::Escape tests with Test::More
---
t/escape.t | 43 +++++++++++++++++--------------------------
1 files changed, 17 insertions(+), 26 deletions(-)
diff --git a/t/escape.t b/t/escape.t
index daebd9d..46da877 100644
--- a/t/escape.t
+++ b/t/escape.t
@@ -1,48 +1,39 @@
#!perl -w
-print "1..9\n";
+use strict;
+use warnings;
+
+use Test::More tests => 10;
use URI::Escape;
-print "not " unless uri_escape("|abcå") eq "%7Cabc%E5";
-print "ok 1\n";
+is uri_escape("|abcå"), "%7Cabc%E5";
-print "not " unless uri_escape("abc", "b-d") eq "a%62%63";
-print "ok 2\n";
+is uri_escape("abc", "b-d"), "a%62%63";
-print "not " if defined(uri_escape(undef));
-print "ok 3\n";
+is uri_escape(undef), undef;
-print "not " unless uri_unescape("%7Cabc%e5") eq "|abcå";
-print "ok 4\n";
+is uri_unescape("%7Cabc%e5"), "|abcå";
-print "not " unless join(":", uri_unescape("%40A%42", "CDE", "F%47H")) eq
- '@AB:CDE:FGH';
-print "ok 5\n";
+is_deeply [uri_unescape("%40A%42", "CDE", "F%47H")], [qw(@AB CDE FGH)];
use URI::Escape qw(%escapes);
-print "not" unless $escapes{"%"} eq "%25";
-print "ok 6\n";
+is $escapes{"%"}, "%25";
use URI::Escape qw(uri_escape_utf8);
-print "not " unless uri_escape_utf8("|abcå") eq "%7Cabc%C3%A5";
-print "ok 7\n";
+is uri_escape_utf8("|abcå"), "%7Cabc%C3%A5";
-if ($] < 5.008) {
- print "ok 8 # skip perl-5.8 required\n";
- print "ok 9 # skip perl-5.8 required\n";
-}
-else {
- eval { print uri_escape("abc" . chr(300)) };
- print "not " unless $@ && $@ =~ /^Can\'t escape \\x{012C}, try uri_escape_utf8\(\) instead/;
- print "ok 8\n";
+SKIP: {
+ skip "Perl 5.8.0 or higher required", 3 if $] < 5.008;
+
+ ok !eval { print uri_escape("abc" . chr(300)); 1 };
+ like $@, qr/^Can\'t escape \\x{012C}, try uri_escape_utf8\(\) instead/;
- print "not " unless uri_escape_utf8(chr(0xFFF)) eq "%E0%BF%BF";
- print "ok 9\n";
+ is uri_escape_utf8(chr(0xFFF)), "%E0%BF%BF";
}
--
1.6.6.1
From 24f4d462614e303bccb64675bfefe01fbd9ffc40 Mon Sep 17 00:00:00 2001
From: Michael G. Schwern <schwern@pobox.com>
Date: Thu, 28 Jan 2010 13:45:50 -0800
Subject: [PATCH 2/2] Update URI::Escape for RFC 3986
---
URI/Escape.pm | 43 ++++++++++++++++++++++++-------------------
t/escape.t | 7 ++++---
2 files changed, 28 insertions(+), 22 deletions(-)
diff --git a/URI/Escape.pm b/URI/Escape.pm
index c2da23b..4543212 100644
--- a/URI/Escape.pm
+++ b/URI/Escape.pm
@@ -15,26 +15,27 @@ URI::Escape - Escape and unescape unsafe characters
=head1 DESCRIPTION
This module provides functions to escape and unescape URI strings as
-defined by RFC 2396 (and updated by RFC 2732).
-A URI consists of a restricted set of characters,
-denoted as C<uric> in RFC 2396. The restricted set of characters
-consists of digits, letters, and a few graphic symbols chosen from
-those common to most of the character encodings and input facilities
-available to Internet users:
+defined by RFC 3986.
- "A" .. "Z", "a" .. "z", "0" .. "9",
- ";", "/", "?", ":", "@", "&", "=", "+", "$", ",", "[", "]", # reserved
- "-", "_", ".", "!", "~", "*", "'", "(", ")"
+A URI consists of a restricted set of characters. The restricted set
+of characters consists of digits, letters, and a few graphic symbols
+chosen from those common to most of the character encodings and input
+facilities available to Internet users. They are made up of the
+"unreserved" and "reserved" character sets as defined in RFC 3986.
+
+ unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
+ reserved = ":" / "/" / "?" / "#" / "[" / "]" / "@"
+ "!" / "$" / "&" / "'" / "(" / ")"
+ / "*" / "+" / "," / ";" / "="
In addition, any byte (octet) can be represented in a URI by an escape
sequence: a triplet consisting of the character "%" followed by two
hexadecimal digits. A byte can also be represented directly by a
-character, using the US-ASCII character for that octet (iff the
-character is part of C<uric>).
+character, using the US-ASCII character for that octet.
-Some of the C<uric> characters are I<reserved> for use as delimiters
-or as part of certain URI components. These must be escaped if they are
-to be treated as ordinary data. Read RFC 2396 for further details.
+Some of the characters are I<reserved> for use as delimiters or as
+part of certain URI components. These must be escaped if they are to
+be treated as ordinary data. Read RFC 3986 for further details.
The functions provided (and exported by default) from this module are:
@@ -61,10 +62,10 @@ character class (between [ ]). E.g.:
"^A-Za-z" # everything not a letter
The default set of characters to be escaped is all those which are
-I<not> part of the C<uric> character class shown above as well as the
-reserved characters. I.e. the default is:
+I<not> part of the C<unreserved> character class shown above as well
+as the reserved characters. I.e. the default is:
- "^A-Za-z0-9\-_.!~*'()"
+ "^A-Za-z0-9\-\._~"
=item uri_escape_utf8( $string )
@@ -156,6 +157,11 @@ for (0..255) {
my %subst; # compiled patternes
+my %Unsafe = (
+ RFC2732 => qr/[^A-Za-z0-9\-_.!~*'()]/,
+ RFC3986 => qr/[^A-Za-z0-9\-\._~]/,
+);
+
sub uri_escape
{
my($text, $patn) = @_;
@@ -169,8 +175,7 @@ sub uri_escape
}
&{$subst{$patn}}($text);
} else {
- # Default unsafe characters. RFC 2732 ^(uric - reserved)
- $text =~ s/([^A-Za-z0-9\-_.!~*'()])/$escapes{$1} || _fail_hi($1)/ge;
+ $text =~ s/($Unsafe{RFC3986})/$escapes{$1} || _fail_hi($1)/ge;
}
$text;
}
diff --git a/t/escape.t b/t/escape.t
index 46da877..7867160 100644
--- a/t/escape.t
+++ b/t/escape.t
@@ -3,7 +3,7 @@
use strict;
use warnings;
-use Test::More tests => 10;
+use Test::More tests => 11;
use URI::Escape;
@@ -11,6 +11,9 @@ is uri_escape("|abc
is uri_escape("abc", "b-d"), "a%62%63";
+# New escapes in RFC 3986
+is uri_escape("~*'()"), "~%2A%27%28%29";
+
is uri_escape(undef), undef;
is uri_unescape("%7Cabc%e5"), "|abcå";
@@ -35,5 +38,3 @@ SKIP: {
is uri_escape_utf8(chr(0xFFF)), "%E0%BF%BF";
}
-
-
--
1.6.6.1