
This queue is for tickets about the Regexp-Common CPAN distribution.

Report information
The Basics
Id: 2907
Status: open
Priority: 0/
Queue: Regexp-Common

People
Owner: Nobody in particular
Requestors: MARKSTOS [...] cpan.org
Cc: hirose31 [...] gmail.com
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 2.113
Fixed in: (no value)



Subject: support for HTTP URIs with internal hyperlinks
Hello,

I'd like to have an option to the HTTP matcher so that it can match URIs which include a reference to a named anchor on the page, like this:

http://example.com/index.html#halfway_down_the_page

Currently the "#" and the characters after it don't seem to be included.

Thanks!

Mark
From: perl [...] crystalflame.net
[MARKSTOS - Sun Jul 6 09:05:14 2003]:
> Hello,
>
> I'd like to have an option to the HTTP matcher so that it can have the
> possibility of matching URIs which include a reference to a named anchor
> on the page, like this:
>
> http://example.com/index.html#halfway_down_the_page
>
> Currently the "#" and the characters after it don't seem to be included.
The published URI specification, RFC 2396, excludes #anchors from the URI spec. If you're using RE_URI or RE_URI_HTTP as currently shipped with this module, you'll find it strictly adheres to the specification.

Abigail, the draft replacement for RFC 2396 has fixed this "bug": URI fragments are now acceptable as part of a URI. You can review the draft specification at http://gbiv.com/protocols/uri/rev-2002/issues.html , and I would petition you to update the URI regular expressions to support URI fragments.

If this is a task you'd rather someone else do, I'd be happy to.
From: hirose31 [...] gmail.com
On Mon Dec 22 18:49:27 2003, guest wrote:
> The published URI specification, RFC 2396, excludes #anchors from the
> URI spec. If you're using RE_URI or RE_URI_HTTP as currently shipped
> with this module, you'll find it strictly adheres to the specification.
>
> Abigail, the draft replacement for RFC 2396 has fixed this "bug": URI
> fragments are now acceptable as part of a URI. You can review the draft
> specification at http://gbiv.com/protocols/uri/rev-2002/issues.html ,
> and I would petition you to update the URI regular expressions to
> support URI fragments.
>
> If this is a task you'd rather someone else do, I'd be happy to.
RFC3986 obsoletes RFC2396, and RFC3986 says:

  URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

I attach a patch that adds #fragment to $RE{URI}{HTTP} and captures it as $9. Would you like to apply this patch?
--- /usr/local/share/perl/5.8.4/Regexp/Common/URI/http.pm 2004-06-10 06:42:48.000000000 +0900
+++ Regexp/Common/URI/http.pm 2006-10-21 00:53:48.000000000 +0900
@@ -7,14 +7,14 @@
 use Regexp::Common qw /pattern clean no_defaults/;
 use Regexp::Common::URI qw /register_uri/;
-use Regexp::Common::URI::RFC2396 qw /$host $port $path_segments $query/;
+use Regexp::Common::URI::RFC2396 qw /$host $port $path_segments $query $fragment/;
 use vars qw /$VERSION/;
 ($VERSION) = q $Revision: 2.101 $ =~ /[\d.]+/g;
 my $http_uri = "(?k:(?k:http)://(?k:$host)(?::(?k:$port))?" .
- "(?k:/(?k:(?k:$path_segments)(?:[?](?k:$query))?))?)";
+ "(?k:/(?k:(?k:$path_segments)(?:[?](?k:$query))?))?(?k:#$fragment)?)";
 register_uri HTTP => $http_uri;
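For anyone who wants to see the intended behaviour without applying the patch, here is a minimal standalone sketch of an HTTP URI match with an optional trailing fragment. It does not use Regexp::Common at all: the sub-patterns and the helper name `split_http_uri` are hypothetical and much looser than the module's RFC-derived ones, and only the fragment rule is modelled on RFC 3986 (`fragment = *( pchar / "/" / "?" )`). Note that the sketch captures the fragment without the leading "#", whereas the patch's `(?k:#$fragment)` group includes it.

```perl
use strict;
use warnings;

# Fragment rule modelled on RFC 3986: *( pchar / "/" / "?" ).
my $pct      = qr{%[0-9A-Fa-f]{2}};
my $fragment = qr{(?:[A-Za-z0-9\-._~!\$&'()*+,;=:\@/?]|$pct)*};

# Hypothetical, deliberately loose stand-ins for the other components.
my $scheme = qr{http};
my $host   = qr{[A-Za-z0-9.-]+};
my $port   = qr{[0-9]+};
my $path   = qr{/[^?\#\s]*};
my $query  = qr{[^\#\s]*};

# Same overall shape as the patched pattern: the fragment is an
# optional group that follows the optional path and query.
my $http_uri = qr{\A($scheme)://($host)(?::($port))?($path)?(?:[?]($query))?(?:\#($fragment))?\z};

# Returns a hash ref of components, or nothing if the URI does not match.
sub split_http_uri {
    my ($uri) = @_;
    return unless $uri =~ $http_uri;
    return {
        scheme   => $1,
        host     => $2,
        port     => $3,
        path     => $4,
        query    => $5,
        fragment => $6,   # undef when the URI has no "#..." part
    };
}

my $parts = split_http_uri('http://example.com/index.html#halfway_down_the_page');
print "fragment: $parts->{fragment}\n";   # prints "fragment: halfway_down_the_page"
```

Because the fragment group is optional, URIs without a "#" part still match and simply leave that capture undefined.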
RT-Send-CC: perl [...] crystalflame.net, hirose31 [...] gmail.com
Here is the link to the relevant section in RFC3986: http://tools.ietf.org/html/rfc3986#section-3.5
Subject: support for HTTP URIs with internal hyperlinks (next steps towards release?)
This bug report is about 8 years old and includes a specific RFC reference to back it up, and a patch to fix it.

What else can we do to get the issue completely resolved and released?

Mark
On Wed May 18 10:39:47 2011, MARKSTOS wrote:
> This bug report is about 8 years old and includes specific RFC reference
> to back it up, and a patch to fix it.
>
> What else can we do to the get the issue completely resolved and
> released?
>
> Mark
The patch is a little incomplete. $fragment isn't defined in RFC2396. We really need to make an RFC3986.pm to handle the new things. We also need to write tests to make sure URLs with and without fragments work.

That's just my theory. I'm putting stuff together to make a pull request.
> The patch is a little incomplete. $fragment isn't defined in RFC2396.
> We really need to make a RFC3986.pm to handle the new things.
I take that back. $fragment is defined, so I'm adding the patch as written. I'm also going to try to add the userinfo stuff and tests for it, because otherwise there would be two pull requests touching the same line, which is an automatic conflict. Might as well do them both together.
> I take that back. $fragment is defined so I'm adding the patch as
> defined but I'm going to try to add the userinfo stuff and test for
> those as well because otherwise it'll be two pull requests touching
> the same line which is an automatic conflict. Might as well do them
> both together.
Even before touching the username/password part, adding the fragment code makes around 6000 tests fail. I've got that down to around 1000 by changing the expectations in URI/http.t, but I still have to work on the tests to make things work right.

The username/password part is going to break people's use of the regex unless we use named-only captures, or put the username/password capture only in a new regex and let people continue using this one as-is.
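To make the capture-numbering concern concrete, here is a rough sketch in plain Perl (5.10 or later, for named captures). The three patterns and both helper subs are hypothetical simplifications, not Regexp::Common's own patterns or its `-keep` mechanism. The sketch only shows that inserting a userinfo group ahead of the host shifts the positional captures, while a named capture keeps pointing at the same component.

```perl
use strict;
use warnings;

# Hypothetical, simplified patterns for illustration only.
my $without_userinfo = qr{\A(http)://([A-Za-z0-9.-]+)(/\S*)?\z};
my $with_userinfo    = qr{\A(http)://(?:([^\@/\s]+)\@)?([A-Za-z0-9.-]+)(/\S*)?\z};
my $named            = qr{\A(?<scheme>http)://(?:(?<userinfo>[^\@/\s]+)\@)?(?<host>[A-Za-z0-9.-]+)(?<path>/\S*)?\z};

# Existing-style caller that assumes the host is always capture 2.
sub host_positional {
    my ($re, $uri) = @_;
    return $uri =~ $re ? $2 : undef;
}

# Caller that asks for the host by name, independent of group order.
sub host_named {
    my ($uri) = @_;
    return $uri =~ $named ? $+{host} : undef;
}

my $uri = 'http://user:pw@example.com/x';
print "positional: ", host_positional($with_userinfo, $uri), "\n";   # prints "positional: user:pw"
print "named:      ", host_named($uri), "\n";                        # prints "named:      example.com"
```

With the userinfo group added, code that reads `$2` as the host silently gets the userinfo instead (or undef when no userinfo is present), which is the breakage described above. Keeping the existing pattern untouched and offering the userinfo-aware one under a new name avoids that without requiring named captures.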