This queue is for tickets about the URI-Find CPAN distribution.

Report information
The Basics
Id: 38649
Status: open
Priority: 0/
Queue: URI-Find

People
Owner: Nobody in particular
Requestors: avar [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.16
Fixed in: (no value)



Subject: URIs like http://x.org/a_(b) are found as http://x.org/a_(b -- Without the closing ")"
$ perl -MURI::Find -wle 'print $URI::Find::VERSION; my $finder = URI::Find->new(sub { print shift }); $finder->find(\shift())' 'http://x.org/a_(b)'
0.16
http://x.org/a_(b
This is due to the "decrufting" process which prevents probable punctuation from being considered part of the URL.

One possible way to make that process smarter is to scan before the URI for a matching delimiter. For example, if a URL ends with a ) it would scan before the URL for a ( before it hits another ). If it finds one, it's probably inside a () and can strip the ). Otherwise the ) is probably part of the URL and it can leave it.

This will only work with ] and ). ' and " will probably generate too many false positives.
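The backward scan described above could look roughly like the sketch below. It is only an illustration: `closes_outer_delimiter` is a hypothetical helper, not part of URI::Find, and it assumes the caller already has the text that precedes the URL.

```perl
use strict;
use warnings;

# Hypothetical helper (not a URI::Find method): decide whether a trailing
# closing delimiter probably belongs to the surrounding text, not the URL.
# $before is the text preceding the URL, $close is ")" or "]".
sub closes_outer_delimiter {
    my ($before, $close) = @_;

    # Only ) and ] are handled; quotes would give too many false positives.
    my %open_for = (')' => '(', ']' => '[');
    my $open = $open_for{$close} or return 0;

    # Walk backwards from the URL; the first delimiter of this kind decides.
    for my $ch (reverse split //, $before) {
        return 1 if $ch eq $open;    # unclosed opener: the URL sits inside it
        return 0 if $ch eq $close;   # an earlier group was already closed
    }
    return 0;                        # no opener at all: leave the delimiter
}

print closes_outer_delimiter('(see ', ')')       ? "strip\n" : "keep\n";
print closes_outer_delimiter('The URL is ', ')') ? "strip\n" : "keep\n";
```

A result of 1 means the trailing delimiter can be treated as cruft; 0 means it should stay on the URL.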
On Fri Aug 22 15:38:23 2008, MSCHWERN wrote:
> This is due to the "decrufting" process which prevents probable
> punctuation from being considered part of the URL.
>
> One possible way to make that process smarter is to scan before the URI
> for a matching delimiter. For example, if a URL ends with a ) it would
> scan before the URL for a ( before it hits another ). If it finds one,
> it's probably inside a () and can strip the ). Otherwise the ) is
> probably part of the URL and it can leave it.
>
> This will only work with ] and ). ' and " will probably generate too
> many false positives.
Yeah, that's basically how you have to do it: see if there are any existing open parens inside the URL and, if so, try to match them up.

I've always been quite happy with how gnus in Emacs does it. For reference, here's a patch to rcirc.el that I sent to emacs-devel because they were having the same problem. It does just what you suggest and matches up parens if they're open already:

----
From: avar@cpan.org (Ævar Arnfjörð Bjarmason)
Subject: [PATCH] Make rcirc.el rcirc-url-regexp use the gnus-button-url-regexp regexp
Newsgroups: gmane.emacs.devel
To: emacs-devel@gnu.org
Date: Sun, 23 Dec 2007 03:03:09 +0000

I was having an issue with rcirc including at the end of URIs. I fixed
it by using the regex gnus uses. Perhaps it's better to amend the old
one.

Index: net/rcirc.el
===================================================================
RCS file: /sources/emacs/emacs/lisp/net/rcirc.el,v
retrieving revision 1.40
diff -u -r1.40 rcirc.el
--- net/rcirc.el 1 Nov 2007 03:51:47 -0000 1.40
+++ net/rcirc.el 23 Dec 2007 02:36:07 -0000
@@ -2121,24 +2121,26 @@
     (rcirc-add-face 0 (length string) face string)
     string))
 
+;; The regexp is copied from gnus-button-url-regexp in gnus-art.el
 (defvar rcirc-url-regexp
-  (rx-to-string
-   `(and word-boundary
-         (or (and
-              (or (and (or "http" "https" "ftp" "file" "gopher" "news"
-                           "telnet" "wais" "mailto")
-                       "://")
-                  "www.")
-              (1+ (char "-a-zA-Z0-9_."))
-              (1+ (char "-a-zA-Z0-9_"))
-              (optional ":" (1+ (char "0-9"))))
-             (and (1+ (char "-a-zA-Z0-9_."))
-                  (or ".com" ".net" ".org")
-                  word-boundary))
-         (optional
-          (and "/"
-               (1+ (char "-a-zA-Z0-9_='!?#$\@~`%&*+|\\/:;.,{}[]()"))
-               (char "-a-zA-Z0-9_=#$\@~`%&*+|\\/:;{}[]()")))))
+  (concat
+   "\\b\\(\\(www\\.\\|\\(s?https?\\|ftp\\|file\\|gopher\\|"
+   "nntp\\|news\\|telnet\\|wais\\|mailto\\|info\\):\\)"
+   "\\(//[-a-z0-9_.]+:[0-9]*\\)?"
+   (if (string-match "[[:digit:]]" "1") ;; Support POSIX?
+       (let ((chars "-a-z0-9_=#$@~%&*+\\/[:word:]")
+             (punct "!?:;.,"))
+         (concat
+          "\\(?:"
+          ;; Match paired parentheses, e.g. in Wikipedia URLs:
+          "[" chars punct "]+" "(" "[" chars punct "]+" "[" chars "]*)" "[" chars "]"
+          "\\|"
+          "[" chars punct "]+" "[" chars "]"
+          "\\)"))
+     (concat ;; XEmacs 21.4 doesn't support POSIX.
+      "\\([-a-z0-9_=!?#$@~%&*+\\/:;.,]\\|\\w\\)+"
+      "\\([-a-z0-9_=#$@~%&*+\\/]\\|\\w\\)"))
+   "\\)")
   "Regexp matching URLs. Set to nil to disable URL features in rcirc.")
 
 (defun rcirc-browse-url (&optional arg)
Subject: Re: [rt.cpan.org #38649] URIs like http://x.org/a_(b) are found as http://x.org/a_(b -- Without the closing ")"
Date: Fri, 22 Aug 2008 17:21:41 -0700
To: bug-URI-Find [...] rt.cpan.org
From: Michael G Schwern <schwern [...] pobox.com>
AEvar Arnfjord Bjarmason via RT wrote:
> Queue: URI-Find
> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=38649 >
>
> On Fri Aug 22 15:38:23 2008, MSCHWERN wrote:
>> This is due to the "decrufting" process which prevents probable
>> punctuation from being considered part of the URL.
>>
>> One possible way to make that process smarter is to scan before the URI
>> for a matching delimiter. For example, if a URL ends with a ) it would
>> scan before the URL for a ( before it hits another ). If it finds one,
>> it's probably inside a () and can strip the ). Otherwise the ) is
>> probably part of the URL and it can leave it.
>>
>> This will only work with ] and ). ' and " will probably generate too
>> many false positives.
>
> Yeah that's basically how you have to do it, see if there are any
> existing open parens inside the url and if so try to match them up.
I guess both inside and outside the URL have to be considered.

The following would be decrufted, because there's a ( before the URL.

    Lipsom whatever stuff (you can find that at http://www.foo.com).

In the following only the trailing . would be decrufted, because there is no preceding ( before the URL and the () matches inside the URL.

    The URL is http://x.org/a_(b).

This would not be decrufted because of the matching ( inside the URL.

    Lipsom whatever stuff (you can find that at http://z.org/a_(b) online).

-- 
44. I am not the atheist chaplain.
    -- The 213 Things Skippy Is No Longer Allowed To Do In The U.S. Army
       http://skippyslist.com/list/
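To make the three cases concrete, here is a rough sketch of a decruft step that looks both inside and outside the match. It is not URI::Find's actual code: `decruft_sketch` is a hypothetical name, the punctuation set `[.,;!?]` is an assumption, and only ) is handled.

```perl
use strict;
use warnings;

# Hypothetical sketch, not URI::Find's real decruft.
# $before is the text preceding the match, $match is the candidate URL.
# Returns the cleaned URL and the trailing cruft that was removed.
sub decruft_sketch {
    my ($before, $match) = @_;
    my $cruft = '';

    while (length $match) {
        my $last = substr $match, -1;

        if ($last =~ /[.,;!?]/) {
            # Assumed cruft set: trailing sentence punctuation always goes.
        }
        elsif ($last eq ')') {
            # Inside the URL: a ) with a matching ( is part of the URL.
            my $opens  = () = $match =~ /\(/g;
            my $closes = () = $match =~ /\)/g;
            last if $closes <= $opens;

            # Outside the URL: strip a surplus ) only if an unclosed (
            # precedes the URL (a ( found before hitting another paren).
            last unless $before =~ /\([^()]*\z/;
        }
        else {
            last;
        }

        $cruft = chop($match) . $cruft;   # move the last character to the cruft
    }
    return ($match, $cruft);
}

my @examples = (
    [ 'Lipsom whatever stuff (you can find that at ', 'http://www.foo.com).' ],
    [ 'The URL is ',                                  'http://x.org/a_(b).'  ],
    [ 'Lipsom whatever stuff (you can find that at ', 'http://z.org/a_(b)'   ],
);
for my $example (@examples) {
    my ($url, $removed) = decruft_sketch(@$example);
    print "url=$url cruft=$removed\n";
}
```

On the three examples this yields `http://www.foo.com` with `).` removed, `http://x.org/a_(b)` with only `.` removed, and `http://z.org/a_(b)` untouched.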
Now that we capture the text before the URI, this is easier to implement.
On Thu Aug 21 21:13:38 2008, AVAR wrote:
> $ perl -MURI::Find -wle 'print $URI::Find::VERSION; my $finder =
> URI::Find->new(sub { print shift }); $finder->find(\shift())'
> 'http://x.org/a_(b)'
> 0.16
> http://x.org/a_(b
I stumbled over the same issue, but with a much bigger impact:

perl -MURI::Find -wle 'my $x = shift; my $finder = URI::Find->new(sub { my($uri, $orig_uri) = @_; return qq|<a href="$uri">$orig_uri</a>|; }); $finder->find(\$x); print $x' 'http://x.org/a_(b)'
<a href="http://x.org/a_(b">http://x.org/a_(b</a>)

It takes the ) from the end of the URL (in decruft) and appends it _at the end of the whole replacement_ (in recruft).

I put some warnings in Find.pm. Here is the output:

perl -MURI::Find -wle 'my $x = shift; my $finder = URI::Find->new(sub { my($uri, $orig_uri) = @_; return qq|<a href="$uri">$orig_uri</a>|; }); $finder->find(\$x); print $x' 'http://x.org/a_(b)'
orig_match(http://x.org/a_(b)) at /opt/perl/lib/site_perl/5.10.0/URI/Find.pm line 305.
end_cruft()) at /opt/perl/lib/site_perl/5.10.0/URI/Find.pm line 309.
start_cruft(), uri(<a href="http://x.org/a_(b">http://x.org/a_(b</a>), end_cruft()) at /opt/perl/lib/site_perl/5.10.0/URI/Find.pm line 328.
<a href="http://x.org/a_(b">http://x.org/a_(b</a>)

That leads to malformed output if the URL contains an already escaped HTML entity. That's what happens to me with Angerwhale...
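The decruft/recruft round trip reported above can be imitated with a small stand-alone sketch. Everything in it is an assumption made for illustration: `replace_with_recruft` is a hypothetical function, the cruft character set is a guess, and none of it is taken from Find.pm. The second call shows one plausible reading of the entity problem, where the trailing ; of an escaped entity is treated as cruft.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the decruft/recruft round trip; the cruft set
# and the function name are assumptions, not URI::Find internals.
sub replace_with_recruft {
    my ($orig_match, $callback) = @_;

    # "Decruft": peel trailing punctuation off the end of the match.
    my ($uri, $end_cruft) = $orig_match =~ /\A(.*?)([).,;'"\]]*)\z/s;

    # The callback only ever sees the decrufted URI ...
    my $replacement = $callback->($uri);

    # ... and "recruft" re-attaches the punctuation after the whole replacement.
    return $replacement . $end_cruft;
}

my $link = sub { my ($uri) = @_; return qq|<a href="$uri">$uri</a>| };

print replace_with_recruft('http://x.org/a_(b)', $link), "\n";
print replace_with_recruft('http://x.org/?q=a&amp;', $link), "\n";
```

The first call prints the same broken link as in the report; the second leaves `&amp` without its terminating ; inside the href.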