Bug #38649 for URI-Find: URIs like http://x.org/a_(b) are found as http://x.org/a_(b -- Without the closing ")"

Thu Aug 21 21:13:38 2008 avar [...] cpan.org - Ticket created

Subject:

URIs like http://x.org/a_(b) are found as http://x.org/a_(b -- Without the closing ")"

$ perl -MURI::Find -wle 'print $URI::Find::VERSION; my $finder = URI::Find->new(sub { print shift }); $finder->find(\shift())' 'http://x.org/a_(b)'
0.16
http://x.org/a_(b

Fri Aug 22 15:38:23 2008 mschwern [...] cpan.org - Correspondence added

This is due to the "decrufting" process which prevents probable punctuation from being considered part of the URL.

One possible way to make that process smarter is to scan before the URI for a matching delimiter. For example, if a URL ends with a ) it would scan before the URL for a ( before it hits another ). If it finds one, it's probably inside a () and can strip the ). Otherwise the ) is probably part of the URL and it can leave it.

This will only work with ] and ). ' and " will probably generate too many false positives.
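The backward scan described above can be sketched as follows. This is a minimal illustration in Python, not URI::Find's code: the function names `enclosing_open_paren` and `strip_trailing_paren` are hypothetical, and the sketch assumes the caller already has the text that precedes the candidate URL. It looks only at that preceding text, so it handles `)` alone; `]` would follow the same pattern.

```python
def enclosing_open_paren(text_before: str) -> bool:
    """Return True if the nearest parenthesis before the URL is an open "(".

    Scans backwards from the end of text_before and stops at the first
    parenthesis found: "(" means the URL sits inside a parenthetical,
    ")" means any earlier "(" has already been closed.
    """
    for ch in reversed(text_before):
        if ch == "(":
            return True
        if ch == ")":
            return False
    return False


def strip_trailing_paren(text_before: str, candidate: str) -> str:
    """Drop a trailing ")" only when the surrounding prose opened a "("."""
    if candidate.endswith(")") and enclosing_open_paren(text_before):
        return candidate[:-1]
    return candidate


# Inside a parenthetical: the ")" belongs to the prose and is stripped.
print(strip_trailing_paren("(see ", "http://www.foo.com)"))
# No "(" before the URL: the ")" is left as part of the URL.
print(strip_trailing_paren("", "http://x.org/a_(b)"))
```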

Fri Aug 22 15:38:26 2008 The RT System itself - Status changed from 'new' to 'open'

Fri Aug 22 19:25:10 2008 avar [...] cpan.org - Correspondence added

On Fri Aug 22 15:38:23 2008, MSCHWERN wrote:

> This is due to the "decrufting" process which prevents probable
> punctuation from being considered part of the URL.
>
> One possible way to make that process smarter is to scan before the URI
> for a matching delimiter. For example, if a URL ends with a ) it would
> scan before the URL for a ( before it hits another ). If it finds one,
> it's probably inside a () and can strip the ). Otherwise the ) is
> probably part of the URL and it can leave it.
>
> This will only work with ] and ). ' and " will probably generate too
> many false positives.

Yeah, that's basically how you have to do it: see if there are any existing open parens inside the URL and, if so, try to match them up.

I've always been quite happy with how gnus in Emacs does it. For reference, here's a patch to rcirc.el that I sent to emacs-devel because they were having the same problem. It does just what you suggest and matches up parens if they're open already:

----
From: avar@cpan.org (Ævar Arnfjörð Bjarmason)
Subject: [PATCH] Make rcirc.el rcirc-url-regexp use the gnus-button-url-regexp regexp
Newsgroups: gmane.emacs.devel
To: emacs-devel@gnu.org
Date: Sun, 23 Dec 2007 03:03:09 +0000

I was having an issue with rcirc including at the end of URIs. I fixed
it by using the regex gnus uses. Perhaps it's better to amend the old
one.

Index: net/rcirc.el
===================================================================
RCS file: /sources/emacs/emacs/lisp/net/rcirc.el,v
retrieving revision 1.40
diff -u -r1.40 rcirc.el
--- net/rcirc.el	1 Nov 2007 03:51:47 -0000	1.40
+++ net/rcirc.el	23 Dec 2007 02:36:07 -0000
@@ -2121,24 +2121,26 @@
   (rcirc-add-face 0 (length string) face string)
   string))
 
+;; The regexp is copied from gnus-button-url-regexp in gnus-art.el
 (defvar rcirc-url-regexp
-  (rx-to-string
-   `(and word-boundary
-	 (or (and
-	      (or (and (or "http" "https" "ftp" "file" "gopher" "news"
-			   "telnet" "wais" "mailto")
-		       "://")
-		  "www.")
-	      (1+ (char "-a-zA-Z0-9_."))
-	      (1+ (char "-a-zA-Z0-9_"))
-	      (optional ":" (1+ (char "0-9"))))
-	     (and (1+ (char "-a-zA-Z0-9_."))
-		  (or ".com" ".net" ".org")
-		  word-boundary))
-	 (optional
-	  (and "/"
-	       (1+ (char "-a-zA-Z0-9_='!?#$\@~`%&*+|\\/:;.,{}[]()"))
-	       (char "-a-zA-Z0-9_=#$\@~`%&*+|\\/:;{}[]()")))))
+  (concat
+   "\\b\\(\\(www\\.\\|\\(s?https?\\|ftp\\|file\\|gopher\\|"
+   "nntp\\|news\\|telnet\\|wais\\|mailto\\|info\\):\\)"
+   "\\(//[-a-z0-9_.]+:[0-9]*\\)?"
+   (if (string-match "[[:digit:]]" "1") ;; Support POSIX?
+       (let ((chars "-a-z0-9_=#$@~%&*+\\/[:word:]")
+	     (punct "!?:;.,"))
+	 (concat
+	  "\\(?:"
+	  ;; Match paired parentheses, e.g. in Wikipedia URLs:
+	  "[" chars punct "]+" "(" "[" chars punct "]+" "[" chars "]*)" "[" chars "]"
+	  "\\|"
+	  "[" chars punct "]+" "[" chars "]"
+	  "\\)"))
+     (concat ;; XEmacs 21.4 doesn't support POSIX.
+      "\\([-a-z0-9_=!?#$@~%&*+\\/:;.,]\\|\\w\\)+"
+      "\\([-a-z0-9_=#$@~%&*+\\/]\\|\\w\\)"))
+   "\\)")
   "Regexp matching URLs.  Set to nil to disable URL features in rcirc.")
 
 (defun rcirc-browse-url (&optional arg)

_______________________________________________ Emacs-devel mailing list Emacs-devel@gnu.org http://lists.gnu.org/mailman/listinfo/emacs-devel

Fri Aug 22 20:24:10 2008 schwern [...] pobox.com - Correspondence added

Subject:	Re: [rt.cpan.org #38649] URIs like http://x.org/a_(b) are found as http://x.org/a_(b -- Without the closing ")"
Date:	Fri, 22 Aug 2008 17:21:41 -0700
To:	bug-URI-Find [...] rt.cpan.org
From:	Michael G Schwern <schwern [...] pobox.com>

AEvar Arnfjord Bjarmason via RT wrote:

> Queue: URI-Find
> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=38649 >
>
> On Fri Aug 22 15:38:23 2008, MSCHWERN wrote:
>> This is due to the "decrufting" process which prevents probable
>> punctuation from being considered part of the URL.
>>
>> One possible way to make that process smarter is to scan before the URI
>> for a matching delimiter. For example, if a URL ends with a ) it would
>> scan before the URL for a ( before it hits another ). If it finds one,
>> it's probably inside a () and can strip the ). Otherwise the ) is
>> probably part of the URL and it can leave it.
>>
>> This will only work with ] and ). ' and " will probably generate too
>> many false positives.
>
> Yeah that's basically how you have to do it, see if there are any
> existing open parens inside the url and if so try to match them up.

I guess both inside and outside the URL have to be considered.

The following would be decrufted, because there's a ( before the URL.

    Lipsom whatever stuff (you can find that at http://www.foo.com).

In the following only the trailing . would be decrufted, because there is no preceding ( before the URL and the () matches inside the URL.

    The URL is http://x.org/a_(b).

This would not be decrufted because of the matching ( inside the URL.

    Lipsom whatever stuff (you can find that at http://z.org/a_(b) online).

-- 
44. I am not the atheist chaplain.
    -- The 213 Things Skippy Is No Longer Allowed To Do In The U.S. Army
           http://skippyslist.com/list/
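The three cases above can be expressed as one rule: strip a trailing ) only when it has no matching ( inside the URL and the preceding text has an open (. The following is a rough sketch of that rule in Python, assuming the candidate match and the text before it are both available. All names (`decruft`, `open_paren_before`, `final_paren_matched_inside`, `SENTENCE_PUNCT`) are hypothetical, and the set of sentence punctuation is an assumption, not URI::Find's actual cruft list.

```python
from typing import Tuple

SENTENCE_PUNCT = ".,;:!?"  # assumed cruft characters for this sketch


def open_paren_before(text_before: str) -> bool:
    """True if the nearest parenthesis before the URL is an unclosed "("."""
    for ch in reversed(text_before):
        if ch == "(":
            return True
        if ch == ")":
            return False
    return False


def final_paren_matched_inside(uri: str) -> bool:
    """True if the last ")" of uri closes a "(" that appears earlier in uri."""
    depth = 0
    for ch in uri[:-1]:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
    return depth > 0


def decruft(text_before: str, candidate: str) -> Tuple[str, str]:
    """Split candidate into (uri, trailing_cruft)."""
    uri, cruft = candidate, ""
    while uri:
        last = uri[-1]
        is_punct = last in SENTENCE_PUNCT
        is_prose_paren = (
            last == ")"
            and not final_paren_matched_inside(uri)
            and open_paren_before(text_before)
        )
        if not (is_punct or is_prose_paren):
            break
        uri, cruft = uri[:-1], last + cruft
    return uri, cruft


prose = "Lipsom whatever stuff (you can find that at "
print(decruft(prose, "http://www.foo.com)."))        # ")" and "." are cruft
print(decruft("The URL is ", "http://x.org/a_(b)."))  # only "." is cruft
print(decruft(prose, "http://z.org/a_(b)"))           # nothing is cruft
```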

Mon Mar 16 19:17:25 2009 mschwern [...] cpan.org - Correspondence added

Now that we capture the text before the URI this is now easier to implement.

Tue May 26 11:03:59 2009 FISH [...] cpan.org - Correspondence added

On Thu, 21 Aug 2008, 21:13:38, AVAR wrote:

> $ perl -MURI::Find -wle 'print $URI::Find::VERSION; my $finder =
> URI::Find->new(sub { print shift }); $finder->find(\shift())'
> 'http://x.org/a_(b)'
> 0.16
> http://x.org/a_(b

I stumbled upon the same issue, but with a much bigger impact:

perl -MURI::Find -wle 'my $x = shift; my $finder = URI::Find->new(sub { my($uri, $orig_uri) = @_; return qq|<a href="$uri">$orig_uri</a>|; }); $finder->find(\$x); print $x' 'http://x.org/a_(b)'
<a href="http://x.org/a_(b">http://x.org/a_(b</a>)

It takes the ) from the end of the URL (in decruft) and appends it _at the end of the whole replacement_ (in recruft). I put some warnings in Find.pm. Here is the output:

perl -MURI::Find -wle 'my $x = shift; my $finder = URI::Find->new(sub { my($uri, $orig_uri) = @_; return qq|<a href="$uri">$orig_uri</a>|; }); $finder->find(\$x); print $x' 'http://x.org/a_(b)'
orig_match(http://x.org/a_(b)) at /opt/perl/lib/site_perl/5.10.0/URI/Find.pm line 305.
end_cruft()) at /opt/perl/lib/site_perl/5.10.0/URI/Find.pm line 309.
start_cruft(), uri(<a href="http://x.org/a_(b">http://x.org/a_(b</a>), end_cruft()) at /opt/perl/lib/site_perl/5.10.0/URI/Find.pm line 328.
<a href="http://x.org/a_(b">http://x.org/a_(b</a>)

That leads to malformed output if the URL contains an already escaped HTML entity. That's what happens to me with Angerwhale...
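To make the decruft/recruft interaction concrete, here is a toy reproduction in Python. It is a sketch under loud assumptions, not a port of Find.pm: `CANDIDATE` is a deliberately crude matcher, and `split_cruft` and `linkify` are hypothetical names. With `paren_aware=False` it reproduces the shape of the output reported above, where the stripped ) is re-attached after the replacement; with `paren_aware=True` a ) that balances a ( inside the URL stays in the link.

```python
import re

# Toy matcher: anything from the scheme up to the next whitespace.
# URI::Find's real matching is far stricter; this is an assumption.
CANDIDATE = re.compile(r"https?://\S+")


def split_cruft(candidate, paren_aware):
    """Split candidate into (uri, trailing_cruft)."""
    uri, cruft = candidate, ""
    while uri and uri[-1] in ").,;!?":
        balanced = uri.count("(") >= uri.count(")")
        if uri[-1] == ")" and paren_aware and balanced:
            break  # the ")" closes a "(" inside the URL, so keep it
        cruft = uri[-1] + cruft
        uri = uri[:-1]
    return uri, cruft


def linkify(text, paren_aware=False):
    """Wrap each URL in an anchor tag; cruft is re-attached after the tag."""
    def replace(match):
        uri, cruft = split_cruft(match.group(0), paren_aware)
        # Only the decrufted URI reaches the "callback"; real code should
        # also HTML-escape it, which this sketch omits for brevity.
        return '<a href="{0}">{0}</a>'.format(uri) + cruft
    return CANDIDATE.sub(replace, text)


print(linkify("http://x.org/a_(b)"))
print(linkify("http://x.org/a_(b)", paren_aware=True))
```

The first call prints the broken link from the report; the second keeps the closing ) inside both the href and the link text.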
