Skip Menu |

Preferred bug tracker

Please visit the preferred bug tracker to report your issue.

This queue is for tickets about the URI-Find CPAN distribution.

Report information
The Basics
Id: 20486
Status: resolved
Priority: 0/
Queue: URI-Find

People
Owner: Nobody in particular
Requestors: hiranotaka [...] zng.info
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: (no value)



Subject: the sample using escapeHTML doesn't always work safely
The sample in the POD: use CGI qw(escapeHTML); $text = "<pre>\n" . escapeHTML($text) . "</pre>\n"; my $finder = URI::Find->new( sub { my($uri, $orig_uri) = @_; return qq|<a href="$uri">$orig_uri</a>|; }); $finder->find(\$text); doesn't always work safely. Entity references can be broken. For instance, if the input is "<http://www.cpan.org/>", then the output is: "&lt;<a href="http://www.cpan.org/&gt">http://www.cpan.org/&gt</a>;"
On Sun Jul 16 08:10:44 2006, hiranotaka wrote: Show quoted text
> The sample in the POD: > use CGI qw(escapeHTML); > $text = "<pre>\n" . escapeHTML($text) . "</pre>\n"; > my $finder = URI::Find->new( > ... > doesn't always work safely.
Neither does it work with URLs that have &s embedded. This text This text contains http://foo.com?foo=1&bar=2. results in This text contains <a ref="http://foo.com?foo=1&amp;bar=2"> http://foo.com?foo=1&amp;bar=2</a>. hence escaping the '&' in the HREF attribute to '&amp', which breaks the link. To get reliable results, I guess one would have to 1) escape the URLs, but remember which text regions their replacements span 2) escape the rest, carefully omitting anything that has been touched in 1). So URI::Find would need to provide these areas via an API after having performed the replacement.
Ok, I've attached a patch that allows an optional filter function to the find() method, so you can call it like use CGI qw(escapeHTML); # ... my $how_many_found = $finder->find(\$text, \&escapeHTML); and it will both replace the URIs and escape the rest according to escapeHTML. This fixes the bug(s) described earlier. I've also included documentation and 10 test cases. Would be great if you could apply it! -- Mike
? Makefile ? blib ? patch.txt ? pm_to_blib Index: lib/URI/Find.pm =================================================================== RCS file: /home/mikeschilli/cvs/URI-Find/lib/URI/Find.pm,v retrieving revision 1.1.1.1 diff -a -u -r1.1.1.1 Find.pm --- lib/URI/Find.pm 5 Sep 2008 07:55:51 -0000 1.1.1.1 +++ lib/URI/Find.pm 5 Sep 2008 07:57:37 -0000 @@ -96,11 +96,23 @@ $text is a string to search and possibly modify with your callback. +Alternatively, C<find> can be called with a replacement function for +the rest of the text: + + use CGI qw(escapeHTML); + # ... + my $how_many_found = $finder->find(\$text, \&escapeHTML); + +will not only call the callback function for every URL found (and +perform the replacement instructions therein), but also run the rest +of the text through C<escapeHTML()>. This makes it easier to turn +plain text which contains URLs into HTML (see example below). + =cut sub find { - @_ == 2 || __PACKAGE__->badinvo; - my($self, $r_text) = @_; + @_ == 2 || @_ == 3 || __PACKAGE__->badinvo; + my($self, $r_text, $escape_func) = @_; my $urlsfound = 0; @@ -116,10 +128,16 @@ my $uriRe = sprintf '(?:%s|%s)', $self->uri_re, $self->schemeless_uri_re; - $$r_text =~ s{(<$uriRe>|$uriRe)}{ - my($orig_match) = $1; + $$r_text =~ s{(.*?)(<$uriRe>|$uriRe)|(?:(.+)$)}{ + my $orig_match = $2; + my $prematch = $1; + my $replacement = ""; + if(defined $3) { + $orig_match = ""; + $prematch = $3; + } - # A heruristic. Often you'll see things like: + # A heuristic. Often you'll see things like: # "I saw this site, http://www.foo.com, and its really neat!" # or "Foo Industries (at http://www.foo.com)" # We want to avoid picking up the trailing paren, period or comma. @@ -127,17 +145,23 @@ # not it corrects a parse mistake. $orig_match = $self->decruft($orig_match); - if( my $uri = $self->_is_uri(\$orig_match) ) { # Its a URI. + if( length $orig_match and + my $uri = $self->_is_uri(\$orig_match) ) { # Its a URI. $urlsfound++; - # Don't forget to put any cruft we accidentally matched back. - $self->recruft($self->{callback}->($uri, $orig_match)); + $replacement = $self->{callback}->($uri, $orig_match); } else { # False alarm. - # Again, don't forget the cruft. - $self->recruft($orig_match); + $replacement = $orig_match; + } + + if(defined $escape_func) { + $prematch = $escape_func->($prematch); } - }eg; + + # return concatenation of escaped prematch and recrufted URI + $prematch . $self->recruft($replacement); + }gse; URI::URL::strict($old_strict); return $urlsfound; @@ -371,13 +395,15 @@ use CGI qw(escapeHTML); - $text = "<pre>\n" . escapeHTML($text) . "</pre>\n"; my $finder = URI::Find->new( sub { my($uri, $orig_uri) = @_; return qq|<a href="$uri">$orig_uri</a>|; - }); + }, + \&escapeHTML + ); $finder->find(\$text); + print "<pre>$text</pre> =head1 CAVEATS, BUGS, ETC... Index: t/Find.t =================================================================== RCS file: /home/mikeschilli/cvs/URI-Find/t/Find.t,v retrieving revision 1.1.1.1 diff -a -u -r1.1.1.1 Find.t --- t/Find.t 5 Sep 2008 07:55:52 -0000 1.1.1.1 +++ t/Find.t 5 Sep 2008 07:57:37 -0000 @@ -244,3 +244,37 @@ $f->find(\$t); ok $val == URI::URL::strict, "URI::URL::strict $val"; } + +# Test new filter function + +BEGIN { $Total_tests += 10 } +my @tasks = ( + ["Foo&Bar http://abc.com.", "Foo&amp;Bar xx&."], + ["http://abc.com. http://abc.com.", "xx&. xx&."], + ["http://abc.com?foo=bar&baz=foo", "xx&"], + ["& http://abc.com?foo=bar&baz=foo", "&amp; xx&"], + ["http://abc.com?foo=bar&baz=foo &", "xx& &amp;"], + ["Foo&Bar http://abc.com", "Foo&amp;Bar xx&"], + ["http://abc.com. Foo&Bar", "xx&. Foo&amp;Bar"], + ["Foo&Bar http://abc.com. Foo&Bar", "Foo&amp;Bar xx&. Foo&amp;Bar"], + ["Foo&Bar\nhttp://abc.com.\nFoo&Bar", "Foo&amp;Bar\nxx&.\nFoo&amp;Bar"], + ["Foo&Bar\nhttp://abc.com. http://def.com.\nFoo&Bar", + "Foo&amp;Bar\nxx&. xx&.\nFoo&amp;Bar"], +); + +for my $task (@tasks) { + my($str, $result) = @$task; + my $org = $str; + my $f = URI::Find->new(sub { return "xx&" }); + $f->find(\$str, \&simple_escape); + ok($str eq $result, "[$org => $result] vs. [$str]"); +} + +sub simple_escape { + my($toencode) = @_; + +# print "Escaping $toencode\n"; + $toencode =~ s{&}{&amp;}gso; +# print "Result: $toencode\n"; + return $toencode; +}
Guys, it's been more than three months since I've submitted this patch, any chance that you could apply it? If you don't have time, I'd be more than happy to apply it myself if you grant me co-ownership of the module. Thanks! -- Mike