Bug #20486 for URI-Find: the sample using escapeHTML doesn't always work safely

Sun Jul 16 08:10:44 2006 hiranotaka [...] zng.info - Ticket created

Subject:

the sample using escapeHTML doesn't always work safely

The sample in the POD: use CGI qw(escapeHTML); $text = "<pre>\n" . escapeHTML($text) . "</pre>\n"; my $finder = URI::Find->new( sub { my($uri, $orig_uri) = @_; return qq|<a href="$uri">$orig_uri</a>|; }); $finder->find(\$text); doesn't always work safely. Entity references can be broken. For instance, if the input is "<http://www.cpan.org/>", then the output is: "<<a href="http://www.cpan.org/&gt">http://www.cpan.org/&gt</a>;"

Tue Sep 02 18:35:37 2008 MSCHILLI [...] cpan.org - Correspondence added

On Sun Jul 16 08:10:44 2006, hiranotaka wrote: Show quoted text

> The sample in the POD: > use CGI qw(escapeHTML); > $text = "<pre>\n" . escapeHTML($text) . "</pre>\n"; > my $finder = URI::Find->new( > ... > doesn't always work safely.

Neither does it work with URLs that have &s embedded. This text This text contains http://foo.com?foo=1&bar=2. results in This text contains <a ref="http://foo.com?foo=1&bar=2"> http://foo.com?foo=1&bar=2</a>. hence escaping the '&' in the HREF attribute to '&amp', which breaks the link. To get reliable results, I guess one would have to 1) escape the URLs, but remember which text regions their replacements span 2) escape the rest, carefully omitting anything that has been touched in 1). So URI::Find would need to provide these areas via an API after having performed the replacement.

Tue Sep 02 18:35:39 2008 The RT System itself - Status changed from 'new' to 'open'

Fri Sep 05 04:06:30 2008 MSCHILLI [...] cpan.org - Correspondence added

Ok, I've attached a patch that allows an optional filter function to the find() method, so you can call it like use CGI qw(escapeHTML); # ... my $how_many_found = $finder->find(\$text, \&escapeHTML); and it will both replace the URIs and escape the rest according to escapeHTML. This fixes the bug(s) described earlier. I've also included documentation and 10 test cases. Would be great if you could apply it! -- Mike

? Makefile ? blib ? patch.txt ? pm_to_blib Index: lib/URI/Find.pm =================================================================== RCS file: /home/mikeschilli/cvs/URI-Find/lib/URI/Find.pm,v retrieving revision 1.1.1.1 diff -a -u -r1.1.1.1 Find.pm --- lib/URI/Find.pm 5 Sep 2008 07:55:51 -0000 1.1.1.1 +++ lib/URI/Find.pm 5 Sep 2008 07:57:37 -0000 @@ -96,11 +96,23 @@ $text is a string to search and possibly modify with your callback. +Alternatively, C<find> can be called with a replacement function for +the rest of the text: + + use CGI qw(escapeHTML); + # ... + my $how_many_found = $finder->find(\$text, \&escapeHTML); + +will not only call the callback function for every URL found (and +perform the replacement instructions therein), but also run the rest +of the text through C<escapeHTML()>. This makes it easier to turn +plain text which contains URLs into HTML (see example below). + =cut sub find { - @_ == 2 || __PACKAGE__->badinvo; - my($self, $r_text) = @_; + @_ == 2 || @_ == 3 || __PACKAGE__->badinvo; + my($self, $r_text, $escape_func) = @_; my $urlsfound = 0; @@ -116,10 +128,16 @@ my $uriRe = sprintf '(?:%s|%s)', $self->uri_re, $self->schemeless_uri_re; - $$r_text =~ s{(<$uriRe>|$uriRe)}{ - my($orig_match) = $1; + $$r_text =~ s{(.*?)(<$uriRe>|$uriRe)|(?:(.+)$)}{ + my $orig_match = $2; + my $prematch = $1; + my $replacement = ""; + if(defined $3) { + $orig_match = ""; + $prematch = $3; + } - # A heruristic. Often you'll see things like: + # A heuristic. Often you'll see things like: # "I saw this site, http://www.foo.com, and its really neat!" # or "Foo Industries (at http://www.foo.com)" # We want to avoid picking up the trailing paren, period or comma. @@ -127,17 +145,23 @@ # not it corrects a parse mistake. $orig_match = $self->decruft($orig_match); - if( my $uri = $self->_is_uri(\$orig_match) ) { # Its a URI. + if( length $orig_match and + my $uri = $self->_is_uri(\$orig_match) ) { # Its a URI. $urlsfound++; - # Don't forget to put any cruft we accidentally matched back. - $self->recruft($self->{callback}->($uri, $orig_match)); + $replacement = $self->{callback}->($uri, $orig_match); } else { # False alarm. - # Again, don't forget the cruft. - $self->recruft($orig_match); + $replacement = $orig_match; + } + + if(defined $escape_func) { + $prematch = $escape_func->($prematch); } - }eg; + + # return concatenation of escaped prematch and recrufted URI + $prematch . $self->recruft($replacement); + }gse; URI::URL::strict($old_strict); return $urlsfound; @@ -371,13 +395,15 @@ use CGI qw(escapeHTML); - $text = "<pre>\n" . escapeHTML($text) . "</pre>\n"; my $finder = URI::Find->new( sub { my($uri, $orig_uri) = @_; return qq|<a href="$uri">$orig_uri</a>|; - }); + }, + \&escapeHTML + ); $finder->find(\$text); + print "<pre>$text</pre> =head1 CAVEATS, BUGS, ETC... Index: t/Find.t =================================================================== RCS file: /home/mikeschilli/cvs/URI-Find/t/Find.t,v retrieving revision 1.1.1.1 diff -a -u -r1.1.1.1 Find.t --- t/Find.t 5 Sep 2008 07:55:52 -0000 1.1.1.1 +++ t/Find.t 5 Sep 2008 07:57:37 -0000 @@ -244,3 +244,37 @@ $f->find(\$t); ok $val == URI::URL::strict, "URI::URL::strict $val"; } + +# Test new filter function + +BEGIN { $Total_tests += 10 } +my @tasks = ( + ["Foo&Bar http://abc.com.", "Foo&Bar xx&."], + ["http://abc.com. http://abc.com.", "xx&. xx&."], + ["http://abc.com?foo=bar&baz=foo", "xx&"], + ["& http://abc.com?foo=bar&baz=foo", "& xx&"], + ["http://abc.com?foo=bar&baz=foo &", "xx& &"], + ["Foo&Bar http://abc.com", "Foo&Bar xx&"], + ["http://abc.com. Foo&Bar", "xx&. Foo&Bar"], + ["Foo&Bar http://abc.com. Foo&Bar", "Foo&Bar xx&. Foo&Bar"], + ["Foo&Bar\nhttp://abc.com.\nFoo&Bar", "Foo&Bar\nxx&.\nFoo&Bar"], + ["Foo&Bar\nhttp://abc.com. http://def.com.\nFoo&Bar", + "Foo&Bar\nxx&. xx&.\nFoo&Bar"], +); + +for my $task (@tasks) { + my($str, $result) = @$task; + my $org = $str; + my $f = URI::Find->new(sub { return "xx&" }); + $f->find(\$str, \&simple_escape); + ok($str eq $result, "[$org => $result] vs. [$str]"); +} + +sub simple_escape { + my($toencode) = @_; + +# print "Escaping $toencode\n"; + $toencode =~ s{&}{&}gso; +# print "Result: $toencode\n"; + return $toencode; +}

Sat Dec 13 23:46:48 2008 MSCHILLI [...] cpan.org - Correspondence added

Guys, it's been more than three months since I've submitted this patch, any chance that you could apply it? If you don't have time, I'd be more than happy to apply it myself if you grant me co-ownership of the module. Thanks! -- Mike

Sat Mar 14 20:55:47 2009 mschwern [...] cpan.org - Correspondence added

It's been applied. http://github.com/schwern/uri-find/tree/master

Sat Mar 14 20:55:49 2009 mschwern [...] cpan.org - Status changed from 'open' to 'resolved'

Bug #20486 for URI-Find: the sample using escapeHTML doesn't always work safely

Preferred bug tracker