Bug #71658 for URI-Find: Escape function gets splitted up strings if a possible (but not) URL exists in string

Thu Oct 13 11:03:35 2011 mendoza [...] pvv.ntnu.no - Ticket created

Subject:

Replacement function does not get URIs not considered valid

Finds URIs that are not considered URIs, but does not call callback and the surrounding text callback does not get the invalid URI part I might be doing something here, but it seems something is going wrong, as if it half-way finds a URI, but not really: perl -MURI::Find -wle 'my $f = URI::Find->new(sub{ print "never called"; }); my $s="no urls here wtf:? "; $f->find(\$s, sub { print "(@_)" });' (no urls here ) ( )

Thu Oct 13 14:32:21 2011 mschwern [...] cpan.org - Correspondence added

That is correct. The function passed to find() is used to transform the surrounding text. It's useful if you're going to transform text into HTML. The last example in the docs shows that use. You're right that it lost "wtf:?". That's a bug.

Thu Oct 13 14:32:23 2011 The RT System itself - Status changed from 'new' to 'open'

Thu Oct 13 14:52:30 2011 mendoza [...] pvv.ntnu.no - Correspondence added

to. 13. okt. 2011 20.32.21 skrev MSCHWERN: Show quoted text

> That is correct. The function passed to find() is used to transform

the Show quoted text

> surrounding text. It's useful if you're going to transform text into > HTML. The last example in the docs shows that use. > > You're right that it lost "wtf:?". That's a bug.

It didn't really lose it, as if you print out the original string, it's there. BUT, the problem happens when you are in a "maybe_uri"-stage, but realize it's not. Then both the prefix and postfix text together with the maybe_uri part should be sent to the escape_func concatenated together. In my case a smiley "Am I allowed to :smile:? Haha." ends up as passing "Am I allowed to :", " Haha" to the escape func in turns. Even passing "Am I allowed to :", "smile:?", " Haha" in turns would be wrong (It wouldn't find ":smile:" anymore). I guess my suggestion would be to have some sort of return value for _uri_filter and also check the result of the recursive $self->find call before deciding to escape the parts separately, or as a whole string. Would you want me to provide a patch, or do you prefer to look at it yourself?

Thu Oct 13 16:00:19 2011 schwern [...] pobox.com - Correspondence added

Subject:	Re: [rt.cpan.org #71658] Replacement function does not get URIs not considered valid
Date:	Thu, 13 Oct 2011 13:00:00 -0700
To:	bug-URI-Find [...] rt.cpan.org
From:	Michael G Schwern <schwern [...] pobox.com>

On 2011.10.13 11:52 AM, Nicolas Mendoza via RT wrote: Show quoted text

>> You're right that it lost "wtf:?". That's a bug.

> > It didn't really lose it, as if you print out the original string, it's > there.

Yes. It should be sent to the transform callback like any other text. Show quoted text

> BUT, the problem happens when you are in a "maybe_uri"-stage, but > realize it's not. Then both the prefix and postfix text together with > the maybe_uri part should be sent to the escape_func concatenated > together. > > In my case a smiley "Am I allowed to :smile:? Haha." ends up as passing > "Am I allowed to :", " Haha" to the escape func in turns. Even passing > "Am I allowed to :", "smile:?", " Haha" in turns would be wrong (It > wouldn't find ":smile:" anymore). > > I guess my suggestion would be to have some sort of return value for > _uri_filter and also check the result of the recursive $self->find call > before deciding to escape the parts separately, or as a whole string.

Yes, I was thinking just that. Show quoted text

> Would you want me to provide a patch, or do you prefer to look at it > yourself?

I would love to have more people patching URI::Find. Knock yourself out. Start with some tests and go from there. Traditional patches are fine, but preferably work in a fork on Github. https://github.com/schwern/uri-find

Fri Oct 14 14:03:48 2011 mendoza [...] pvv.ntnu.no - Correspondence added

to. 13. okt. 2011 22.00.19 skrev schwern@pobox.com: Show quoted text

> On 2011.10.13 11:52 AM, Nicolas Mendoza via RT wrote:

> >> You're right that it lost "wtf:?". That's a bug.

> > > > It didn't really lose it, as if you print out the original string,

it's Show quoted text

> > there.

> > Yes. It should be sent to the transform callback like any other text. > >

> > BUT, the problem happens when you are in a "maybe_uri"-stage, but > > realize it's not. Then both the prefix and postfix text together

with Show quoted text

> > the maybe_uri part should be sent to the escape_func concatenated > > together. > > > > In my case a smiley "Am I allowed to :smile:? Haha." ends up as

passing Show quoted text

> > "Am I allowed to :", " Haha" to the escape func in turns. Even

passing Show quoted text

> > "Am I allowed to :", "smile:?", " Haha" in turns would be wrong (It > > wouldn't find ":smile:" anymore). > > > > I guess my suggestion would be to have some sort of return value

for Show quoted text

> > _uri_filter and also check the result of the recursive $self->find

call Show quoted text

> > before deciding to escape the parts separately, or as a whole

string. Show quoted text

> > Yes, I was thinking just that. > >

> > Would you want me to provide a patch, or do you prefer to look at

it Show quoted text

> > yourself?

> > I would love to have more people patching URI::Find. Knock yourself

out. Show quoted text

> Start with some tests and go from there. > > Traditional patches are fine, but preferably work in a fork on Github. > https://github.com/schwern/uri-find

I did clone the git repo, and made a diff locally, not sure how to do pull requests et all. Mind you the patch is quite extensive, as I really could not find an easier way to ensure the corner-cases. I'm really sorry for that, but I had to turn some thing up-side-down to work properly. All tests should pass including one new I added in filter.t. I don't know how the code works in perl 5.006, but tried to stay away of using external modules and exotic functions. If the speed is really a concern I could go over things better, but anyhow at leat oyu might get an idea of what might be needed to get the escape function to get the correct text. Cheers

Subject:

uri-find.patch

diff --git a/lib/URI/Find.pm b/lib/URI/Find.pm index 7b4ba62..36f66e7 100644 --- a/lib/URI/Find.pm +++ b/lib/URI/Find.pm @@ -115,9 +115,60 @@ sub find { $self->{_uris_found} = 0; - # Don't assume http. + my @parts = $self->_get_parts($r_text); + + # Don't assume http (used in _uri_filter calling _is_uri) my $old_strict = URI::URL::strict(1); + my @reduced = shift @parts; + for my $part (@parts) { + next if ! defined $part || (ref $part && (!defined $part->{str} || $part->{str} eq '')); + + # Find out if part needs filtering, if not, make it a regular string + # ready for reducing, and later escape function + if (ref $part && $part->{state} eq 'filter') { + my $filtered = $self->_uri_filter($part->{str}); + if (defined $filtered) { + $part->{str} = $filtered; + $part->{state} = 'filtered'; + } + else { + $part = $part->{orig}; + } + } + + # Reduce part, ie. collapse strings that follow eachother so that escape + # function gets as whole strings as possible + if ( ref $part ) { + push @reduced, $part; + } + elsif ( ! ref $reduced[-1] ) { + $reduced[-1] .= $part; + } + else { + push @reduced, $part; + } + } + + URI::URL::strict($old_strict); + + my $replace = ""; + + foreach my $part (@reduced) { + next if ! defined $part; + $replace .= ref $part ? $part->{str} : $escape_func->($part); + } + + ${$r_text} = $replace; + + return $self->{_uris_found}; + +} + + +sub _get_parts { + my ($self, $r_text) = @_; + # Yes, evil. Basically, look for something vaguely resembling a URL, # then hand it off to URI::URL for examination. If it passes, throw # it to a callback and put the result in its place. @@ -127,17 +178,21 @@ sub find { my $uriRe = sprintf '(?:%s|%s)', $self->uri_re, $self->schemeless_uri_re; + my @parts = (); + $$r_text =~ s{ (.*?) (?:(<(?:URL:)?)(.+?)(>)|($uriRe)) | (.+?)$ }{ - my $replace = ''; + my $pre = $2; + my $post = $4; + if( defined $6 ) { - $replace = $escape_func->($6); + push @parts, $6; } else { my $maybe_uri = ''; - $replace = $escape_func->($1); + push @parts, $1; - if( defined $2 ) { + if( defined $pre ) { $maybe_uri = $3; my $is_uri = do { # Don't alter $1... $maybe_uri =~ s/\s+//g; @@ -145,9 +200,11 @@ sub find { }; if( $is_uri ) { - $replace .= $escape_func->($2); - $replace .= $self->_uri_filter($maybe_uri); - $replace .= $escape_func->($4); + push @parts, + $pre, + { str => $maybe_uri, orig => $3, set_at => __LINE__, state => 'filter' }, + $post + ; } else { # the whole text inside of the <...> was not a url, but @@ -156,29 +213,21 @@ sub find { $maybe_uri = $3; $maybe_uri =~ /$uriRe/; }; - if( $has_uri ) { - my $pre = $2; - my $post = $4; - do { $self->find(\$maybe_uri, $escape_func) }; - $replace .= $escape_func->($pre); - $replace .= $maybe_uri; - $replace .= $escape_func->($post); - } - else { - $replace .= $escape_func->($2.$3.$4); - } + push @parts, + $pre, + ( $has_uri ? do { $self->_get_parts(\$maybe_uri) } : $3 ), + $post, + ; } } else { - $replace .= $self->_uri_filter($5); + push @parts, { str => $5, orig => $5, set_at => __LINE__, state => 'filter' }; } } - - $replace; + ""; }gsex; - URI::URL::strict($old_strict); - return $self->{_uris_found}; + return @parts; } @@ -201,7 +250,7 @@ sub _uri_filter { } else { # False alarm - $replacement = $orig_match; + return; } # Return recrufted replacement @@ -526,7 +575,7 @@ Darren Chamberlain wrote urifind. Copyright 2000, 2009-2010 by Michael G Schwern E<lt>schwern@pobox.comE<gt>. -This program is free software; you can redistribute it and/or +This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See F<http://www.perlfoundation.org/artistic_license_1_0> diff --git a/t/filter.t b/t/filter.t index c7636e4..8a3be72 100644 --- a/t/filter.t +++ b/t/filter.t @@ -21,6 +21,7 @@ my @tasks = ( ["Foo&Bar\nhttp://abc.com.\nFoo&Bar", "Foo&Bar\nxx&.\nFoo&Bar"], ["Foo&Bar\nhttp://abc.com. http://def.com.\nFoo&Bar", "Foo&Bar\nxx&. xx&.\nFoo&Bar"], + ["noturi:& should also be escaped", "noturi:& should also be escaped"], ); for my $task (@tasks) { @@ -33,7 +34,6 @@ for my $task (@tasks) { sub simple_escape { my($toencode) = @_; - $toencode =~ s{&}{&}gso; return $toencode; }

Thu Oct 20 20:36:08 2011 mschwern [...] cpan.org - Correspondence added

I figured out a way to do it with minimal change by placing the escape function temporarily inside the object so it can be accessed by _uri_filter as needed. It covers the additional corner cases I could come up with and code paths I could trace. See https://github.com/schwern/uri-find/commit/713038724c4c04dee786ba4e0c6b84e2466a342e and https://github.com/schwern/uri-find/commit/4d5390fbe2d51033cf26a1eead141f694dd88ebd Let me know if you find more edge cases it doesn't handle. v20111020 has been released.

Thu Oct 20 20:36:09 2011 mschwern [...] cpan.org - Status changed from 'open' to 'resolved'

Sun Oct 23 12:10:23 2011 mendoza [...] pvv.ntnu.no - Correspondence added

fr. 21. okt. 2011 02.36.08 skrev MSCHWERN: Show quoted text

> I figured out a way to do it with minimal change by placing the escape > function temporarily inside the object so it can be accessed by > _uri_filter as needed. It covers the additional corner cases I could > come up with and code paths I could trace. > > See > https://github.com/schwern/uri- > find/commit/713038724c4c04dee786ba4e0c6b84e2466a342e > and > https://github.com/schwern/uri- > find/commit/4d5390fbe2d51033cf26a1eead141f694dd88ebd > > Let me know if you find more edge cases it doesn't handle. > > v20111020 has been released.

Sorry no dice, exactly because of what I tried to explain, the escape function gets the parts separately: $ perl filter_full_string.t called with: how are you : called with: smiley:? not ok 1 - expand_smilies how are you :smiley:? # Failed test 'expand_smilies how are you :smiley:?' # at filter_full_string.t line 22. # got: 'how are you :smiley:?' # expected: 'how are you <img src='smiley.gif'>?' 1..1 # Looks like you failed 1 test of 1. test file: #!/usr/bin/perl -w # Test the filter function making sure it gets full strings back, not just split parts around urls that look like urls but aren't. use strict; use Test::More 'no_plan'; use URI::Find; my @tasks = ( # Thing which looks like a URL but isn't ["how are you :smiley:?", "how are you <img src='smiley.gif'>?"], ); for my $task (@tasks) { my($str, $result) = @$task; my $org = $str; my $f = URI::Find->new(sub { return "xx&" }); $f->find(\$str, \&expand_smilies); is($str, $result, "expand_smilies $org"); } sub expand_smilies { my($str) = @_; print "called with: $str\n"; $str =~ s{\W/:smiley:\W/}{<img src='smiley.gif'>}gmx; return $str; }

Sun Oct 23 12:10:25 2011 The RT System itself - Status changed from 'resolved' to 'open'

Sun Oct 23 12:44:15 2011 mendoza [...] pvv.ntnu.no - Correspondence added

sø. 23. okt. 2011 18.10.23 skrev NICOMEN: Show quoted text

> fr. 21. okt. 2011 02.36.08 skrev MSCHWERN:

> > I figured out a way to do it with minimal change by placing the

escape Show quoted text

> > function temporarily inside the object so it can be accessed by > > _uri_filter as needed. It covers the additional corner cases I

could Show quoted text

> > come up with and code paths I could trace. > > > > See > > https://github.com/schwern/uri- > > find/commit/713038724c4c04dee786ba4e0c6b84e2466a342e > > and > > https://github.com/schwern/uri- > > find/commit/4d5390fbe2d51033cf26a1eead141f694dd88ebd > > > > Let me know if you find more edge cases it doesn't handle. > > > > v20111020 has been released.

> > Sorry no dice, exactly because of what I tried to explain, the escape > function gets the parts separately: > > $ perl filter_full_string.t > called with: how are you : > called with: smiley:? > not ok 1 - expand_smilies how are you :smiley:? > # Failed test 'expand_smilies how are you :smiley:?' > # at filter_full_string.t line 22. > # got: 'how are you :smiley:?' > # expected: 'how are you <img src='smiley.gif'>?' > 1..1 > # Looks like you failed 1 test of 1. > > test file: > > #!/usr/bin/perl -w > > # Test the filter function making sure it gets full strings back, not > just split parts around urls that look like urls but aren't. > > use strict; > > use Test::More 'no_plan'; > > use URI::Find; > > > my @tasks = ( > # Thing which looks like a URL but isn't > ["how are you :smiley:?", "how are you <img src='smiley.gif'>?"], > ); > > for my $task (@tasks) { > my($str, $result) = @$task; > my $org = $str; > my $f = URI::Find->new(sub { return "xx&" }); > $f->find(\$str, \&expand_smilies); > is($str, $result, "expand_smilies $org"); > } > > sub expand_smilies { > my($str) = @_; > print "called with: $str\n"; > $str =~ s{\W/:smiley:\W/}{<img src='smiley.gif'>}gmx; > return $str; > } > > >

Blah sorry, regexp should be: $str =~ s{(\W):(smiley):(\W)}{$1<img src='$2.gif'>$3}gmx; The point is at list as the function in this case should only be called once with: $ perl filter_full_string.t called with: how are you :smiley:? ok 1 - expand_smilies how are you :smiley:? 1..1 (Using the patch I provided here)

Tue Oct 25 21:41:50 2011 mendoza [...] pvv.ntnu.no - Broken in 20111020 added

Tue Oct 25 21:41:50 2011 mendoza [...] pvv.ntnu.no - Broken in 20100505 deleted

Tue Oct 25 21:43:07 2011 mendoza [...] pvv.ntnu.no - Subject changed from 'Replacement function does not get URIs not considered valid' to 'Escape function gets splitted up strings if a possible (but not) URL exists in string'

Thu Nov 03 15:23:47 2011 mschwern [...] cpan.org - Correspondence added

Good point. I'm not sure what to do about that, the original patch to fix it is almost a parallel system. What's probably necessary is to put the code in _uri_filter() that decrufts and does the final determination of it's really a URI back together with the escape function callbacks. As I see it, there's only one instance where this is necessary. if( $is_uri ) { $replace .= $escape_func->($2); $replace .= $self->_uri_filter($maybe_uri); $replace .= $escape_func->($4); }

Fri Nov 04 03:49:59 2011 mendoza [...] pvv.ntnu.no - Correspondence added

to. 03. nov. 2011 20.23.47 skrev MSCHWERN: Show quoted text

> Good point.

I believe this has been my point since the beginning. Show quoted text

> I'm not sure what to do about that, the original patch to > fix it is almost a parallel system. >

And it works, and I believe it's faster too, and it does thing correctly. Show quoted text

> What's probably necessary is to put the code in _uri_filter() that > decrufts and does the final determination of it's really a URI back > together with the escape function callbacks. > > As I see it, there's only one instance where this is necessary. > > if( $is_uri ) { > $replace .= $escape_func->($2); > $replace .= $self->_uri_filter($maybe_uri); > $replace .= $escape_func->($4); > }

I don't think so, it's starting to be a while now, but believe me, I tried to patch this with as little impact as possible, and I don't think it was enough to just do something locally. I think it was because of the recursive call, and not knowing if parts or all of the result of that _uri_filter() call is a replaced uri, or not. Did you try the patch? It works nicely for us on http://my.opera.com and http://dev.opera.com, (I built our own .deb package containing your module with the patch applied.) Thanks for your time ;)

Bug #71658 for URI-Find: Escape function gets splitted up strings if a possible (but not) URL exists in string

Preferred bug tracker