Skip Menu |

Preferred bug tracker

Please visit the preferred bug tracker to report your issue.

This queue is for tickets about the URI-Find CPAN distribution.

Report information
The Basics
Id: 71658
Status: open
Priority: 0/
Queue: URI-Find

People
Owner: Nobody in particular
Requestors: mendoza [...] pvv.ntnu.no
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 20111020
Fixed in: (no value)



Subject: Replacement function does not get URIs not considered valid
Finds URIs that are not considered URIs, but does not call callback and the surrounding text callback does not get the invalid URI part I might be doing something here, but it seems something is going wrong, as if it half-way finds a URI, but not really: perl -MURI::Find -wle 'my $f = URI::Find->new(sub{ print "never called"; }); my $s="no urls here wtf:? "; $f->find(\$s, sub { print "(@_)" });' (no urls here ) ( )
That is correct. The function passed to find() is used to transform the surrounding text. It's useful if you're going to transform text into HTML. The last example in the docs shows that use. You're right that it lost "wtf:?". That's a bug.
to. 13. okt. 2011 20.32.21 skrev MSCHWERN: Show quoted text
> That is correct. The function passed to find() is used to transform
the Show quoted text
> surrounding text. It's useful if you're going to transform text into > HTML. The last example in the docs shows that use. > > You're right that it lost "wtf:?". That's a bug.
It didn't really lose it, as if you print out the original string, it's there. BUT, the problem happens when you are in a "maybe_uri"-stage, but realize it's not. Then both the prefix and postfix text together with the maybe_uri part should be sent to the escape_func concatenated together. In my case a smiley "Am I allowed to :smile:? Haha." ends up as passing "Am I allowed to :", " Haha" to the escape func in turns. Even passing "Am I allowed to :", "smile:?", " Haha" in turns would be wrong (It wouldn't find ":smile:" anymore). I guess my suggestion would be to have some sort of return value for _uri_filter and also check the result of the recursive $self->find call before deciding to escape the parts separately, or as a whole string. Would you want me to provide a patch, or do you prefer to look at it yourself?
Subject: Re: [rt.cpan.org #71658] Replacement function does not get URIs not considered valid
Date: Thu, 13 Oct 2011 13:00:00 -0700
To: bug-URI-Find [...] rt.cpan.org
From: Michael G Schwern <schwern [...] pobox.com>
On 2011.10.13 11:52 AM, Nicolas Mendoza via RT wrote: Show quoted text
>> You're right that it lost "wtf:?". That's a bug.
> > It didn't really lose it, as if you print out the original string, it's > there.
Yes. It should be sent to the transform callback like any other text. Show quoted text
> BUT, the problem happens when you are in a "maybe_uri"-stage, but > realize it's not. Then both the prefix and postfix text together with > the maybe_uri part should be sent to the escape_func concatenated > together. > > In my case a smiley "Am I allowed to :smile:? Haha." ends up as passing > "Am I allowed to :", " Haha" to the escape func in turns. Even passing > "Am I allowed to :", "smile:?", " Haha" in turns would be wrong (It > wouldn't find ":smile:" anymore). > > I guess my suggestion would be to have some sort of return value for > _uri_filter and also check the result of the recursive $self->find call > before deciding to escape the parts separately, or as a whole string.
Yes, I was thinking just that. Show quoted text
> Would you want me to provide a patch, or do you prefer to look at it > yourself?
I would love to have more people patching URI::Find. Knock yourself out. Start with some tests and go from there. Traditional patches are fine, but preferably work in a fork on Github. https://github.com/schwern/uri-find
to. 13. okt. 2011 22.00.19 skrev schwern@pobox.com: Show quoted text
> On 2011.10.13 11:52 AM, Nicolas Mendoza via RT wrote:
> >> You're right that it lost "wtf:?". That's a bug.
> > > > It didn't really lose it, as if you print out the original string,
it's Show quoted text
> > there.
> > Yes. It should be sent to the transform callback like any other text. > >
> > BUT, the problem happens when you are in a "maybe_uri"-stage, but > > realize it's not. Then both the prefix and postfix text together
with Show quoted text
> > the maybe_uri part should be sent to the escape_func concatenated > > together. > > > > In my case a smiley "Am I allowed to :smile:? Haha." ends up as
passing Show quoted text
> > "Am I allowed to :", " Haha" to the escape func in turns. Even
passing Show quoted text
> > "Am I allowed to :", "smile:?", " Haha" in turns would be wrong (It > > wouldn't find ":smile:" anymore). > > > > I guess my suggestion would be to have some sort of return value
for Show quoted text
> > _uri_filter and also check the result of the recursive $self->find
call Show quoted text
> > before deciding to escape the parts separately, or as a whole
string. Show quoted text
> > Yes, I was thinking just that. > >
> > Would you want me to provide a patch, or do you prefer to look at
it Show quoted text
> > yourself?
> > I would love to have more people patching URI::Find. Knock yourself
out. Show quoted text
> Start with some tests and go from there. > > Traditional patches are fine, but preferably work in a fork on Github. > https://github.com/schwern/uri-find
I did clone the git repo, and made a diff locally, not sure how to do pull requests et all. Mind you the patch is quite extensive, as I really could not find an easier way to ensure the corner-cases. I'm really sorry for that, but I had to turn some thing up-side-down to work properly. All tests should pass including one new I added in filter.t. I don't know how the code works in perl 5.006, but tried to stay away of using external modules and exotic functions. If the speed is really a concern I could go over things better, but anyhow at leat oyu might get an idea of what might be needed to get the escape function to get the correct text. Cheers
Subject: uri-find.patch
diff --git a/lib/URI/Find.pm b/lib/URI/Find.pm index 7b4ba62..36f66e7 100644 --- a/lib/URI/Find.pm +++ b/lib/URI/Find.pm @@ -115,9 +115,60 @@ sub find { $self->{_uris_found} = 0; - # Don't assume http. + my @parts = $self->_get_parts($r_text); + + # Don't assume http (used in _uri_filter calling _is_uri) my $old_strict = URI::URL::strict(1); + my @reduced = shift @parts; + for my $part (@parts) { + next if ! defined $part || (ref $part && (!defined $part->{str} || $part->{str} eq '')); + + # Find out if part needs filtering, if not, make it a regular string + # ready for reducing, and later escape function + if (ref $part && $part->{state} eq 'filter') { + my $filtered = $self->_uri_filter($part->{str}); + if (defined $filtered) { + $part->{str} = $filtered; + $part->{state} = 'filtered'; + } + else { + $part = $part->{orig}; + } + } + + # Reduce part, ie. collapse strings that follow eachother so that escape + # function gets as whole strings as possible + if ( ref $part ) { + push @reduced, $part; + } + elsif ( ! ref $reduced[-1] ) { + $reduced[-1] .= $part; + } + else { + push @reduced, $part; + } + } + + URI::URL::strict($old_strict); + + my $replace = ""; + + foreach my $part (@reduced) { + next if ! defined $part; + $replace .= ref $part ? $part->{str} : $escape_func->($part); + } + + ${$r_text} = $replace; + + return $self->{_uris_found}; + +} + + +sub _get_parts { + my ($self, $r_text) = @_; + # Yes, evil. Basically, look for something vaguely resembling a URL, # then hand it off to URI::URL for examination. If it passes, throw # it to a callback and put the result in its place. @@ -127,17 +178,21 @@ sub find { my $uriRe = sprintf '(?:%s|%s)', $self->uri_re, $self->schemeless_uri_re; + my @parts = (); + $$r_text =~ s{ (.*?) (?:(<(?:URL:)?)(.+?)(>)|($uriRe)) | (.+?)$ }{ - my $replace = ''; + my $pre = $2; + my $post = $4; + if( defined $6 ) { - $replace = $escape_func->($6); + push @parts, $6; } else { my $maybe_uri = ''; - $replace = $escape_func->($1); + push @parts, $1; - if( defined $2 ) { + if( defined $pre ) { $maybe_uri = $3; my $is_uri = do { # Don't alter $1... $maybe_uri =~ s/\s+//g; @@ -145,9 +200,11 @@ sub find { }; if( $is_uri ) { - $replace .= $escape_func->($2); - $replace .= $self->_uri_filter($maybe_uri); - $replace .= $escape_func->($4); + push @parts, + $pre, + { str => $maybe_uri, orig => $3, set_at => __LINE__, state => 'filter' }, + $post + ; } else { # the whole text inside of the <...> was not a url, but @@ -156,29 +213,21 @@ sub find { $maybe_uri = $3; $maybe_uri =~ /$uriRe/; }; - if( $has_uri ) { - my $pre = $2; - my $post = $4; - do { $self->find(\$maybe_uri, $escape_func) }; - $replace .= $escape_func->($pre); - $replace .= $maybe_uri; - $replace .= $escape_func->($post); - } - else { - $replace .= $escape_func->($2.$3.$4); - } + push @parts, + $pre, + ( $has_uri ? do { $self->_get_parts(\$maybe_uri) } : $3 ), + $post, + ; } } else { - $replace .= $self->_uri_filter($5); + push @parts, { str => $5, orig => $5, set_at => __LINE__, state => 'filter' }; } } - - $replace; + ""; }gsex; - URI::URL::strict($old_strict); - return $self->{_uris_found}; + return @parts; } @@ -201,7 +250,7 @@ sub _uri_filter { } else { # False alarm - $replacement = $orig_match; + return; } # Return recrufted replacement @@ -526,7 +575,7 @@ Darren Chamberlain wrote urifind. Copyright 2000, 2009-2010 by Michael G Schwern E<lt>schwern@pobox.comE<gt>. -This program is free software; you can redistribute it and/or +This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See F<http://www.perlfoundation.org/artistic_license_1_0> diff --git a/t/filter.t b/t/filter.t index c7636e4..8a3be72 100644 --- a/t/filter.t +++ b/t/filter.t @@ -21,6 +21,7 @@ my @tasks = ( ["Foo&Bar\nhttp://abc.com.\nFoo&Bar", "Foo&amp;Bar\nxx&.\nFoo&amp;Bar"], ["Foo&Bar\nhttp://abc.com. http://def.com.\nFoo&Bar", "Foo&amp;Bar\nxx&. xx&.\nFoo&amp;Bar"], + ["noturi:& should also be escaped", "noturi:&amp; should also be escaped"], ); for my $task (@tasks) { @@ -33,7 +34,6 @@ for my $task (@tasks) { sub simple_escape { my($toencode) = @_; - $toencode =~ s{&}{&amp;}gso; return $toencode; }
I figured out a way to do it with minimal change by placing the escape function temporarily inside the object so it can be accessed by _uri_filter as needed. It covers the additional corner cases I could come up with and code paths I could trace. See https://github.com/schwern/uri-find/commit/713038724c4c04dee786ba4e0c6b84e2466a342e and https://github.com/schwern/uri-find/commit/4d5390fbe2d51033cf26a1eead141f694dd88ebd Let me know if you find more edge cases it doesn't handle. v20111020 has been released.
fr. 21. okt. 2011 02.36.08 skrev MSCHWERN: Show quoted text
> I figured out a way to do it with minimal change by placing the escape > function temporarily inside the object so it can be accessed by > _uri_filter as needed. It covers the additional corner cases I could > come up with and code paths I could trace. > > See > https://github.com/schwern/uri- > find/commit/713038724c4c04dee786ba4e0c6b84e2466a342e > and > https://github.com/schwern/uri- > find/commit/4d5390fbe2d51033cf26a1eead141f694dd88ebd > > Let me know if you find more edge cases it doesn't handle. > > v20111020 has been released.
Sorry no dice, exactly because of what I tried to explain, the escape function gets the parts separately: $ perl filter_full_string.t called with: how are you : called with: smiley:? not ok 1 - expand_smilies how are you :smiley:? # Failed test 'expand_smilies how are you :smiley:?' # at filter_full_string.t line 22. # got: 'how are you :smiley:?' # expected: 'how are you <img src='smiley.gif'>?' 1..1 # Looks like you failed 1 test of 1. test file: #!/usr/bin/perl -w # Test the filter function making sure it gets full strings back, not just split parts around urls that look like urls but aren't. use strict; use Test::More 'no_plan'; use URI::Find; my @tasks = ( # Thing which looks like a URL but isn't ["how are you :smiley:?", "how are you <img src='smiley.gif'>?"], ); for my $task (@tasks) { my($str, $result) = @$task; my $org = $str; my $f = URI::Find->new(sub { return "xx&" }); $f->find(\$str, \&expand_smilies); is($str, $result, "expand_smilies $org"); } sub expand_smilies { my($str) = @_; print "called with: $str\n"; $str =~ s{\W/:smiley:\W/}{<img src='smiley.gif'>}gmx; return $str; }
sø. 23. okt. 2011 18.10.23 skrev NICOMEN: Show quoted text
> fr. 21. okt. 2011 02.36.08 skrev MSCHWERN:
> > I figured out a way to do it with minimal change by placing the
escape Show quoted text
> > function temporarily inside the object so it can be accessed by > > _uri_filter as needed. It covers the additional corner cases I
could Show quoted text
> > come up with and code paths I could trace. > > > > See > > https://github.com/schwern/uri- > > find/commit/713038724c4c04dee786ba4e0c6b84e2466a342e > > and > > https://github.com/schwern/uri- > > find/commit/4d5390fbe2d51033cf26a1eead141f694dd88ebd > > > > Let me know if you find more edge cases it doesn't handle. > > > > v20111020 has been released.
> > Sorry no dice, exactly because of what I tried to explain, the escape > function gets the parts separately: > > $ perl filter_full_string.t > called with: how are you : > called with: smiley:? > not ok 1 - expand_smilies how are you :smiley:? > # Failed test 'expand_smilies how are you :smiley:?' > # at filter_full_string.t line 22. > # got: 'how are you :smiley:?' > # expected: 'how are you <img src='smiley.gif'>?' > 1..1 > # Looks like you failed 1 test of 1. > > test file: > > #!/usr/bin/perl -w > > # Test the filter function making sure it gets full strings back, not > just split parts around urls that look like urls but aren't. > > use strict; > > use Test::More 'no_plan'; > > use URI::Find; > > > my @tasks = ( > # Thing which looks like a URL but isn't > ["how are you :smiley:?", "how are you <img src='smiley.gif'>?"], > ); > > for my $task (@tasks) { > my($str, $result) = @$task; > my $org = $str; > my $f = URI::Find->new(sub { return "xx&" }); > $f->find(\$str, \&expand_smilies); > is($str, $result, "expand_smilies $org"); > } > > sub expand_smilies { > my($str) = @_; > print "called with: $str\n"; > $str =~ s{\W/:smiley:\W/}{<img src='smiley.gif'>}gmx; > return $str; > } > > >
Blah sorry, regexp should be: $str =~ s{(\W):(smiley):(\W)}{$1<img src='$2.gif'>$3}gmx; The point is at list as the function in this case should only be called once with: $ perl filter_full_string.t called with: how are you :smiley:? ok 1 - expand_smilies how are you :smiley:? 1..1 (Using the patch I provided here)
Good point. I'm not sure what to do about that, the original patch to fix it is almost a parallel system. What's probably necessary is to put the code in _uri_filter() that decrufts and does the final determination of it's really a URI back together with the escape function callbacks. As I see it, there's only one instance where this is necessary. if( $is_uri ) { $replace .= $escape_func->($2); $replace .= $self->_uri_filter($maybe_uri); $replace .= $escape_func->($4); }
to. 03. nov. 2011 20.23.47 skrev MSCHWERN: Show quoted text
> Good point.
I believe this has been my point since the beginning. Show quoted text
> I'm not sure what to do about that, the original patch to > fix it is almost a parallel system. >
And it works, and I believe it's faster too, and it does thing correctly. Show quoted text
> What's probably necessary is to put the code in _uri_filter() that > decrufts and does the final determination of it's really a URI back > together with the escape function callbacks. > > As I see it, there's only one instance where this is necessary. > > if( $is_uri ) { > $replace .= $escape_func->($2); > $replace .= $self->_uri_filter($maybe_uri); > $replace .= $escape_func->($4); > }
I don't think so, it's starting to be a while now, but believe me, I tried to patch this with as little impact as possible, and I don't think it was enough to just do something locally. I think it was because of the recursive call, and not knowing if parts or all of the result of that _uri_filter() call is a replaced uri, or not. Did you try the patch? It works nicely for us on http://my.opera.com and http://dev.opera.com, (I built our own .deb package containing your module with the patch applied.) Thanks for your time ;)