Bug #115339 for URI: HTTP::Response->base does not work with data: URLs

Tue Jun 14 20:55:16 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Ticket created

Subject:

HTTP::Response->base does not work with data: URLs

The easiest way to demonstrate the problem is with LWP: $ perl -MLWP::UserAgent -le '$ua = new LWP::UserAgent; print $ua->get("data:text/plain,foo")->base;' data: HTTP::Response::base calls $uri->base($foo), which nothing with data URLs. I don’t fully understand exactly what is supposed to be calling what, but I believe the problem lies in the URI somewhere. (Apologies if this is in the wrong queue.) It is necessary for this to work, since the correct full URL for a link to an anchor should be obtainable via $response->base . "#anchor", but it fails with data:foo.

Wed Jun 15 01:13:36 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Tue Jun 14 20:55:16 2016, SPROUT wrote: Show quoted text

> The easiest way to demonstrate the problem is with LWP: > > $ perl -MLWP::UserAgent -le '$ua = new LWP::UserAgent; print $ua-

> >get("data:text/plain,foo")->base;'

> data: > > HTTP::Response::base calls $uri->base($foo), which nothing with data > URLs. I don’t fully understand exactly what is supposed to be calling > what, but I believe the problem lies in the URI somewhere. (Apologies > if this is in the wrong queue.) > > It is necessary for this to work, since the correct full URL for a > link to an anchor should be obtainable via $response->base . > "#anchor", but it fails with data:foo.

Actually, this is definitely a problem with URI.pm. A simpler example fails: $ perl5.24.0 -Ilib -le 'use URI; print new_abs URI "#anchor", "data:text/html,foo"' data:#anchor Expected output: data:text/html,foo#anchor

Wed Jun 15 01:13:36 2016 The RT System itself - Status changed from 'new' to 'open'

Wed Jun 15 01:54:20 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Wed Jun 15 01:13:36 2016, SPROUT wrote: Show quoted text

> $ perl5.24.0 -Ilib -le 'use URI; print new_abs URI "#anchor", > "data:text/html,foo"' > data:#anchor > > Expected output: > data:text/html,foo#anchor

Here is a patch. It is based on observing how web browsers behave.

Subject:

open_COpeNf03.txt

diff -Nurp URI-1.71-4MBcyf-orig/lib/URI/data.pm URI-1.71-xWPqJc/lib/URI/data.pm --- URI-1.71-4MBcyf-orig/lib/URI/data.pm 2016-01-07 19:15:54.000000000 -0800 +++ URI-1.71-xWPqJc/lib/URI/data.pm 2016-06-14 22:49:37.000000000 -0700 @@ -11,6 +11,11 @@ $VERSION = eval $VERSION; use MIME::Base64 qw(encode_base64 decode_base64); use URI::Escape qw(uri_unescape); +# No scheme is ok only if the URL is a fragment. +sub _no_scheme_ok { + defined $_[1] && $_[1] =~ /^#/; +} + sub media_type { my $self = shift; @@ -74,6 +79,23 @@ sub _uric_count } EOT +sub abs +{ + my $self = shift; + my $base = shift || Carp::croak("Missing base argument"); + + if ($self->scheme) { + return $self; + } + + # The only relative data: URLs permitted are bare fragments, so $self + # must be one of those. + + my $abs = ref $base ? $base->clone : URI->new($base); + $abs->fragment($self->fragment); + $abs; +} + 1; __END__ diff -Nurp URI-1.71-4MBcyf-orig/lib/URI.pm URI-1.71-xWPqJc/lib/URI.pm --- URI-1.71-4MBcyf-orig/lib/URI.pm 2016-01-07 19:15:54.000000000 -0800 +++ URI-1.71-xWPqJc/lib/URI.pm 2016-06-14 22:34:27.000000000 -0700 @@ -81,7 +81,7 @@ sub _init # find all funny characters and encode the bytes. $str = $class->_uric_escape($str); $str = "$scheme:$str" unless $str =~ /^$scheme_re:/o || - $class->_no_scheme_ok; + $class->_no_scheme_ok($str); my $self = bless \$str, $class; $self; } diff -Nurp URI-1.71-4MBcyf-orig/t/data.t URI-1.71-xWPqJc/t/data.t --- URI-1.71-4MBcyf-orig/t/data.t 2015-06-25 19:05:48.000000000 -0700 +++ URI-1.71-xWPqJc/t/data.t 2016-06-14 22:36:06.000000000 -0700 @@ -10,7 +10,7 @@ if ($@) { exit; } -print "1..22\n"; +print "1..25\n"; use URI; @@ -109,3 +109,14 @@ print "ok 21\n"; print "not " unless URI->new('data:;base64,%51%6D%70%76%5A%58%4A%75')->data eq "Bjoern"; print "ok 22\n"; + +print "not " unless URI->new('foo', "data") =~ "^data:"; +print "ok 23\n"; + +# Bare fragments are valid data: URLs +print "not " unless URI->new("#anchor", 'data') eq "#anchor"; +print "ok 24\n"; + +print "not " unless URI->new_abs("#anchor", "data:text/html,foo") + eq "data:text/html,foo#anchor"; +print "ok 25\n";

Wed Jun 15 02:11:43 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Wed Jun 15 01:54:20 2016, SPROUT wrote: Show quoted text

> On Wed Jun 15 01:13:36 2016, SPROUT wrote:

> > $ perl5.24.0 -Ilib -le 'use URI; print new_abs URI "#anchor", > > "data:text/html,foo"' > > data:#anchor > > > > Expected output: > > data:text/html,foo#anchor

> > Here is a patch. It is based on observing how web browsers behave.

Wait. This patch doesn’t fully solve the problem. Despite what I said in my original post (some was wrong, based on reading the source of multiple modules and not remembering which was which), HTTP::Response::base does this: # if $base is undef here, the return value is effectively # just a copy of $self->request->uri. return $HTTP::URI_CLASS->new_abs($base, $req->uri); which translates into URI->new_abs(undef, "data:,foo"). But that gives ‘data:’, which is not just a copy of "data:,foo". URI::new_abs is as follows: sub new_abs { my($class, $uri, $base) = @_; $uri = $class->new($uri, $base); $uri->abs($base); } So it does URI->new(undef, "data:,foo") in this case, which is interpreted as ->new(undef, "data"), since the second argument is just the scheme. That translates into data:, since you cannot have an empty data: URL. (It doesn’t make sense.) So it seems URI->new("data:")->abs("data:,foo") needs to treat "data:" as undef and return "data:,foo". This new patch follows that approach. I can’t say whether it is the best solution (after all, even web browsers don’t behave the same way when a data: URL contains <a href=foo>), but it works for all the cases I can think of that do make sense.

Subject:

new patch.txt

diff -Nurp URI-1.71-4MBcyf-orig/lib/URI/data.pm URI-1.71-xWPqJc/lib/URI/data.pm --- URI-1.71-4MBcyf-orig/lib/URI/data.pm 2016-01-07 19:15:54.000000000 -0800 +++ URI-1.71-xWPqJc/lib/URI/data.pm 2016-06-14 23:10:50.000000000 -0700 @@ -11,6 +11,11 @@ $VERSION = eval $VERSION; use MIME::Base64 qw(encode_base64 decode_base64); use URI::Escape qw(uri_unescape); +# No scheme is ok only if the URL is a fragment. +sub _no_scheme_ok { + defined $_[1] && $_[1] =~ /^#/; +} + sub media_type { my $self = shift; @@ -74,6 +79,23 @@ sub _uric_count } EOT +sub abs +{ + my $self = shift; + my $base = shift || Carp::croak("Missing base argument"); + + if ($self->scheme && lc $$self ne 'data:') { + return $self; + } + + # The only relative data: URLs permitted are bare fragments, so $self + # must be one of those or a plain "data:" with no content. + + my $abs = ref $base ? $base->clone : URI->new($base); + $abs->fragment($self->fragment); + $abs; +} + 1; __END__ diff -Nurp URI-1.71-4MBcyf-orig/lib/URI.pm URI-1.71-xWPqJc/lib/URI.pm --- URI-1.71-4MBcyf-orig/lib/URI.pm 2016-01-07 19:15:54.000000000 -0800 +++ URI-1.71-xWPqJc/lib/URI.pm 2016-06-14 22:34:27.000000000 -0700 @@ -81,7 +81,7 @@ sub _init # find all funny characters and encode the bytes. $str = $class->_uric_escape($str); $str = "$scheme:$str" unless $str =~ /^$scheme_re:/o || - $class->_no_scheme_ok; + $class->_no_scheme_ok($str); my $self = bless \$str, $class; $self; } diff -Nurp URI-1.71-4MBcyf-orig/t/data.t URI-1.71-xWPqJc/t/data.t --- URI-1.71-4MBcyf-orig/t/data.t 2015-06-25 19:05:48.000000000 -0700 +++ URI-1.71-xWPqJc/t/data.t 2016-06-14 23:09:03.000000000 -0700 @@ -10,7 +10,7 @@ if ($@) { exit; } -print "1..22\n"; +print "1..26\n"; use URI; @@ -109,3 +109,18 @@ print "ok 21\n"; print "not " unless URI->new('data:;base64,%51%6D%70%76%5A%58%4A%75')->data eq "Bjoern"; print "ok 22\n"; + +print "not " unless URI->new('foo', "data") =~ "^data:"; +print "ok 23\n"; + +# Bare fragments are valid data: URLs +print "not " unless URI->new("#anchor", 'data') eq "#anchor"; +print "ok 24\n"; + +print "not " unless URI->new_abs("#anchor", "data:text/html,foo") + eq "data:text/html,foo#anchor"; +print "ok 25\n"; + +print "not " unless URI->new_abs(undef, "data:text/html,foo") + eq "data:text/html,foo"; +print "ok 26\n";