Bug #83265 for CGI: url() incorrectly detects beginning of query string by decoding request

Mon Feb 11 10:29:41 2013 rrt [...] sc3d.org - Ticket created

Subject:

url() incorrectly unescapes request_uri

My CGI script gets a request such as: GET /~rrt/Software/DarkGlass/Why%20DarkGlass%3F HTTP/1.1 Note the "%3F" at the end of the URL: it's an escaped question mark. My script then asks for the URL using "url()", and gets: "Software/DarkGlass/Why DarkGlass" The question mark has been stripped off, because after unescaping in the line: my $request_uri = unescape($self->request_uri) || ''; the code then strips the query off: my $uri = $rewrite && $request_uri ? $request_uri : $script_name; $uri =~ s/\?.*$//s; # remove query string But this is wrong: if a query string is passed, it uses an unescaped question mark. The "unescape" routine is itself undocumented, being part of CGI::Util which is for internal use only. I'm not an expert, so I don't know if some unescaping should be being done, but it seems that at the very least, question marks should not be unescaped at that point. However, since the question mark denoting a query string should be a literal question mark character in the original request URI, it seems to be that it should be safe to remove the query string before unescaping the URI. The code in question has not been changed since the start of CGI.pm's git history, so I can't find any relevant commit logs. If I'm correct, I'm still not exactly sure what a minimum patch would be, as while the obvious thing to do would be to move the call of unescape to after the stripping of the query string, that would mean applying unescape to the value of $script_name in some circumstances, which might be OK, or might not, I don't know. But I guess $script_name shouldn't contain a query string, so in fact, stripping the query string should only be needed in the case that $uri is set to $request_uri, so it should be possible to rewrite the relevant lines from: my $request_uri = unescape($self->request_uri) || ''; my $query_str = $self->query_string; my $rewrite_in_use = $request_uri && $request_uri !~ /^\Q$script_name/; my $uri = $rewrite && $request_uri ? $request_uri : $script_name; $uri =~ s/\?.*$//s; # remove query string to: my $request_uri = $self->request_uri || ''; my $query_str = $self->query_string; $request_uri =~ s/\?.*$//s; # remove query string $request_uri = unescape($request_uri); my $rewrite_in_use = $request_uri && $request_uri !~ /^\Q$script_name/; my $uri = $rewrite && $request_uri ? $request_uri: $script_name;

Mon Feb 11 10:42:12 2013 mark [...] summersault.com - Correspondence added

Subject:	Re: [rt.cpan.org #83265] url() incorrectly unescapes request_uri
Date:	Mon, 11 Feb 2013 10:41:47 -0500
To:	bug-CGI [...] rt.cpan.org
From:	Mark Stosberg <mark [...] summersault.com>

Show quoted text

> My CGI script gets a request such as: > > GET /~rrt/Software/DarkGlass/Why%20DarkGlass%3F HTTP/1.1 > > Note the "%3F" at the end of the URL: it's an escaped question mark. > > My script then asks for the URL using "url()", and gets: > > "Software/DarkGlass/Why DarkGlass" > > The question mark has been stripped off

In this article, I describe a number of Perl solutions for handling URI escaping and unescaping: http://mark.stosberg.com/blog/2010/12/percent-encoding-uris-in-perl.html Why not run your sample URI through a few of them and see how their fare in handling encoded question marks? I think that would help clarify the standard and common behaviors. Mark

Mon Feb 11 10:42:14 2013 The RT System itself - Status changed from 'new' to 'open'

Mon Feb 11 10:49:12 2013 rrt [...] sc3d.org - Correspondence added

From:

rrt [...] sc3d.org

On Mon Feb 11 10:42:12 2013, mark@summersault.com wrote: Show quoted text

> > My CGI script gets a request such as: > > > > GET /~rrt/Software/DarkGlass/Why%20DarkGlass%3F HTTP/1.1 > > > > Note the "%3F" at the end of the URL: it's an escaped question mark. > > > > My script then asks for the URL using "url()", and gets: > > > > "Software/DarkGlass/Why DarkGlass" > > > > The question mark has been stripped off

> > In this article, I describe a number of Perl solutions for handling URI > escaping and unescaping: > > http://mark.stosberg.com/blog/2010/12/percent-encoding-uris-in-perl.html > > Why not run your sample URI through a few of them and see how their fare > in handling encoded question marks? > > I think that would help clarify the standard and common behaviors.

Possibly I wasn't clear: I don't have a problem with the way unescape works, it's just being applied too early.

Mon Feb 11 14:34:17 2013 MARKSTOS [...] cpan.org - Correspondence added

Please provide a new failing automated test for the test suite which showcases the problem.

Mon Feb 11 14:45:17 2013 rrt [...] sc3d.org - Correspondence added

From:

rrt [...] sc3d.org

On Mon Feb 11 14:34:17 2013, MARKSTOS wrote: Show quoted text

> Please provide a new failing automated test for the test suite which > showcases the problem.

Before I do that, I'd like to be confident I've understood the problem, since I'm no expert at this stuff. As it happens, I think the problem and the fix may be simpler than I previously thought: I don't think that url() should call unescape at all. The routine reads a URI from either the $self->request_url or the $script_name; it then returns a URL. So, what is it doing unescaping at all? If you agree with my analysis, then the fix is simply to remove the call to "unescape", and a failing test should be easy to write.

Mon Feb 11 15:45:08 2013 mark [...] summersault.com - Correspondence added

Subject:	Re: [rt.cpan.org #83265] url() incorrectly unescapes request_uri
Date:	Mon, 11 Feb 2013 15:44:41 -0500
To:	bug-CGI [...] rt.cpan.org
From:	Mark Stosberg <mark [...] summersault.com>

Show quoted text

> Before I do that, I'd like to be confident I've understood the problem, > since I'm no expert at this stuff. As it happens, I think the problem > and the fix may be simpler than I previously thought: I don't think that > url() should call unescape at all. The routine reads a URI from either > the $self->request_url or the $script_name; it then returns a URL. So, > what is it doing unescaping at all?

The authority on the issue would be the relevant RFCs for the CGI and HTTP protocols. I linked to the relevant resources through the article I liked above. My recollection at the moment is that that is correct for the request to be URI-unescaped as it comes it. If it was wrong to apply unescaping here, I think a few more people would have noticed in the 10+ years CGI.pm has been in use. I think it's far more likely that there's an edge case with handling encoded question marks in URIs. The other way to increase your confidence is to apply the test you develop against the other modules I reference in the article above. If all other solutions but CGI.pm produce the same result, you can be certain there's a bug in CGI.pm. Mark

Mon Feb 11 16:25:03 2013 rrt [...] sc3d.org - Correspondence added

From:

rrt [...] sc3d.org

On Mon Feb 11 15:45:08 2013, mark@summersault.com wrote: Show quoted text

> > > The authority on the issue would be the relevant RFCs for the CGI and > HTTP protocols. I linked to the relevant resources through the article I > liked above.

As far as I can tell, that article only links to RFCs that define URLs and URIs, not the RFCs for CGI and HTTP. I looked at RFC3875, the definition of CGI 1.1: http://www.ietf.org/rfc/rfc3875 It says: In all cases, the string is first processed with regard to any reserved characters present, and then the resulting data can be URL-decoded by replacing "%" escape sequences by their character values. This suggests that at least the first part of my interpretation is correct, namely, the URL should not be unescaped until after the query string has been removed. Show quoted text

> If it was wrong to apply unescaping > here, I think a few more people would have noticed in the 10+ years > CGI.pm has been in use.

I'm not so sure. First, the url() function/method to which I refer is not itself strictly covered by any spec: it's a consumer of the CGI interface, it doesn't implement it. Therefore, what it does is largely up to you. Secondly, I've had a FIXME sitting in my code for some years mentioning that '+' characters seem to be incorrectly unescaped, and wondering whether it's a bug in CGI.pm, but I never got around to investigating further until now. However, the url() method does claim to return a URL, and at present it seems it doesn't really do this, as it really returns a URL-decoded string. RFC 3986 (pointed to by your artice) says: "When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters." That seems to say pretty much the same as the CGI RFC, i.e. CGI.pm shouldn't unescape before taking the URL apart. It goes on to say: "The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time." But the unescape function also decodes characters in the reserved set, so it can't be used when it is. Finally, supporting my assertion that the URL method should not percent-decode at all, RFC3986 says: "Once produced, a URI is always in its percent-encoded form." Since the url() method/function returns a URL/URI, it shouldn't percent-decode it. (BTW, I agree with your assertion that it's best to use the terms "percent-decode" and "percent-encode", and I'll try to use them and them alone in future!) Show quoted text

> The other way to increase your confidence is to apply the test you > develop against the other modules I reference in the article above. If > all other solutions but CGI.pm produce the same result, you can be > certain there's a bug in CGI.pm.

I'm sorry, I don't really understand this suggestion. As far as I can tell, the "other modules" you refer to are escaping/unescaping modules. I am suggesting that the URL should not be unescaped: i.e. I would not call any of those modules. I don't see how not using a different module from the one I'm currently not using would change things. Apologies if I'm being dense here!

Mon Feb 11 16:35:44 2013 mark [...] summersault.com - Correspondence added

Subject:	Re: [rt.cpan.org #83265] url() incorrectly unescapes request_uri
Date:	Mon, 11 Feb 2013 16:35:15 -0500
To:	bug-cgi [...] rt.cpan.org
From:	Mark Stosberg <mark [...] summersault.com>

Thanks for your help, and the detailed follow-up research and references. I'll take look.

Fri Feb 15 08:45:07 2013 MARKSTOS [...] cpan.org - Subject changed from 'url() incorrectly unescapes request_uri' to 'url() incorrectly detects beginning of query string by decoding request_uri'

Fri Feb 15 09:11:18 2013 MARKSTOS [...] cpan.org - Taken

Fri Feb 15 10:00:30 2013 MARKSTOS [...] cpan.org - Correspondence added

I looked into this further. I found that REQUEST_URI is not actually defined in the CGI specification. However, it was returned by Apache. I suppose the spec for the expected contents of it is "whatever Apache provides". Here's some actual responses from Apache 2.2 when given URLs: Base case. Nothing interesting here: ----------------------------------- Browser URL: /file?a=b 'QUERY_STRING' => 'a=b', 'SCRIPT_URL' => '/file', 'SCRIPT_URI' => 'http://example.com/file', 'REQUEST_URI' => '/file?a=b', Including query string after unencoded '?' ------------------------------------------- This case looked suspicious because of '&' appended to the query string, which led me to the third cas. Browser URL: /file%3F?a=b 'QUERY_STRING' => '&a=b', 'SCRIPT_URL' => '/file?', 'SCRIPT_URI' => 'http://example.com/file?', 'REQUEST_URI' => '/file%3F?a=b', Query String elements after encoded and unencoded '?' ---------------------------------------------------- Whoa. Apache is behaving inconsistently here. According to 'QUERY_STRING', the query string starts the encoded question mark, but according to the SCRIPT_URL and SCRIPT_URI, it starts only after the unencoded question mark. Browser URL: /file%3Fa=b?c=d 'QUERY_STRING' => 'a=b&c=d', 'SCRIPT_URL' => '/file?a=b', 'SCRIPT_URI' => 'http://www.mark.net.adoptapet.com/file?a=b', 'REQUEST_URI' => '/file%3Fa=b?c=d', I'll discuss this inconsistenncy with Apache developers. Mark

Fri Feb 15 10:04:37 2013 rrt [...] sc3d.org - Correspondence added

From:

rrt [...] sc3d.org

On Fri Feb 15 10:00:30 2013, MARKSTOS wrote: Show quoted text

> I looked into this further.

Thanks very much for looking into this. Good luck with the Apache devs!

Fri Feb 15 12:34:31 2013 MARKSTOS [...] cpan.org - Correspondence added

Here's my post to the Apache developers list: https://mail-archives.apache.org/mod_mbox/httpd- dev/201302.mbox/%3Ckflkpv%24qnr%241%40ger.gmane.org%3E

Tue Aug 20 14:29:01 2013 MARKSTOS [...] cpan.org - Correspondence added

On Fri Feb 15 12:34:31 2013, MARKSTOS wrote: Show quoted text

> Here's my post to the Apache developers list: > > https://mail-archives.apache.org/mod_mbox/httpd- > dev/201302.mbox/%3Ckflkpv%24qnr%241%40ger.gmane.org%3E

The Apache devs clarified that there was a bug with my RewriteRule. I should have added a [B] flag. However, CGI.pm has a related bug, in that it decodes the REQUEST_URI before looking for the query string, so it will return wrong results when url() is called for a case like this, even when Rewrite is properly configured. It appears that was your original conclusion, and I see you've provided a proposed patch for it. Could you prepare some automated tests for it as well, and see if fixing this breaks anything else in the test suite? Mark

Tue Aug 20 18:53:18 2013 rrt [...] sc3d.org - Correspondence added

Subject:	Re: [rt.cpan.org #83265] url() incorrectly detects beginning of query string by decoding request_uri
Date:	Tue, 20 Aug 2013 23:52:51 +0100
To:	bug-CGI [...] rt.cpan.org
From:	Reuben Thomas <rrt [...] sc3d.org>

On 20 August 2013 19:29, Mark Stosberg via RT <bug-CGI@rt.cpan.org> wrote: Show quoted text

> The Apache devs clarified that there was a bug with my RewriteRule. I > should have added a [B] flag. > > However, CGI.pm has a related bug, in that it decodes the > REQUEST_URI before looking for the query string, so it will return wrong > results when url() is called for a case like this, even when Rewrite is > properly configured. > > It appears that was your original conclusion, and I see you've provided a > proposed patch for it. Could you prepare some automated tests for it as > well, and see if fixing this breaks anything else in the test suite? >

Thanks very much for following this up. Looking back at the bug history, I see I proposed two patches: one which moved the call to "unescape", and one which removed the call altogether (and it was the second one I ended up considering correct). Could you please confirm which behaviour you think is the right one, and why? -- http://rrt.sc3d.org

Thu May 22 08:17:39 2014 LEEJO [...] cpan.org - Correspondence added

This issue has been copied to: https://github.com/leejo/CGI.pm/issues/112 please take all future correspondence there. This ticket will remain open but please do not reply here. This ticket will be closed when the github issue is dealt with.

Sat Sep 20 11:08:34 2014 LEEJO [...] cpan.org - Correspondence added

commit b4c99a9036d50c1894ae9e2920a387abd85f790f Author: Lee Johnson <lee@givengain.ch> Date: Sat Sep 20 16:07:00 2014 +0100 resolve #112 [rt.cpan.org #83265] - strip query string first before calling unescape otherwise we incorrectly remove any previously encoded ? chars in the URI Changes | 3 +++ lib/CGI.pm | 8 ++++---- 2 files changed, 7 insertions(+), 4 deletions(-)

Sat Sep 20 11:08:35 2014 LEEJO [...] cpan.org - Status changed from 'open' to 'resolved'

Bug #83265 for CGI: url() incorrectly detects beginning of query string by decoding request_uri

Preferred bug tracker