Bug #61120 for CGI: redirect removes "&" and ";" from query strings

Mon Sep 06 13:02:08 2010 GARGAMEL [...] cpan.org - Ticket created

Subject:	redirect removes "&" and ";" from query strings

$ perl -MCGI -e '$q = new CGI; print $q->redirect("http://example.invalid/?entities_detection:&any_non_whitespace;results_in")'
Status: 302 Moved
Location: http://example.invalid/?entities_detection:any_non_whitespaceresults_in

It seems sub unescapeHTML { ... } strips "&" and ";" if there is any non-whitespace string in between.

Tue Sep 07 09:16:18 2010 mark [...] summersault.com - Correspondence added

Subject:	Re: [rt.cpan.org #61120] redirect removes "&" and ";" from query strings
Date:	Tue, 7 Sep 2010 09:15:59 -0400
To:	bug-CGI.pm [...] rt.cpan.org
From:	Mark Stosberg <mark [...] summersault.com>

On Mon, 6 Sep 2010 13:02:09 -0400 "Karlheinz Zoechling via RT" <bug-CGI.pm@rt.cpan.org> wrote:

> Mon Sep 06 13:02:08 2010: Request 61120 was acted upon.
> Transaction: Ticket created by GARGAMEL
> Queue: CGI.pm
> Subject: redirect removes "&" and ";" from query strings
> Broken in: (no value)
> Severity: (no value)
> Owner: Nobody
> Requestors: GARGAMEL@cpan.org
> Status: new
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=61120 >
>
>
> $ perl -MCGI -e '$q = new CGI; print
> $q->redirect("http://example.invalid/?entities_detection:&any_non_whitespace;results_in")'
> Status: 302 Moved
> Location:
> http://example.invalid/?entities_detection:any_non_whitespaceresults_in
>
>
> It seems sub unescapeHTML { ... } strips "&" and ";" if there is any
> non-whitespace string in between.

A suggestion for how unescapeHTML should work differently would be welcome. Looking at other modules that do HTML escaping/unescaping could give you inspiration.

Mark

Tue Sep 07 09:16:21 2010 The RT System itself - Status changed from 'new' to 'open'

Wed Sep 08 12:51:54 2010 yanick.champoux [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #61120] redirect removes "&" and ";" from query strings
Date:	Wed, 08 Sep 2010 12:51:44 -0400
To:	bug-CGI.pm [...] rt.cpan.org
From:	Yanick Champoux <yanick.champoux [...] gmail.com>

I've looked at the code of unescapeHTML. What happens is that the function finds instances of &<something>; and unescapes them to their rightful values. So '&amp;' becomes '&', '&lt;' becomes '<', and '&any_non_whitespace;' becomes... oh darn.

Would it make sense that, if we don't recognize the <something>, we leave the &<something>; unchanged in the string?

`/.
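For illustration, here is a minimal Python sketch of the two behaviours being compared. It is a model of the logic described above, not CGI.pm's Perl code. The `KNOWN` table, the regular expression, and both function names are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny entity table (not CGI.pm's actual list).
KNOWN = {"amp": "&", "lt": "<", "gt": ">", "quot": '"'}

# "&", then any run of characters that are not whitespace, "&" or ";", then ";".
ENTITY = re.compile(r"&([^\s&;]*);")


def unescape_stripping(text):
    """Model of the reported behaviour: an unrecognised name is replaced
    by the bare name, so its "&" and ";" disappear."""
    return ENTITY.sub(lambda m: KNOWN.get(m.group(1).lower(), m.group(1)), text)


def unescape_preserving(text):
    """Model of the proposal: an unrecognised name is replaced by the
    whole match, so the text is left exactly as written."""
    return ENTITY.sub(lambda m: KNOWN.get(m.group(1).lower(), m.group(0)), text)
```

With the URL from the report, `unescape_stripping` reproduces the corrupted Location value, while `unescape_preserving` returns the URL unchanged. Both still decode the recognised names.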

Wed Sep 08 15:13:30 2010 MARKSTOS [...] cpan.org - Reference to ticket #39122 added

Wed Sep 08 15:16:03 2010 mark [...] summersault.com - Correspondence added

Subject:	Re: [rt.cpan.org #61120] redirect removes "&" and ";" from query strings
Date:	Wed, 8 Sep 2010 15:15:54 -0400
To:	bug-CGI.pm [...] rt.cpan.org
From:	Mark Stosberg <mark [...] summersault.com>

Thanks for jumping in, Yanick.

> I've looked at the code of unescapeHTML. What happens is that the
> function finds instances of &<something>; and unescapes them to their
> rightful values. So '&amp;' becomes '&', '&lt;' becomes '<', and
> '&any_non_whitespace;' becomes... oh darn.

For reference, we improved this situation in RT#39122. Before that, it matched on whitespace, too. I've linked these tickets now.

> Would it make sense that, if we don't recognize the <something>, we
> leave the &<something>; unchanged in the string?

Sort of. I think that means that CGI.pm needs to know and maintain every possible HTML entity. That sounds like a pain. Could someone check how some other HTML unescaping CPAN modules solve this?

Mark

Wed Sep 08 19:20:47 2010 GARGAMEL [...] cpan.org - Correspondence added

Mark,

> A suggestion for how unescapeHTML should work differently would be
> welcome. Looking at other modules that do HTML escaping/unescaping
> could give you inspiration.

HTML::Entities, probably the most widely used module for escaping/unescaping, uses a list of named entities and their corresponding code points. Unrecognized entities are left alone.

As I understand the function unescapeHTML, it knows 2 named entities, which it handles as intended, and numeric entities, which it replaces with their corresponding chr(). IMO a useful fix would be to keep this behavior, but leave everything else alone, without stripping the "&" and ";" as it currently does in these cases.

However, and I am aware that this might open a can of worms, I was surprised that ->redirect() alters the supplied URL at all. This was not the behavior I expected. I do not know why this is done, maybe RFC compliance or some other reason? I am not asking for this behavior to be changed, because that would break backwards compatibility, but I think that a) it should be documented and b) there should be a way to turn it off and use the URL "as is", maybe with a flag in the method call, or a different method. This goes for both the HTML and the entity escaping.

Even if the reason is RFC compliance, there are so many companies around that use and require messy query strings that strictly following the RFC means trouble. For instance, I ran into this problem with URLs from a big German affiliate agency. I could work around that in my case, but that was pure chance.

Karlheinz
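The table-driven approach described above (a list of names mapped to code points, numeric references converted through chr(), everything else left alone) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the code of HTML::Entities or CGI.pm. The `NAMED` table is a hypothetical four-entry stand-in for a much longer list, and `decode` is a name invented for this example.

```python
import re

# Hypothetical short table; a real entity module carries a far longer one.
NAMED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"'}

# Decimal reference, hexadecimal reference, or a word-like name.
REFERENCE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|\w+);")


def decode(text, named=NAMED):
    """Decode recognised references; leave anything unrecognised untouched."""

    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return chr(int(body[2:], 16))  # hexadecimal code point
            if body.startswith("#"):
                return chr(int(body[1:]))      # decimal code point
        except (ValueError, OverflowError):
            return match.group(0)              # not a valid code point
        # Named reference: look it up, else keep the original text.
        return named.get(body, match.group(0))

    return REFERENCE.sub(replace, text)
```

Because the table is passed in as a parameter, supporting more names means supplying a larger table rather than changing the decoding logic.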

Sat Nov 20 17:24:04 2010 MARKSTOS [...] cpan.org - Taken

Sat Nov 20 17:40:40 2010 MARKSTOS [...] cpan.org - Correspondence added

RT-Send-CC:	yanick-pan [...] babyl.dyndns.org, lincoln.stein [...] gmail.com

I've looked into this bug further. The crux of it has to do with this line in sub redirect () {}:

    return $self->header((map {$self->unescapeHTML($_)} @o),@unescaped);

Some of the values passed to redirect() are being run through unescapeHTML. I'm stumped by this, because redirect() should not be receiving HTML as input, nor does it generate HTML as output (it generates an HTTP header). It looks to me like the call to unescapeHTML() should not be there, and should be removed.

I looked some at Mojo and HTTP::Headers, and neither of these does anything with HTML-escaping when generating headers. Further, I can't see how the current behavior would be considered helping in working towards a spec.

The unescapeHTML behavior was added in version 2.90 with the Changelog entry of simply "1. Fixed bug in redirect header handling.". Originally, it ran unescapeHTML on *all* values. Soon it was found that this behavior corrupted cookies, so it was partially backed out in the 2.92 release, when it no longer applied to cookies. I think now we are seeing that it can also corrupt URLs in some cases, and should be removed.

All that said, redirect() would apparently not have an issue if there were not also a bug in how unescapeHTML handles "&foo;" sequences when it doesn't recognize the entity name.

Yanick, Lincoln, comments?
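To make the point about headers concrete, here is a small Python sketch of a redirect header built with and without an unescaping step. It is a model for discussion, not CGI.pm's implementation. `redirect_header`, its `unescape` parameter, and `strip_entity_delimiters` are hypothetical names. The stripping function is a simplified stand-in that removes the "&" and ";" around every run of non-whitespace characters, including recognised names.

```python
import re


def strip_entity_delimiters(value):
    """Simplified stand-in for the problematic step: drops the "&" and ";"
    around any run of characters that are not whitespace, "&" or ";"."""
    return re.sub(r"&([^\s&;]*);", r"\1", value)


def redirect_header(url, unescape=None):
    """Build a bare redirect header. The URL goes into the Location line
    verbatim unless an unescape callable is supplied."""
    location = unescape(url) if unescape else url
    return "Status: 302 Moved\r\nLocation: " + location + "\r\n\r\n"
```

Calling `redirect_header(url)` leaves the query string intact. Passing `strip_entity_delimiters` as `unescape` reproduces the altered Location value from the original report.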

Thu May 22 08:08:39 2014 LEEJO [...] cpan.org - Correspondence added

This issue has been copied to: https://github.com/leejo/CGI.pm/issues/75. Please take all future correspondence there. This ticket will remain open, but please do not reply here. It will be closed when the GitHub issue is dealt with.

Fri May 23 14:28:06 2014 The RT System itself - Queue changed from CGI.pm to CGI

Sat Sep 20 09:03:19 2014 LEEJO [...] cpan.org - Correspondence added

This appears to have been fixed by 0160a4f. I have added a test case for this specific example and it is passing. Closing.

Sat Sep 20 09:03:20 2014 LEEJO [...] cpan.org - Status changed from 'open' to 'resolved'
