Skip Menu |

Preferred bug tracker

Please visit the preferred bug tracker to report your issue.

This queue is for tickets about the CGI CPAN distribution.

Report information
The Basics
Id: 39122
Status: resolved
Priority: 0/
Queue: CGI

People
Owner: MARKSTOS [...] cpan.org
Requestors: GAMACHE [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: (no value)



Subject: unescapeHTML method falsely recognizes certain text as entities
CGI.pm's unescapeHTML method, intended to translate certain SGML entities into the characters they represent, will recognize anything, even including whitespace, between an ampersand and a semicolon as an entity. This results in the ampersand and semicolon being stripped. For instance, the string: Dewey, Cheatham & Howe are liars; they stole my life savings! would be improperly unescaped to read: Dewey, Cheatham Howe are liars they stole my life savings! The fix is provided by a small change to the regex, converting it from a minimal match between ampersand and semicolon, to a minimal match of non-whitespace between ampersand and semicolon. A patch for this fix is included. -pete gamache tack-weld my last name to google's mail server.
Subject: cgi.patch
Common subdirectories: CGI.pm-3.42/CGI and CGI.pm-3.42-patched/CGI diff -c CGI.pm-3.42/CGI.pm CGI.pm-3.42-patched/CGI.pm *** CGI.pm-3.42/CGI.pm Mon Sep 8 10:13:23 2008 --- CGI.pm-3.42-patched/CGI.pm Tue Sep 9 11:10:51 2008 *************** *** 2235,2241 **** my $latin = defined $self->{'.charset'} ? $self->{'.charset'} =~ /^(ISO-8859-1|WINDOWS-1252)$/i : 1; # thanks to Randal Schwartz for the correct solution to this one ! $string=~ s[&(.*?);]{ local $_ = $1; /^amp$/i ? "&" : /^quot$/i ? '"' : --- 2235,2241 ---- my $latin = defined $self->{'.charset'} ? $self->{'.charset'} =~ /^(ISO-8859-1|WINDOWS-1252)$/i : 1; # thanks to Randal Schwartz for the correct solution to this one ! $string=~ s[&(\W*?);]{ local $_ = $1; /^amp$/i ? "&" : /^quot$/i ? '"' : Common subdirectories: CGI.pm-3.42/examples and CGI.pm-3.42-patched/examples Common subdirectories: CGI.pm-3.42/t and CGI.pm-3.42-patched/t
On Tue Sep 09 11:15:09 2008, GAMACHE wrote: Show quoted text
> CGI.pm's unescapeHTML method, intended to translate certain SGML > entities into the characters they represent, will recognize anything, > even including whitespace, between an ampersand and a semicolon as an > entity. This results in the ampersand and semicolon being stripped. > For instance, the string: > > Dewey, Cheatham & Howe are liars; they stole my life savings! > > would be improperly unescaped to read: > > Dewey, Cheatham Howe are liars they stole my life savings! > > The fix is provided by a small change to the regex, converting it from a > minimal match between ampersand and semicolon, to a minimal match of > non-whitespace between ampersand and semicolon. A patch for this fix is > included.
Pete, Thanks for the patch. I have peer-reviewed and believe it should be accepted. Could you also submit a simple automated test which illustrates the bug? Mark
On Tue Sep 09 11:15:09 2008, GAMACHE wrote: Show quoted text
> CGI.pm's unescapeHTML method, intended to translate certain SGML > entities into the characters they represent, will recognize anything, > even including whitespace, between an ampersand and a semicolon as an > entity. This results in the ampersand and semicolon being stripped. > For instance, the string: > > Dewey, Cheatham & Howe are liars; they stole my life savings! > > would be improperly unescaped to read: > > Dewey, Cheatham Howe are liars they stole my life savings! > > The fix is provided by a small change to the regex, converting it from a > minimal match between ampersand and semicolon, to a minimal match of > non-whitespace between ampersand and semicolon. A patch for this fix is > included. > > -pete gamache > tack-weld my last name to google's mail server.
Pete, Automating testing reveals that the proposed patch breaks the cases which are supposed to work: is( unescapeHTML( '&'), '&', 'unescapeHTML: &'); is( unescapeHTML( '"'), '"', 'unescapeHTML: "'); is( unescapeHTML( 'Bob & Tom went to the store; Where did you go?'), 'Bob & Tom went to the store; Where did you go?', 'unescapeHTML: a case where &...; should not be escaped.'); Applying the patch makes the third test start pass, but the first two start to fail. Please re-submit a patch which makes all tests pass. More tests to other possible HTML escaping cases would be ideal as well.
Attached is a patch to the unescapeHTML method that will not match whitespace characters and that moves the third out of the TODO block. I also have this on my CGI.pm fork at http://github.com/bubaflub/CGI.pm/tree/master which you can pull from if that's easier. Show quoted text
> > > Applying the patch makes the third test start pass, but the first two > start to fail. Please re-submit a patch which makes all tests pass. More > tests to other possible HTML escaping cases would be ideal as well. > > > > >
From ccec3724afe9daa864bd6f285478fc8e19296eda Mon Sep 17 00:00:00 2001 From: Bob Kuo <bob@celect.org> Date: Sat, 15 Aug 2009 19:04:13 -0500 Subject: [PATCH] fixes RT #39122 - unescapeHTML recognizing some text as entites --- lib/CGI.pm | 2 +- t/unescapeHTML.t | 7 ++----- 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/lib/CGI.pm b/lib/CGI.pm index 775871c..ec6e8ad 100644 --- a/lib/CGI.pm +++ b/lib/CGI.pm @@ -2295,7 +2295,7 @@ sub unescapeHTML { my $latin = defined $self->{'.charset'} ? $self->{'.charset'} =~ /^(ISO-8859-1|WINDOWS-1252)$/i : 1; # thanks to Randal Schwartz for the correct solution to this one - $string=~ s[&(.*?);]{ + $string=~ s[&(\S*?);]{ local $_ = $1; /^amp$/i ? "&" : /^quot$/i ? '"' : diff --git a/t/unescapeHTML.t b/t/unescapeHTML.t index fc0f750..4cc5386 100644 --- a/t/unescapeHTML.t +++ b/t/unescapeHTML.t @@ -4,8 +4,5 @@ use CGI 'unescapeHTML'; is( unescapeHTML( '&amp;'), '&', 'unescapeHTML: &'); is( unescapeHTML( '&quot;'), '"', 'unescapeHTML: "'); -TODO: { - local $TODO = 'waiting on patch. Reference: https://rt.cpan.org/Ticket/Display.html?id=39122'; - is( unescapeHTML( 'Bob & Tom went to the store; Where did you go?'), - 'Bob & Tom went to the store; Where did you go?', 'unescapeHTML: a case where &...; should not be escaped.'); -} +is( unescapeHTML( 'Bob & Tom went to the store; Where did you go?'), + 'Bob & Tom went to the store; Where did you go?', 'unescapeHTML: a case where &...; should not be escaped.'); -- 1.6.3.1
Thanks. This is now patched in my github repo now, with credit to you.
Subject: Thanks, released
The patch for this ticket has now been released in CGI.pm 3.47, and this ticket is considered resolved. Thanks again for you help to improve CGI.pm! Mark