Bug #25499 for XML-Writer: several escaping issues (with patch against 0.602)

Fri Mar 16 19:48:13 2007 jflack [...] math.purdue.edu - Ticket created

Subject:	several escaping issues (with patch against 0.602)
Date:	Fri, 16 Mar 2007 11:53:32 -0400
To:	bug-XML-Writer [...] rt.cpan.org
From:	Chapman Flack <jflack [...] math.purdue.edu>

This patch addresses several escaping issues: Attribute value literals: Whitespace: Rev. 0.530 added a line to escape newlines in attribute values. This is necessary but not sufficient, because Attribute-Value Normalization (sec. 3.3.3) specifies that all three of the characters #xD, #xA, and #x9 will be replaced with #x20 when the output is read by a conforming parser, so to avoid data corruption all three must be escaped. Interoperability: Attribute-value escaping did not replace ]]> with >>> as MUST be done for interoperability (sec 2.4, Character Data and Markup). This was also an issue for general escaping done by $characters below. Quoting: While making these changes, it was easy enough to also choose "" or '' quoting per attribute based on which will require less escaping. General $characters: Interoperability: Attribute-value escaping did not replace ]]> with >>> as MUST be done for interoperability (sec 2.4, Character Data and Markup). This was also an issue for attribute-value escaping below. General efficiency: Some logic was duplicated between $characters, $showAttributes, and _escapeLiteral. $escapeEncoding was done as a separate step unnecessarily. The single routine $xmlEscape is now common to all uses. Since the rules for how to escape a character never change, but which characters need escaping depends on the circumstances, $xmlEscape is simply passed a compiled regexp (different per call site) that tells it which characters to escape. In the case of ASCII encoding, the nonASCII characters are simply compiled into this regexp at the time of object construction. There is no setEncoding() mutator to change the encoding after construction, so this is safe to do. All necessary escaping is now done in a single pass. Chapman Flack Mathematics, Purdue University

--- Writer.pm.orig Thu Feb 22 10:00:02 2007 +++ Writer.pm Fri Mar 2 15:30:19 2007 @@ -17,7 +17,6 @@ use IO::Handle; $VERSION = "0.602"; - ######################################################################## # Constructor. @@ -62,11 +61,10 @@ my ($checkUnencodedRepertoire, $escapeEncoding); if (lc($outputEncoding) eq 'us-ascii') { $checkUnencodedRepertoire = \&_croakUnlessASCII; - $escapeEncoding = \&_escapeASCII; + $escapeEncoding = qr/[^\x00-\x7f]/; } else { - my $doNothing = sub {}; - $checkUnencodedRepertoire = $doNothing; - $escapeEncoding = $doNothing; + $checkUnencodedRepertoire = sub {}; + $escapeEncoding = qr/(?=a)b/; # just an re that can't ever match } # Parse variables @@ -80,6 +78,23 @@ my @hasElementStack = (); my $hasHeading = 0; # Does this document have anything before the first element? + my $xmlEscape = sub { + my ( $str, $patToEscape ) = @_; + $str =~ s/$patToEscape/_escapeFor($&)/eg; + return $str; + }; + + my $attPatDQ = qr{[\t\n\r&<"]|]]>|$escapeEncoding}; + my $attPatSQ = qr{[\t\n\r&<']|]]>|$escapeEncoding}; + my $charPat = qr{[&<]|]]>|$escapeEncoding}; + + my $escapeAttribute = sub { + my ( $str ) = @_; + return ( ( $str =~ tr/"// ) > ( $str =~ tr/'// ) ) + ? "'" . &{$xmlEscape}($str,$attPatSQ) . "'" + : '"' . &{$xmlEscape}($str,$attPatDQ) . '"'; + }; + # # Private method to show attributes. # @@ -88,10 +103,8 @@ my $i = 1; while ($atts->[$i]) { my $aname = $atts->[$i++]; - my $value = _escapeLiteral($atts->[$i++]); - $value =~ s/\x0a/\&#10\;/g; - &{$escapeEncoding}($value); - $output->print(" $aname=\"$value\""); + my $value = &{$escapeAttribute}($atts->[$i++]); + $output->print(" $aname=$value"); # $value includes the needed quotes } }; @@ -340,14 +353,7 @@ }; my $characters = sub { - my $data = $_[0]; - if ($data =~ /[\&\<\>]/) { - $data =~ s/\&/\&amp\;/g; - $data =~ s/\</\&lt\;/g; - $data =~ s/\>/\&gt\;/g; - } - &{$escapeEncoding}($data); - $output->print($data); + $output->print(&{$xmlEscape}($_[0],$charPat)); $hasData = 1; }; @@ -738,24 +744,6 @@ } } -# -# Private: escape an attribute value literal. -# -sub _escapeLiteral { - my $data = $_[0]; - if ($data =~ /[\&\<\>\"]/) { - $data =~ s/\&/\&amp\;/g; - $data =~ s/\</\&lt\;/g; - $data =~ s/\>/\&gt\;/g; - $data =~ s/\"/\&quot\;/g; - } - return $data; -} - -sub _escapeASCII($) { - $_[0] =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge; -} - sub _croakUnlessASCII($) { if ($_[0] =~ /[^\x00-\x7F]/) { croak('Non-ASCII characters are not permitted in this part of a US-ASCII document'); @@ -770,6 +758,21 @@ } } +my %specialEscapes = ( # an re determines which of these to replace in any case + '<', '<', + '>', '>', # MAY be represented using the string > ... + '&', '&', + "'", ''', + '"', '"', + ']]>', ']]>' # MUST, for compatibility, be escaped in the string ]]> +); # any other char will be &#x'd if escaping needed + +sub _escapeFor($) { + return exists($specialEscapes{$_[0]}) + ? $specialEscapes{$_[0]} + : sprintf '&#x%x;', ord($_[0]); +} + ######################################################################## # XML::Writer::Namespaces - subclass for Namespace processing.

Tue Mar 27 18:58:06 2007 joe [...] kafsemo.org - Taken

Mon May 07 19:13:45 2007 joe [...] kafsemo.org - Correspondence added

From:

JOSEPHW [...] cpan.org

On Fri Mar 16 19:48:13 2007, jflack@math.purdue.edu wrote: Show quoted text

> This patch addresses several escaping issues: > > Attribute value literals: > > Whitespace:

Good catch; I'm escaping all three characters now. Show quoted text

> > Interoperability: > > Attribute-value escaping did not replace ]]> with >>> > as MUST be done for interoperability (sec 2.4, Character Data > and Markup). This was also an issue for general escaping > done by $characters below.

Here's the test I've added for this issue: # #25499 - ]]> must be represented as ]]< in attributes TEST: { initEnv(); $w->emptyTag('x', 'a' => ']]>'); $w->end(); checkResult("<x a=\"]]>\" />\n", "]]> must be escaped in attributes"); }; However, it's passing with the current release - have I understood the issue correctly? Show quoted text

> Quoting: > > While making these changes, it was easy enough to also > choose "" or '' quoting per attribute based on which will > require less escaping.

This is certainly a smart piece of functionality. However, I'm hesitant to make a change that would alter existing behaviour. For example, I know people shouldn't be processing output with fragile regular expressions, but I'm sure they are. One possibility is to make it optional behaviour. Show quoted text

> General efficiency: > > Some logic was duplicated between $characters, $showAttributes, > and _escapeLiteral. $escapeEncoding was done as a separate step > unnecessarily.

... Thanks - is there any chance you have micro-benchmarks for this change? (I'll write my own otherwise, but it would save time.) My main concern is obviously correctness, but I'll certainly adopt performance fixes as well, if there's a noticeable difference and it doesn't hurt compatibility with older perls.

Mon May 07 19:13:46 2007 The RT System itself - Status changed from 'new' to 'open'

Mon Jun 25 18:05:14 2007 joe [...] kafsemo.org - Correspondence added

Fixed in 0.603. I haven't adopted the compiled regexp approach because, while it's a lot cleaner than what's in there at the moment, it would break on some very old perls that otherwise work, but thanks anyway. I'll certainly consider using it in any major overhaul.

Mon Jun 25 18:05:15 2007 joe [...] kafsemo.org - Status changed from 'open' to 'resolved'