Subject: XML::SAX::PurePerl incorrectly (and silently) parses CDATA if the end marker is split between buffers.
Date: Wed, 16 Jul 2014 18:22:38 -0400
To: bug-XML-SAX [...] rt.cpan.org
From: Vadim Grinshpun <vgrinshp [...] akamai.com>
Hi,
There's a bug in XML::SAX::PurePerl (XML-SAX-0.99) that causes CDATA
sections to be misparsed if the terminating "]]>" straddles a buffer
boundary, e.g., if the buffer ends with "]" or "]]" and the rest of the
delimiter has not yet been read in. It looks like the partial delimiter
characters are simply discarded, and the end of CDATA is not detected.
If no further CDATA section occurs, an exception is thrown.
If additional CDATA sections exist, their end marker will likely be
seen, and the failure will be silent, but the parsed content will be
wrong (i.e., all XML between the start of the first CDATA section and
the end of the next one will be considered part of the first CDATA
section).
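For illustration only, here is a minimal sketch of how a chunked scanner could carry a partial delimiter across a buffer boundary instead of dropping it. The name find_cdata_end and the list-of-chunks interface are hypothetical; this is not code from XML::SAX::PurePerl and is not a proposed patch, just the general technique.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of XML::SAX::PurePerl): scan a list of
# buffer chunks for the CDATA end marker "]]>". A trailing "]" or "]]"
# is held back and re-scanned together with the next chunk, so a marker
# split across two chunks is still found.
sub find_cdata_end {
    my @chunks  = @_;
    my $pending = '';    # possible start of the marker, carried over
    my $content = '';    # CDATA content confirmed so far

    for my $chunk (@chunks) {
        my $buf = $pending . $chunk;
        my $pos = index( $buf, ']]>' );
        if ( $pos >= 0 ) {
            # Marker found: everything before it is content.
            return $content . substr( $buf, 0, $pos );
        }
        if ( $buf =~ /(\]{1,2})\z/ ) {
            # Buffer ends with "]" or "]]": keep it for the next round.
            $pending = $1;
            $content .= substr( $buf, 0, length($buf) - length($pending) );
        }
        else {
            $pending = '';
            $content .= $buf;
        }
    }
    return;    # end marker never seen (undef in scalar context)
}
```

With this sketch, find_cdata_end("ZZZ]", "]>rest") and find_cdata_end("ZZ]]", ">x") both locate the marker and return only the "Z" characters, which is the behavior the parser appears to be missing.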
Here's a quick script to reproduce the behavior (forcing the former
case; to make it the latter, add one more CDATA section to $xml):
#!/usr/bin/perl
use strict;
use warnings;
use lib ( "$ENV{HOME}/downloads/XML-SAX-0.99/" );
use XML::Simple;
$ENV{XML_SIMPLE_PREFERRED_PARSER}="XML::SAX::PurePerl";
my $print = 1;
# Sweep the offset of "]]>" so that it eventually straddles a buffer boundary.
for my $buflen ( 1000 .. 5000 )
{
    my $xml = "<x><a><![CDATA[";
    my $remain = $buflen - length($xml);
    my $filler = 'Z' x $remain;
    $xml .= $filler;
    $xml .= "]]></a></x>";
    eval
    {
        my $x = XMLin( $xml, forcearray => 1, keyattr => [] );
        if ( $print ) # this is just to prove I'm using the expected pureperl version
        {
            print $INC{'XML/SAX/PurePerl.pm'} . "\n";
            $print = 0;
        }
        my $parsed = $x->{a};
        for my $v ( @$parsed )
        {
            # The CDATA content is all 'Z'; anything else is a misparse.
            if ( $v =~ /[^Z]/ )
            {
                die "v: '$v'\n";
            }
        }
    };
    if ( $@ )
    {
        die "misparse: buflen: $buflen\n\nxml: '$xml'\n\n$@\n";
    }
}
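
To get the silent variant mentioned before the script, the input needs a second CDATA section. The following is only a sketch of how $xml could be built for that case. The helper name build_two_cdata_xml and the length of the second section are assumptions. The second section also contains only 'Z', so the /[^Z]/ check in the script would still flag a merged value, and a correct parse should yield two <a> entries rather than one.

```perl
use strict;
use warnings;

# Hypothetical helper: same layout as the script above, but with a second
# CDATA section whose "]]>" a parser could latch onto after missing the
# first one. The first "]]>" still starts at offset $buflen.
sub build_two_cdata_xml {
    my ($buflen) = @_;
    my $xml = "<x><a><![CDATA[";
    $xml .= 'Z' x ( $buflen - length($xml) );
    $xml .= "]]></a>";
    $xml .= "<a><![CDATA[ZZZ]]></a></x>";    # assumed second section
    return $xml;
}

my $xml = build_two_cdata_xml(1000);
```

In the loop above, this would replace the construction of $xml.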