Subject: | XML::SAX::PurePerl does not parse properly when given non-ASCII input |
Hiya,
With XML::SAX::PurePerl, when the input contains UTF-8 characters, these
are decoded only up to the first buffer boundary, and the last characters
before that boundary may be corrupted if the buffer ends in the middle of
a UTF-8 sequence. Subsequent callbacks return the raw UTF-8 bytes without
any decoding at all.
The attached file shows the problem. Essentially it tries to parse a
3-byte UTF-8 sequence repeated over and over within a CDATA section. The
test code prints the values passed to each characters() callback.
The inner string is constructed simply as:
$string .= (("\x{e2}\x{80}\x{99}" x 8) . "\n") x 96;
so that we get a lot of these sequences, all lined up in a row.
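For reference, a minimal sketch (not part of the attached test) of what
this string should decode to, using the core Encode module. The variable
names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

# Same construction as in the report: 8 copies of the bytes E2 80 99
# followed by a newline, repeated 96 times.
my $bytes = (("\x{e2}\x{80}\x{99}" x 8) . "\n") x 96;

# E2 80 99 is the UTF-8 encoding of U+2019, so a correct decode yields
# 8 characters plus a newline per line.
my $chars = Encode::decode('UTF-8', $bytes);

# Count the decoded U+2019 characters.
my $quotes = () = $chars =~ /\x{2019}/g;

printf "%d bytes, %d characters, %d x U+2019\n",
    length($bytes), length($chars), $quotes;
# prints "2400 bytes, 864 characters, 768 x U+2019"
```

A correct parser should therefore deliver 768 U+2019 characters in total
across its callbacks, with no \x{fffd} and no bare \x{e2}\x{80}\x{99}.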
The callbacks are Dumped with Data::Dumper and produce:
----8<----
$VAR1 = {
'Data' => '
'
};
$VAR1 = {
'Data' => "
\x{2019}\x{2019}\x{2019}\x{2019}\x{2019}\x{2019}\x{2019}\x{2019}
[snip many identical lines...]
\x{2019}\x{2019}\x{2019}\x{2019}\x{2019}\x{fffd}\x{fffd}"
};
$VAR1 = {
'Data' => "\x{99}\x{e2}\x{80}\x{99}\x{e2}\x{80}\x{99}
[snip...]
----8<----
As can be seen, the callback is correct initially, but at the buffer
edge becomes corrupted, and subsequently stops processing the UTF-8 at
all.
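The buffer-edge symptom can be illustrated outside the parser. The sketch
below is an assumption about the general failure mode, not a description
of what XML::SAX::PurePerl does internally. It compares decoding each
chunk on its own with carrying an incomplete trailing sequence over to the
next chunk via Encode's FB_QUIET mode. The function names are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Decode a list of byte chunks as one UTF-8 stream, carrying any
# incomplete trailing sequence over to the next chunk.
sub decode_chunks {
    my @chunks  = @_;
    my $pending = '';    # undecoded bytes left over from earlier chunks
    my $out     = '';
    for my $chunk (@chunks) {
        $pending .= $chunk;
        # With FB_QUIET, decode() stops at the first bytes it cannot
        # decode and leaves the unprocessed remainder in $pending.
        $out .= Encode::decode('UTF-8', $pending, Encode::FB_QUIET());
    }
    return $out;
}

# Naive version: decode each chunk independently, so a sequence split
# across two chunks is replaced with U+FFFD.
sub decode_chunks_naive {
    my @chunks = @_;
    return join '', map { Encode::decode('UTF-8', $_) } @chunks;
}

# Two U+2019 characters (E2 80 99 each), split inside the second one.
my @chunks = ("\x{e2}\x{80}\x{99}\x{e2}\x{80}", "\x{99}");

my $good  = decode_chunks(@chunks);
my $naive = decode_chunks_naive(@chunks);

print "carry-over: ", ($good eq "\x{2019}\x{2019}" ? "ok" : "broken"), "\n";
print "naive:      ", ($naive =~ /\x{fffd}/ ? "broken" : "ok"), "\n";
```

Note that this sketch would keep accumulating bytes in $pending if the
input were genuinely malformed rather than merely split; a real decoder
would need to bound that.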
There is a workaround: stop using XML::SAX::PurePerl by setting
$XML::SAX::ParserPackage to any of the other parsers. But it should not
be necessary to override the bundled parser just because it does not
work.
This is probably related to bug#74666, which was reported 8 months ago.
Subject: | test-xml-sax-parse.pl |
#!/usr/bin/perl -w
##
# Test what happens to UTF-8 strings with the XML::SAX parser.
# ie they break. Which is amusing.
use XML::SAX;
use Data::Dumper;
package MySAXHandler;
use base qw(XML::SAX::Base);
sub characters {
    my ($self, $el) = @_;
    # We just want to see what we get called with.
    # The first callback is a newline.
    # The second is a collection of correctly decoded characters,
    # except the end is broken
    # The third is just a sequence of UTF-8 characters passed through.
    print Data::Dumper::Dumper($el);
}
$string = <<EOM;
<?xml version="1.0" encoding="utf-8"?>
<feed>
<![CDATA[
EOM
$string .= (("\x{e2}\x{80}\x{99}" x 8) . "\n") x 96;
$string .= "]]></feed>";
# Only the last assignment below takes effect; comment out the later
# ones to try a different parser.
# Works (splits at newlines, but works even if the newlines are removed)
$XML::SAX::ParserPackage = "XML::SAX::Expat";
# Works
$XML::SAX::ParserPackage = "XML::LibXML::SAX";
# Works
$XML::SAX::ParserPackage = "XML::LibXML::SAX::Parser";
# BREAKS:
$XML::SAX::ParserPackage = "XML::SAX::PurePerl";
# create the parser
my $p = XML::SAX::ParserFactory->parser(
    Handler => MySAXHandler->new
);
# do it
$p->parse_string($string);