This queue is for tickets about the XML-SAX CPAN distribution.

Report information
The Basics
Id: 79816
Status: new
Priority: 0/
Queue: XML-SAX

People
Owner: Nobody in particular
Requestors: gerph [...] gerph.org
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 0.99
Fixed in: (no value)



Subject: XML::SAX::PurePerl does not parse properly when given non-ascii input
Hiya,

With XML::SAX::PurePerl, when the input contains UTF-8 characters, these are decoded only within the first callback, and the last characters of that callback may be corrupted if they form an incomplete UTF-8 sequence. Subsequent callbacks return the raw UTF-8 bytes without any decoding at all.

The attached file shows the problem. Essentially it tries to parse a 3-byte UTF-8 sequence repeated over and over within a CDATA section. The test code prints out the values that it received in the callbacks. The inner string is constructed simply as:

    $string .= (("\x{e2}\x{80}\x{99}" x 8) . "\n") x 96;

so that we get a lot of these sequences, all lined up in a row. The callbacks are dumped with Data::Dumper and produce:

----8<----
$VAR1 = {
          'Data' => '
'
        };
$VAR1 = {
          'Data' => "
\x{2019}\x{2019}\x{2019}\x{2019}\x{2019}\x{2019}\x{2019}\x{2019}
[snip many identical lines...]
\x{2019}\x{2019}\x{2019}\x{2019}\x{2019}\x{fffd}\x{fffd}"
        };
$VAR1 = {
          'Data' => "\x{99}\x{e2}\x{80}\x{99}\x{e2}\x{80}\x{99}
[snip...]
----8<----

As can be seen, the callback data is correct initially, but becomes corrupted at the buffer edge, and after that the parser stops decoding the UTF-8 at all.

There is a workaround: stop using XML::SAX::PurePerl by overriding $XML::SAX::ParserPackage to be any of the other parsers. But it shouldn't be necessary to /have/ to override the bundled parser because it doesn't work, obviously.

This is probably related to bug#74666, which was reported 8 months ago.
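The buffer-edge corruption described above can be reproduced outside the parser. The following is a minimal sketch, not XML::SAX::PurePerl's actual code: the helper names (chunk_bytes, decode_naive, decode_incremental) and the 4-byte chunk size are assumptions made purely for illustration. It decodes the same 3-byte sequence in fixed-size chunks, once chunk by chunk (which produces U+FFFD wherever a sequence straddles a chunk boundary) and once while carrying incomplete trailing bytes over to the next chunk, using the Encode module's FB_QUIET check mode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: split a byte string into fixed-size chunks,
# standing in for a parser's internal read buffer.
sub chunk_bytes {
    my ($bytes, $size) = @_;
    return unpack("(a$size)*", $bytes);
}

# Naive approach: decode every chunk on its own. A multi-byte sequence
# cut by a chunk boundary cannot be decoded and is replaced with U+FFFD.
sub decode_naive {
    my @chunks = @_;
    my $out = '';
    for my $chunk (@chunks) {
        my $copy = $chunk;              # keep the caller's data intact
        $out .= decode('UTF-8', $copy);
    }
    return $out;
}

# Incremental approach: with FB_QUIET, decode() stops at the first byte
# it cannot decode and leaves the unprocessed remainder in $pending,
# so an incomplete trailing sequence is completed by the next chunk.
sub decode_incremental {
    my @chunks = @_;
    my ($pending, $out) = ('', '');
    for my $chunk (@chunks) {
        $pending .= $chunk;
        $out .= decode('UTF-8', $pending, Encode::FB_QUIET);
    }
    return ($out, $pending);
}

# Same 3-byte sequence as in the report, cut into 4-byte chunks so that
# the boundaries fall inside the sequences.
my $bytes    = "\xe2\x80\x99" x 8;
my $expected = "\x{2019}" x 8;
my @chunks   = chunk_bytes($bytes, 4);

my $naive = decode_naive(@chunks);
my ($good, $leftover) = decode_incremental(@chunks);

printf "naive matches expected:       %s\n", ($naive eq $expected ? 'yes' : 'no');
printf "incremental matches expected: %s\n", ($good  eq $expected ? 'yes' : 'no');
printf "undecoded bytes left over:    %d\n", length $leftover;
```

Note that FB_QUIET also stops at genuinely malformed input, not only at a truncated sequence, so a real parser would need to tell those two cases apart; this sketch does not attempt that.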
Subject: test-xml-sax-parse.pl
#!/usr/bin/perl -w
##
# Test what happens to UTF-8 strings with the XML::SAX parser.
# ie they break. Which is amusing.

use XML::SAX;
use Data::Dumper;

package MySAXHandler;
use base qw(XML::SAX::Base);

sub characters {
    my ($self, $el) = @_;
    # We just want to see what we get called with.
    # The first callback is a newline.
    # The second is a collection of correctly decoded characters,
    # except the end is broken
    # The third is just a sequence of UTF-8 characters passed through.
    print Data::Dumper::Dumper($el);
}

$string = <<EOM;
<?xml version="1.0" encoding="utf-8"?>
<feed>
<![CDATA[
EOM
$string .= (("\x{e2}\x{80}\x{99}" x 8) . "\n") x 96;
$string .= "]]></feed>";

# Works (splits at newlines, but works even if the newlines are removed)
$XML::SAX::ParserPackage = "XML::SAX::Expat";
# Works
$XML::SAX::ParserPackage = "XML::LibXML::SAX";
# Works
$XML::SAX::ParserPackage = "XML::LibXML::SAX::Parser";
# BREAKS:
$XML::SAX::ParserPackage = "XML::SAX::PurePerl";

# create the parser
my $p = XML::SAX::ParserFactory->parser(
    Handler => MySAXHandler->new
);

# do it
$p->parse_string($string);