Subject: | XML::SAX::PurePerl causes parse_string() to crash when handling UTF-8 combining characters |
When XML::SAX is handed a string that contains UTF-8 combining
characters, and XML::SAX::PurePerl is the SAX parser, it dies with
the following error:
Cannot decode string with wide characters at
/usr/local/lib/perl/5.8.4/Encode.pm line 188.
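The same message can be produced with the core Encode module alone, without XML::SAX involved. The sketch below is only an illustration of where the message comes from, not a confirmed diagnosis of the parser: it assumes the failure is triggered by calling decode() on a string that already holds characters above 0xFF. The variable names are made up for the example, and encoding to bytes first is an untested guess at a workaround, not a fix.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# "e" followed by U+0301 (combining acute accent): a character string
# that contains a code point above 0xFF, i.e. a "wide character".
my $chars = "\x{65}\x{301}";

# decode() expects a string of bytes. Given wide characters it dies,
# so trap the error with eval and keep the message.
my $decoded = eval { decode('UTF-8', $chars) };
my $error   = $@;
print "decode failed: $error" if $error;

# Assumed workaround: convert the characters to UTF-8 bytes first,
# so that decode() only ever sees values in the range 0x00-0xFF.
my $bytes      = encode('UTF-8', $chars);
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $chars;
```

If this assumption holds, passing encode('UTF-8', $string) to parse_string() instead of the character string might avoid the crash, but that has not been verified against XML::SAX::PurePerl here.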
I've attached a short script that demonstrates this problem on my Linux
box (Debian sarge) running kernel 2.6.12-1.1372_FC3, Perl v5.8.5.
The application I'm working with, Koha (http://koha.org), is an
open-source integrated library automation system (library as in public
library). It uses the MARC::File::XML module, which in turn uses XML::SAX,
to handle bibliographic records in the MARCXML format. This bug is a major
problem for us because many of our users have records in their systems
that contain combining characters.
I'm sorry I don't have a patch; I'm still pretty new to SAX and encoding
issues in general. Thanks!
Subject: | parsercrash.pl |
#!/usr/bin/perl
use strict;
use warnings;
use XML::SAX;

my $parser = XML::SAX::ParserFactory->parser(
    Handler => MySAXHandler->new
);

binmode STDOUT, ":utf8";

# "e" followed by U+0301 (combining acute accent)
print "\x{65}\x{301}\n";

# UTF-8 BOM bytes followed by the same combining sequence; this call dies
$parser->parse_string("<xml>\xEF\xBB\xBF\x{65}\x{301}</xml>");

package MySAXHandler;
use base qw(XML::SAX::Base);

sub start_document {
    my ($self, $doc) = @_;
    # process document start event
}

sub start_element {
    my ($self, $el) = @_;
    # process element start event
}