Subject: charset decoding is broken
In 0.15, when reading from a stream, XML::SAX::PurePerl does not decode
the first 4096 bytes into Perl's internal UTF-8 representation,
although it sets the filehandle's PerlIO encoding to UTF-8.
This is a regression from 0.12. The problem is that the Unicode version
of XML::SAX::PurePerl::Reader::switch_encoding_string() uses
Encode::from_to(), which does not set the Perl internal UTF-8 flag.
Replacing it with e.g. Encode::decode() fixes the bug. Only the bytes
that are read into the buffer before the PerlIO encoding is set are
affected.
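
To illustrate the difference between the two calls, here is a minimal
sketch. The sample octets and variable names are made up for this
example and are not taken from the module.

```perl
use strict;
use warnings;
use Encode ();

# Sample input: "e acute" (U+00E9) as two UTF-8 octets.
my $octets = "\xC3\xA9";

# from_to() converts in place between two byte encodings,
# so the result is still a string of octets.
my $in_place = $octets;
Encode::from_to($in_place, 'UTF-8', 'UTF-8');
my $flag_after_from_to = Encode::is_utf8($in_place) ? 1 : 0;
my $len_after_from_to  = length $in_place;

# decode() returns a character string in Perl's internal form.
my $chars = Encode::decode('UTF-8', $octets);
my $flag_after_decode = Encode::is_utf8($chars) ? 1 : 0;
my $len_after_decode  = length $chars;

printf "from_to: utf8 flag=%d length=%d\n",
    $flag_after_from_to, $len_after_from_to;   # flag=0 length=2
printf "decode:  utf8 flag=%d length=%d\n",
    $flag_after_decode, $len_after_decode;     # flag=1 length=1
```
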
With the fix to SAX/PurePerl/Reader/UnicodeExt.pm, there's a test
failure in t/14encoding.t. It turns out that there are bugs in
XML::SAX::PurePerl::Productions: the $NameChar regexp shouldn't use
$Letter, since that contains beginning and end anchors (^ and $). In
fact, it looks like the $Letter production is unused now and $NameChar
shouldn't have any anchors either. (It also looks like the binding of
the anchors is broken, since /^a|b$/ means (/^a/ || /b$/), not /^(a|b)$/.)
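
The anchor binding can be checked with a short sketch. The patterns
here are simplified stand-ins, not the actual productions from the
module.

```perl
use strict;
use warnings;

my $loose  = qr/^a|b$/;       # parsed as (^a) | (b$)
my $strict = qr/^(?:a|b)$/;   # anchors apply to the whole alternation

my %result;
for my $s (qw(a b axx xxb xx)) {
    # Record [loose, strict] match results for each sample string.
    $result{$s} = [ ($s =~ $loose ? 1 : 0), ($s =~ $strict ? 1 : 0) ];
    printf "%-3s loose=%d strict=%d\n", $s, @{ $result{$s} };
}
```

Here "axx" and "xxb" match the loose pattern but not the strict one,
while "a" and "b" match both and "xx" matches neither.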
I'm attaching a proposed patch that adds a testcase for these issues and
fixes them for me. The tests pass for me on 0.12 and fail on 0.15. I
haven't tested on an old non-Unicode Perl; this is on Perl 5.8.8 on
Debian Etch (4.0).
I'm a bit uneasy that switch_encoding_string() can't be called twice now
without a fatal error, but I'm not sure what the best thing to do is.
Maybe just make it a no-op if the new charset is UTF-8 and
Encode::is_utf8 is set? I suppose it has never worked if the charset is
not UTF-8 on the second call....
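
A rough sketch of that no-op idea, assuming a standalone helper. The
name decode_unless_decoded() is hypothetical and not part of
XML::SAX::PurePerl, and the real method works on the reader's buffer
rather than on an argument.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, for illustration only.
sub decode_unless_decoded {
    my ($buffer, $charset) = @_;

    # Already a character string and UTF-8 requested again: do nothing.
    return $buffer
        if Encode::is_utf8($buffer) && $charset =~ /^utf-?8$/i;

    return Encode::decode($charset, $buffer);
}

my $octets = "\xE2\x98\xBA";    # UTF-8 octets for U+263A
my $once   = decode_unless_decoded($octets, 'UTF-8');
my $twice  = decode_unless_decoded($once,   'UTF-8');  # returned unchanged
```

This only covers the UTF-8 case; a second call with a different
charset would still reach Encode::decode() on already-decoded data.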
FWIW, this issue has caused Debian bug #405186,
http://bugs.debian.org/405186 .
Please let me know if you need more information; I'll be happy to help
in any way I can.
Cheers,
--
Niko Tyni
ntyni@iki.fi
Subject: XML-SAX-patch-new