Subject: Meddling with $/ screws up UTF-8 decoding
I have no idea how this could even happen, but changing $/ appears to
disable UTF-8 decoding. Here's a snippet that shows how decoding
usually works as expected:
$ perl -MXML::Simple -MData::Dumper -le 'print Dumper XMLin qq{<?xml
version="1.0" encoding="utf-8"?><foo>Fr\x{c3}\x{a9}d\x{c3}\x{a9}ric
Bri\x{c3}\x{a8}re</foo>}'
$VAR1 = "Fr\x{e9}d\x{e9}ric Bri\x{e8}re";
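For anyone unsure what "decoded" means here, this is a minimal sketch using only the core Encode module (no XML parser involved). It assumes nothing beyond the octets from the snippet above, and the variable names are mine. The \x{e9} escapes in the dump correspond to a string of characters rather than raw UTF-8 octets:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The raw UTF-8 octets, exactly as written in the XML literal above.
my $octets = "Fr\x{c3}\x{a9}d\x{c3}\x{a9}ric Bri\x{c3}\x{a8}re";

# Decoding turns each two-octet sequence into a single character.
my $chars = decode('UTF-8', $octets);

printf "octets: %d, characters: %d\n", length($octets), length($chars);
printf "third character: U+%04X\n", ord(substr($chars, 2, 1));
```

This prints 18 octets versus 15 characters, and U+00E9 for the third character, which matches the \x{e9} that Dumper shows.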
And here's the same snippet, but with -00:
$ perl -00 -MXML::Simple -MData::Dumper -le 'print Dumper XMLin
qq{<?xml version="1.0"
encoding="utf-8"?><foo>Fr\x{c3}\x{a9}d\x{c3}\x{a9}ric
Bri\x{c3}\x{a8}re</foo>}'
$VAR1 = 'Frédéric Brière';
Strange. Other $/ values seem to have the same effect. However, if $/
is set back to "\n", or if -012 is specified, then everything is dandy
again.
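Given that observation, a possible stopgap is to reset $/ only for the duration of the parse. The sketch below is just that, a sketch: with_default_rs is a hypothetical helper of my own (not part of XML::Simple), and I've kept the example free of any XML module so it only demonstrates the scoping of $/:

```perl
use strict;
use warnings;

# Hypothetical helper: run a code ref with $/ temporarily set to "\n".
# "local" restores the caller's value of $/ when the sub returns.
sub with_default_rs {
    my ($code) = @_;
    local $/ = "\n";
    return $code->();
}

# Simulate "perl -00" (paragraph mode).
$/ = "";

# Capture the value of $/ that code running inside the helper would see.
my $seen = with_default_rs(sub { $/ });

print "inside helper: ", ($seen eq "\n" ? "newline" : "other"), "\n";
print "after helper: ", ($/ eq "" ? "paragraph mode" : "other"), "\n";
```

With XML::Simple loaded, the call would presumably look like with_default_rs(sub { XMLin($xml) }). That assumes the behaviour described above holds for values of $/ set at runtime, so treat it as a workaround to try, not a fix.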
This does not occur with XML_SIMPLE_PREFERRED_PARSER=XML::Parser. But
stranger still, this does not appear to be a problem with XML::LibXML
either:
$ perl -00 -MXML::LibXML -MData::Dumper -le '$parser = new
XML::LibXML; print Dumper $parser->parse_string(qq{<?xml version="1.0"
encoding="utf-8"?><foo>Fr\x{c3}\x{a9}d\x{c3}\x{a9}ric
Bri\x{c3}\x{a8}re</foo>})->documentElement->textContent'
$VAR1 = "Fr\x{e9}d\x{e9}ric Bri\x{e8}re";
Weird, eh?