This queue is for tickets about the XML-Simple CPAN distribution.

Report information
The Basics
Id: 17687
Status: rejected
Priority: 0/
Queue: XML-Simple

People
Owner: Nobody in particular
Requestors: FBRIERE [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Meddling with $/ screws up UTF-8 decoding
I have no idea how this could even happen, but changing $/ appears to disable UTF-8 decoding.

Here's a snippet that shows how decoding usually works as expected:

    $ perl -MXML::Simple -MData::Dumper -le 'print Dumper XMLin qq{<?xml version="1.0" encoding="utf-8"?><foo>Fr\x{c3}\x{a9}d\x{c3}\x{a9}ric Bri\x{c3}\x{a8}re</foo>}'
    $VAR1 = "Fr\x{e9}d\x{e9}ric Bri\x{e8}re";

And here's the same snippet, but with -00:

    $ perl -00 -MXML::Simple -MData::Dumper -le 'print Dumper XMLin qq{<?xml version="1.0" encoding="utf-8"?><foo>Fr\x{c3}\x{a9}d\x{c3}\x{a9}ric Bri\x{c3}\x{a8}re</foo>}'
    $VAR1 = 'Frédéric Brière';

Strange. Other $/ values seem to have the same effect. However, if $/ is set back to "\n", or if -012 is specified, then everything is dandy again.

This does not occur with XML_SIMPLE_PREFERRED_PARSER=XML::Parser. But stranger still, this does not appear to be a problem with XML::LibXML either:

    $ perl -00 -MXML::LibXML -MData::Dumper -le '$parser = new XML::LibXML; print Dumper $parser->parse_string(qq{<?xml version="1.0" encoding="utf-8"?><foo>Fr\x{c3}\x{a9}d\x{c3}\x{a9}ric Bri\x{c3}\x{a8}re</foo>})->documentElement->textContent'
    $VAR1 = "Fr\x{e9}d\x{e9}ric Bri\x{e8}re";

Weird, eh?
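The report notes that the problem goes away once $/ is set back to "\n". A minimal sketch of scoping that reset with `local` follows. The helper name `with_newline_separator` is hypothetical and not part of XML::Simple, and the callback here only reports the separator it sees; in the ticket's scenario it would call XMLin instead. Whether this is an adequate workaround for every parser backend is an assumption, not something the ticket confirms.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Simple): run a callback with the
# input record separator temporarily reset to a newline.
sub with_newline_separator {
    my ($code) = @_;
    local $/ = "\n";      # the previous value is restored when this sub returns
    return $code->();
}

$/ = "";                  # paragraph mode, which is what `perl -00` selects

# In the ticket's scenario the callback would call XMLin(...) instead.
my $inside = with_newline_separator(sub { return "$/" });

print "inside the helper: ", ($inside eq "\n" ? "newline" : "something else"), "\n";
print "after the helper:  ", ($/ eq "" ? "paragraph mode" : "something else"), "\n";
```

Using `local` rather than a plain assignment keeps the caller's choice of $/ intact once the parsing call has finished.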
This is not a bug with XML::Simple - different SAX parser modules give different results.

You state that it is not a problem with XML::LibXML yet you did not test the SAX API of that module. XML::Simple does not use the DOM API that you used in your test.

Note also that there are subtle traps to beware of when parsing from a string. For example, consider this:

    perl -MDevel::Peek=Dump -e '$x = "\x{c3}"; print Dump($x)'

    SV = PV(0x814eb00) at 0x814e648
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x81645a8 "\303"\0
      CUR = 1
      LEN = 2

Note that Perl's internal representation of the string is one byte long.

Therefore when you do this,

    XMLin qq{<?xml version="1.0" encoding="utf-8"?><foo>Fr\x{c3} ... }

you're telling the XML parser that you're giving it a UTF8 encoded string but in truth, not all the characters beyond 0x7F are encoded using multiple bytes as required by UTF8.

Good luck.

Grant
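The distinction between octets and characters discussed above can be checked with the core Encode module. This is only an illustrative sketch: the `describe` helper is a hypothetical name introduced here, and the example says nothing about what any particular XML parser does with its input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: compare a string of UTF-8 octets with the
# character string obtained by decoding it.
sub describe {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);
    return {
        octet_length => length($octets),
        char_length  => length($chars),
        chars        => $chars,
    };
}

# "\x{c3}\x{a9}" is the two-octet UTF-8 encoding of U+00E9.
my $info = describe("\x{c3}\x{a9}");

printf "%d octet(s) -> %d character(s), U+%04X\n",
    $info->{octet_length}, $info->{char_length}, ord $info->{chars};
```

A lone "\x{c3}", as in the Devel::Peek example, is a single octet and is not a complete UTF-8 sequence on its own.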
Subject: Re: [rt.cpan.org #17687] Meddling with $/ screws up UTF-8 decoding
Date: Tue, 21 Feb 2006 04:05:31 -0500
To: via RT <bug-XML-Simple [...] rt.cpan.org>
From: Frédéric Brière <fbriere [...] fbriere.net>
On Tue, Feb 21, 2006 at 03:43:19AM -0500, via RT wrote:
> You state that it is not a problem with XML::LibXML yet you did not test
> the SAX API of that module. XML::Simple does not use the DOM API that
> you used in your test.
Ah. I'm still trying to wrap my head around this whole SAX thing, so I was mostly poking blindly at the dark.

> Note also that there are subtle traps to beware of when parsing from a
> string. For example, consider this:
>
> perl -MDevel::Peek=Dump -e '$x = "\x{c3}"; print Dump($x)'
>
> Note that Perl's internal representation of the string is one byte long
Of course, as \xc3 is a single byte.

> Therefore when you do this,
>
> XMLin qq{<?xml version="1.0" encoding="utf-8"?><foo>Fr\x{c3} ... }
>
> you're telling the XML parser that you're giving it a UTF8 encoded
> string but in truth, not all the characters beyond 0x7F are encoded
> using multiple bytes as required by UTF8.
But the input I provided *was* well-formed UTF-8. (For example, "\x{c3}\x{a9}" to represent U+00e9.)

BTW, I should point out that both XML::Parser and XML::LibXML go nuts when fed a string that has the utf8 flag turned on. (They feed on the internal representation of the string.) Am I right in assuming that you pass them their input as a pure Perl string, and that it's both of these modules that are at fault?

--
Frédéric Brière <*> fbriere@fbriere.net
=> <fbriere@abacom.com> IS NO MORE: <http://www.abacomsucks.com> <=
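For readers following the point about the utf8 flag, here is a small sketch using only core Perl (`utf8::upgrade`, `utf8::is_utf8`, and `Encode::encode`). It shows that the flag changes how a string is stored without changing its value, and one way to obtain explicit UTF-8 octets before handing data to other code. It does not reproduce or confirm the parser behaviour described above, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Eight characters; "\x{e9}" is below 0x100, so Perl stores the string
# as single bytes and leaves the utf8 flag off.
my $chars = "Fr\x{e9}d\x{e9}ric";

# Same characters, but stored internally as UTF-8 with the flag on.
my $upgraded = $chars;
utf8::upgrade($upgraded);

# Explicit UTF-8 octets: ten bytes, flag off.
my $octets = encode('UTF-8', $upgraded);

printf "equal as strings: %s\n", ($chars eq $upgraded ? "yes" : "no");
printf "flag on original: %s\n", (utf8::is_utf8($chars)    ? "on" : "off");
printf "flag on upgraded: %s\n", (utf8::is_utf8($upgraded) ? "on" : "off");
printf "characters: %d, octets after encode: %d\n",
    length($upgraded), length($octets);
```

Code that looks at the internal buffer rather than the string value would see different bytes for `$chars` and `$upgraded`, even though Perl considers them equal.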