This queue is for tickets about the XML-Simple CPAN distribution.

Report information
The Basics
Id: 86766
Status: rejected
Priority: 0/
Queue: XML-Simple

People
Owner: grantm [...] cpan.org
Requestors: mittra [...] juno.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Problem related to character encoding
Date: Mon, 8 Jul 2013 01:49:00 GMT
To: bug-XML-Simple [...] rt.cpan.org
From: "Swapnajit Mitra" <mittra [...] juno.com>
Hello,

This problem may be related to Bug 36765.

I have an XML document with the following header:

<?xml version="1.0" encoding="Windows-1255" ?>

The "Windows-1255" character encoding is for Hebrew. When I try to use XMLin, I get the following error message:

Entity: line 40238: parser error : Entity 'zwj' not defined
&#9618;&#9579;£&#9555;&#9571;&#9579;ö&#9555;&#9570;&#9579;Ö&#9579;Ü&#9555;&#9557;, &#9579;£&#9555;&#9508;&cap;¼¬&#9555;&#9617;&#9579;&#8359;&#9555;&#9571;&#9579;¿ &#9579;£&#9555;&#9558;&#9579;ó&#9555;&#9619;&cap;¼½&cap;¡ï&#9579;¬ &#9579;É&#9555;&#9570;&#9579;¬-&cap;¼&#9559;&#9555;&#9557;&#9579;£-&#9579;&#8359;&#9555;&#9508;&#9579;ª&#9555;&#9617;&#9579;ò&zwj;
^

The problem arises because, instead of interpreting the Hebrew character, the parser is interpreting it as something that starts with & and ends with ; and thus as a special HTML code.

I understand from the description of Bug 36765 that this could be a parser problem, and that I may need to switch to a different parser. My questions are:

a) How do I know which parser my XML::Simple is using? I did not compile anything; I use whatever came with Strawberry Perl.
b) How do I change the parser for XML::Simple? Will simply installing either XML::SAX::Expat or XML::SAX::ExpatXS ensure that the parser changes?

A point to note: Internet Explorer reported the same bogus problem in IE 8 but not in IE 10.

--
Swapnajit Mitra
The sequence "&zwj;" is an HTML named character entity for the Unicode character U+200D ("ZERO WIDTH JOINER").

XML::Simple is not an HTML parser and although it recognises "&zwj;" as a named character entity reference, it won't be able to resolve that name to a character unless your XML document includes a DTD which defines the mapping.

If your XML document does not include a DTD then the presence of a named character entity other than the five predefined by XML (&lt;, &gt;, &amp;, &apos; and &quot;) means the document is "not well formed" and a parser module *must* reject it.

If your document does contain a DTD defining the mapping for "&zwj;" then the problem may be that the parser module you're using either doesn't support DTDs or needs some option specified in order for the DTD to be used.

To determine which parser module is the default, you could run this command:

    perl -MXML::SAX -le "print ref XML::SAX::ParserFactory->parser()"

So in summary: the problem is either with your XML, or with the parser module being called by XML::Simple.

However, I strongly recommend that you do not use XML::Simple. A much better choice would be XML::LibXML as described here:

http://www.perlmonks.org/index.pl?node_id=490846

One advantage of XML::LibXML is that it can parse both XML and HTML documents, which may allow you to work with malformed documents.

Regards
Grant
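
For illustration, here is a minimal sketch of the well-formedness rule described in this reply, using XML::LibXML. It is not taken from the ticket: the three sample documents, the element name "p" and the helper text_of() are hypothetical, and it assumes XML::LibXML is installed. The first document uses "&zwj;" with no declaration and should be rejected; the second declares the entity in an internal DTD subset; the third avoids the named entity by using a numeric character reference.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample documents (not from the ticket).
my $undeclared = '<p>a&zwj;b</p>';
my $declared   = '<!DOCTYPE p [ <!ENTITY zwj "&#x200D;"> ]>' . "\n"
               . '<p>a&zwj;b</p>';
my $numeric    = '<p>a&#x200D;b</p>';

# Hypothetical helper: returns the document's text content on success,
# or undef if the parser rejects the input (load_xml dies on a
# well-formedness error, so the call is wrapped in eval).
sub text_of {
    my ($xml) = @_;
    my $doc = eval { XML::LibXML->load_xml(string => $xml) };
    return defined $doc ? $doc->documentElement->textContent : undef;
}

my @cases = (
    [ 'undeclared &zwj;'       => $undeclared ],
    [ 'declared in DTD subset' => $declared   ],
    [ 'numeric reference'      => $numeric    ],
);

for my $case (@cases) {
    my ($label, $xml) = @$case;
    printf "%-24s %s\n", $label,
        defined text_of($xml) ? 'parsed' : 'rejected';
}
```

If the source document cannot be changed to declare the entity, replacing "&zwj;" with "&#x200D;" before parsing is one possible workaround, on the assumption that the entity is meant to carry its usual HTML meaning.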