Subject: Error with symbols from ISO-8859-1 in UTF-8 XML file
I found the following problem.
Suppose we load the following XML file and then write it back to disk:
<root>
<p>«English Text»</p>
</root>
with the following script:
use strict;
use XML::DOM;
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ('test.xml');
$doc->printToFile ('result.xml');
We get an XML file in the wrong encoding. The symbols « and » are
written as if they were in ISO-8859-1, although any non-ISO-8859-1
symbols in other tags on other lines (Russian letters, for example)
are encoded correctly.
Moreover, if we add at least one Russian letter to the English text
( <p>«English Text» А - Я </p> ), the symbols « and » are encoded
correctly...
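The symptom looks like standard Perl string behaviour, although that this is what actually happens inside XML::DOM is only an assumption. A string whose characters all fit below U+0100 can be written as single bytes when the filehandle has no encoding layer, while a string containing any wider character is written as UTF-8 throughout. The sketch below shows this without XML::DOM; the helper name bytes_written is made up for the illustration.

```perl
use strict;
use warnings;

# Write a string to an in-memory handle that has no encoding layer
# and return the raw bytes that Perl emitted.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    {
        no warnings 'utf8';    # silence "Wide character in print"
        print {$fh} $string;
    }
    close $fh or die "Cannot close in-memory handle: $!";
    return $buffer;
}

# Only characters below U+0100: guillemets plus ASCII.
my $latin_only = "\x{AB}English Text\x{BB}";
# Same text plus one Cyrillic letter (U+0410), which needs more than one byte.
my $mixed = "\x{AB}English Text\x{BB} \x{410}";

# Hex dump of what was written: the first dump starts with "ab"
# (one byte for the guillemet), the second with "c2ab" (UTF-8).
printf "%s\n", unpack 'H*', bytes_written($latin_only);
printf "%s\n", unpack 'H*', bytes_written($mixed);
```

If the same thing happens in printToFile, it would explain why adding a single Russian letter to the paragraph changes how « and » are written.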
There is a way to work around this problem in XML::DOM 1.34:
use strict;
use XML::DOM;
use Encode;
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile ('test.xml');
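my $encoding = $doc->getXMLDecl()->getEncoding();
# Decode the serialized document from UTF-8 octets into Perl characters
my $data = decode("utf8", $doc->toString);
# Re-encode on output using the encoding named in the XML declaration
open DST, ">:encoding($encoding)", 'result2.xml'
    or die "Cannot open result2.xml: $!";
print DST $data;
close DST;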
However, this workaround does not work in 1.44, where it also gives
the wrong result. It would be good if this worked correctly without
workarounds and in all versions... :-/
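To compare the output files independently of the terminal locale, a check along these lines can be used. It is only a sketch: the helper names is_valid_utf8 and slurp_raw are made up for the illustration, and it relies only on the Encode module, not on XML::DOM.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return 1 if the given byte string is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input. Work on a copy,
    # because decode() may modify its argument when a CHECK value is given.
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Read a whole file as raw bytes, without any decoding layer.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # slurp mode
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Report on every file named on the command line.
for my $file (@ARGV) {
    printf "%s: %s\n", $file,
        is_valid_utf8(slurp_raw($file)) ? 'valid UTF-8' : 'NOT valid UTF-8';
}
```

Saved under a name such as check_utf8.pl (a hypothetical file name), it could be run as: perl check_utf8.pl result.xml result2.xml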
--------------------- Information about my system:
$ perl -v
This is perl, v5.8.8 built for i486-linux-gnu-thread-multi
$ perl -e'use XML::DOM; print $XML::DOM::VERSION,"\n";'
1.43
$ echo $LANG
ru_RU.KOI8-R
$ cat /etc/issue
Debian GNU/Linux lenny/sid \n \l