This queue is for tickets about the XML-DOM CPAN distribution.

Report information
The Basics
Id: 6293
Status: resolved
Priority: 0/
Queue: XML-DOM

People
Owner: Nobody in particular
Requestors: Thorsten.Meinl [...] informatik.uni-erlangen.de
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 1.42
Fixed in: 1.45



Subject: toString() garbles "Umlaute"
Hi,

The following setup:
- perl v5.8.0 built for i586-linux-thread-multi (SuSE 8.2)
- XML::DOM 1.42

use strict;
use XML::DOM;

my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile($ARGV[0]);
print $doc->toString();

- read in the attached file and print it back unchanged
- now the output is not valid UTF-8 anymore. The part between the quotes ("Interdisziplin....") gets garbled, the German "Umlaute" are broken.

It seems to me that the encodeText function does something wrong, but I could not track it down to the real error.

Thorsten
<?xml version="1.0" encoding="UTF-8"?> <summary>Im Rahmen des Graduiertenkollegs &#13; &#13; "Interdisziplinärer Entwurf verläßlicher Multitechnologie-Systeme", &#13; &#13; an dem die Lehrstühle Informatik II, III, IV&#13; und VII sowie der Lehrstuhl für Rechnergestützten Schaltungsentwurf (Prof.&#13; Glauert) und der Lehrstuhl für Konstruktionstechnik (Prof. Meerkamm) beteiligt&#13; sind, wird dieses Semester ein Hauptseminar angeboten. &#13; &#13; Zum Hauptseminar sind alle herzlich eingeladen.</summary>
From: fschlich [...] zedat.fu-berlin.de
I believe this is the same problem as Debian bug #324882 (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=324882).

From the Debian package, I am attaching a patch that should fix the issue, as well as a test.
Subject: encoding_test.patch
Description: testcase for encoding issues in parsing and printing to a file
Author: Martín Ferrari <tincho@debian.org>
--- /dev/null
+++ b/t/encodings.t
@@ -0,0 +1,48 @@
+#!/usr/bin/perl
+use strict;
+use utf8; # for embedded strings
+use XML::DOM;
+use Test::More tests => 16;
+use Test::NoWarnings;
+use constant TMPFILE => "test_encoding.xml";
+
+my $str =
+q(<?xml version="1.0" encoding="UTF-8"?>
+<blah>
+ <foo baz="">&#227;&#63720;</foo>
+ <bar>ﭾﭿ</bar>
+</blah>);
+
+# test 1 -- check for correct parsing of input string
+my $parser = new XML::DOM::Parser;
+my $doc = eval { $parser->parse($str); };
+ok(((not $@) && defined $doc), 'loads ok, parses str');
+
+try($doc);
+$doc->printToFile(TMPFILE);
+$doc->dispose;
+
+ok(system("xmllint", "--noout", TMPFILE) == 0, 'xmllint runs ok');
+
+my $doc2 = eval { $parser->parsefile(TMPFILE) };
+ok(((not $@) && defined $doc2), 'parses TMPFILE ok');
+
+try($doc2);
+$doc2->dispose;
+unlink TMPFILE;
+
+sub try {
+    my $doc = shift;
+    my $foo = ${$doc->getDocumentElement->getElementsByTagName("foo")}[0];
+    my $bar = ${$doc->getDocumentElement->getElementsByTagName("bar")}[0];
+    my $baz = $foo->getAttribute("baz");
+    my $footext = $foo->getFirstChild->getData;
+    my $bartext = $bar->getFirstChild->getData;
+
+    ok(utf8::is_utf8($baz), 'baz is_utf8...');
+    is($baz, "\x{E4B6}\x{E4B7}", '...and correct');
+    ok(utf8::is_utf8($footext), 'footext is_utf8...');
+    is($footext, "\xE3\x{F8E8}", '...and correct');
+    ok(utf8::is_utf8($bartext), 'bartext is_utf8');
+    is($bartext, "\x{FB7E}\x{FB7F}\x{E4B5}\x{E4B6}\x{E4B7}\x{E4B8}\x{E4B9}\x{E4BA}\x{E4BB}\x{E4BC}\x{E4BD}\x{E4BE}", 'and correct');
+}
Subject: output_encoding.patch
Description: properly encode output for printToFile
closes: #324882
Author: Gregor Herrmann <gregoa@debian.org>
--- a/lib/XML/DOM.pm
+++ b/lib/XML/DOM.pm
@@ -1218,7 +1218,8 @@ sub to_sax
 sub printToFile
 {
     my ($self, $fileName) = @_;
-    my $fh = new FileHandle ($fileName, "w") ||
+    my $encoding = $self->getXMLDecl()->getEncoding();
+    my $fh = new FileHandle ($fileName, ">:encoding($encoding)") ||
       croak "printToFile - can't open output file $fileName";
 
     $self->print ($fh);
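For readers following the patch, here is a minimal sketch of the effect an :encoding(...) layer has on output. It is not taken from XML::DOM: it uses only core Perl, writes to an in-memory file instead of going through FileHandle, and hard-codes UTF-8 where the patch reads the encoding from the XML declaration. The sample string and the helper name bytes_written are made up for illustration.

```perl
use strict;
use warnings;

# Character string with two non-ASCII characters (U+00E4 and U+00DF),
# similar to the umlauts in the reported document.
my $text = "verl\x{E4}\x{DF}lich";

# Hypothetical helper: write $text to an in-memory file opened with the
# given mode and return the raw bytes that ended up in the "file".
sub bytes_written {
    my ($mode) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "cannot open in-memory file: $!";
    print {$fh} $text;
    close($fh) or die "cannot close in-memory file: $!";
    return $buffer;
}

# No layer: comparable to opening the output file with plain mode "w".
my $plain = bytes_written('>');

# Explicit encoding layer: comparable to the mode string used in the patch.
my $encoded = bytes_written('>:encoding(UTF-8)');

# Show both results as hex bytes.
for my $pair (['without layer', $plain], ['with layer', $encoded]) {
    my ($label, $bytes) = @$pair;
    printf "%-14s %s\n", "$label:", join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
```

Without the layer, U+00E4 is written as the single byte E4, which is not valid UTF-8 even though the XML declaration says encoding="UTF-8". With the layer it is written as the two bytes C3 A4. Whether this is exactly the code path behind the toString() report above is an assumption; the sketch only illustrates the printToFile change.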
Thanks, these patches were applied in the 1.45 release.