Subject: Unfortunate Recovery Moves Elements Around
Date: Fri, 7 Jan 2011 00:13:40 -0800
To: bug-xml-libxml [...] rt.cpan.org
From: "David E. Wheeler" <dwheeler [...] cpan.org>
Given this XML:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>No Closing, Man</title>
<link>http://blog.noclosing.com</link>
<language>en-us</language>
<ttl>40</ttl>
<description>This is horked.</description>
<item>
<dc:creator>Jo Mama</dc:creator>
<title>Welcome to the Jungle</title>
<description><p><span>hi</p></description>
<pubDate>Fri, 17 Dec 2010 16:35:00 +0000</pubDate>
<guid>http://blog.noclosing.com/2710</guid>
<link>http://blog.noclosing/2710.html</link>
</item>
<item>
<dc:creator>Jamie</dc:creator>
<title>Whatever</title>
<description>This is the description</description>
<pubDate>Fri, 31 Dec 2010 15:12:00 +0000</pubDate>
<guid>http://blog.noclosing.com/2722</guid>
<link>http://blog.noclosing/2722.html</link>
</item>
</channel>
</rss>
Where the closing </span> is missing on line 12, I run
use 5.12.0;
use XML::LibXML;
my $parser = XML::LibXML->new({
recover => 2,
no_network => 1,
no_blanks => 1,
no_cdata => 1,
});
$parser->recover(2);
say $parser->load_xml(string => $xml)->toString;
And XML::LibXML emits (I've run it through tidy here):
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>No Closing, Man</title>
<link>http://blog.noclosing.com</link>
<language>en-us</language>
<ttl>40</ttl>
<description>This is horked.</description>
<item>
<dc:creator>Jo Mama</dc:creator>
<title>Welcome to the Jungle</title>
<description>
<p>
<span>hi</span>
</p>
<pubDate>Fri, 17 Dec 2010 16:35:00 +0000</pubDate>
<guid>http://blog.noclosing.com/2710</guid>
<link>http://blog.noclosing/2710.html</link>
</description>
<item>
<dc:creator>Jamie</dc:creator>
<title>Whatever</title>
<description>This is the description</description>
<pubDate>Fri, 31 Dec 2010 15:12:00 +0000</pubDate>
<guid>http://blog.noclosing.com/2722</guid>
<link>http://blog.noclosing/2722.html</link>
</item>
</item>
</channel>
</rss>
Note that the closing </span> is nicely included, so it recovered that. However, the second <item> element has been moved inside the first! You can see this clearly by the nested closing </item> tags three and four lines from the bottom. The first item's <pubDate>, <guid>, and <link> have likewise ended up inside its <description>. I don't know if this is an XML::LibXML bug or a libxml2 bug, but I've attached a test case using this example.
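For anyone who wants to check the structure without eyeballing the serialized output, here is a minimal sketch of one way to do it. It is not the attached test case: the helper name item_counts and the tiny sample feed are made up for illustration, and it only assumes the documented findnodes() method with two XPath expressions.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not from the attached test case): parse in
# recovery mode and count <item> elements that are direct children of
# <channel> versus <item> elements nested inside another <item>.
sub item_counts {
    my ($xml) = @_;
    my $doc = XML::LibXML->new(recover => 2, no_network => 1)
                         ->load_xml(string => $xml);
    my @direct = $doc->findnodes('/rss/channel/item');
    my @nested = $doc->findnodes('//item//item');
    return (scalar @direct, scalar @nested);
}

# Illustrative well-formed feed; substitute the XML from this report
# to see how recovery arranged its items.
my $sample = '<rss><channel><item/><item/></channel></rss>';
my ($direct, $nested) = item_counts($sample);
print "direct=$direct nested=$nested\n";
```

A correctly structured feed with two items should give two direct items and no nested ones. The recovered output quoted above corresponds to one direct item and one nested item.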
Best,
David