Subject: XML vs. [X]HTML parsing
Date: Wed, 11 Feb 2009 09:35:05 -0800 (PST)
To: bug-XML-Atom [...] rt.cpan.org
From: Kevin Vargo <vargok [...] yahoo.com>
Hi,
We're using v0.33 of XML::Atom, and noticed that sometimes XHTML fragments get downgraded to escaped <content type="text">. This appears to be the result of LibXML returning an invalid parse of the content, due to -- valid in XHTML, and not valid in XML. I note that LibXML has a parse_html_string mode that appears to do The Right Thing here, but I have not verified it in the code.
The area of code seems to be in Content.pm, around where the eval{... } and the check for LIBXML occur; $node is returned empty from the parse attempt. Replacing for   runs through valid as xhtml.
Basically, if $node comes back empty from the eval, I try the parse again, but via the html method, and it comes in as xhtml, apparently properly.
Something along the lines of the following should work -- once proper error handling has been added:
--- /usr/lib/perl5/site_perl/5.8.8/XML/Atom/Content.pm 2009-02-11 12:32:36.000000000 -0500
+++ /home/vargo/tmp/Content.pm-vargo 2010-02-11 12:32:58.000000000 -0500
@@ -63,6 +63,13 @@
                 if $xp;
         }
     };
+
+    if (! $node) {
+        my $parser = XML::LibXML->new;
+        my $tree = $parser->parse_html_string($copy);
+        $node = $tree->getDocumentElement;
+    }
+
     if (!$@ && $node) {
         $elem->appendChild($node);
         if ($content->version == 0.3) {
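For reference, the fallback idea with basic error handling might look roughly like the standalone sketch below. It is not the code in Content.pm: parse_fragment is a hypothetical helper name, and it assumes $copy already holds the wrapped fragment that Content.pm hands to the parser. Note that parse_html_string builds a full HTML document, so the returned element is the document root rather than the fragment itself.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper, not part of XML::Atom: try a strict XML parse first,
# then fall back to the libxml2 HTML parser if the XML parse fails.
sub parse_fragment {
    my ($copy) = @_;
    my $parser = XML::LibXML->new;

    # First attempt: strict XML. parse_string dies on malformed input
    # (for example an entity that is not defined in plain XML).
    my $node = eval { $parser->parse_string($copy)->getDocumentElement };
    return $node if defined $node;

    # Second attempt: the HTML parser, which accepts HTML named entities.
    # Wrapped in eval so a second failure does not propagate to the caller.
    $node = eval { $parser->parse_html_string($copy)->getDocumentElement };
    return $node if defined $node;

    # Both parses failed; the caller can fall back to escaped text.
    return undef;
}
```

A caller would check the return value with defined and only append the node to the content element when a parse succeeded.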
Thanks,
Kevin