
This queue is for tickets about the XML-Atom CPAN distribution.

Report information
The Basics
Id: 43212
Status: new
Priority: 0
Queue: XML-Atom

People
Owner: Nobody in particular
Requestors: vargok [...] yahoo.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: XML vs. [X]HTML parsing
Date: Wed, 11 Feb 2009 09:35:05 -0800 (PST)
To: bug-XML-Atom [...] rt.cpan.org
From: Kevin Vargo <vargok [...] yahoo.com>
Hi,

We're using v0.33 of XML::Atom, and noticed that XHTML fragments are sometimes downgraded to escaped <content type="text">. This appears to be the result of LibXML failing to parse the content because of &nbsp; -- valid in XHTML, but not valid in XML. I note that LibXML has a parse_html_string mode that appears to do The Right Thing here, but have not verified it in the code.

The relevant code seems to be in Content.pm, around where the eval { ... } and the check for LIBXML occur; $node comes back empty from the parse attempt. Replacing &nbsp; with &#160; lets the content through as valid XHTML.

Basically, if $node comes back empty from the eval, I try the parse again via the HTML method, and the content then comes through as XHTML, apparently correctly. Something along the lines of the following should work -- once proper error handling has been added:

--- /usr/lib/perl5/site_perl/5.8.8/XML/Atom/Content.pm 2009-02-11 12:32:36.000000000 -0500
+++ /home/vargo/tmp/Content.pm-vargo 2010-02-11 12:32:58.000000000 -0500
@@ -63,6 +63,13 @@
                     if $xp;
             }
         };
+
+        if (! $node) {
+            my $parser = XML::LibXML->new;
+            my $tree = $parser->parse_html_string($copy);
+            $node = $tree->getDocumentElement;
+        }
+
         if (!$@ && $node) {
             $elem->appendChild($node);
             if ($content->version == 0.3) {

Thanks,
Kevin
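
For anyone wanting to reproduce the behaviour outside XML::Atom, here is a minimal standalone sketch of the fallback idea described in the report. It is not the module's code: parse_fragment is a hypothetical helper and the sample markup is made up. It assumes XML::LibXML is installed and uses only its parse_string, parse_html_string, documentElement and findnodes methods. One difference from the patch above is an assumption on my part: libxml2's HTML parser normally wraps a fragment in html and body elements, so the sketch picks the first element under body rather than the document element.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not part of XML::Atom): try a strict XML parse
# first, then fall back to libxml2's HTML parser.
sub parse_fragment {
    my ($markup) = @_;
    my $parser = XML::LibXML->new;

    # Strict XML parse. This dies on entities that XML does not
    # predefine, such as &nbsp;, so the eval leaves $doc undefined.
    my $doc = eval { $parser->parse_string($markup) };
    return ($doc->documentElement, 'xml') if $doc;

    # HTML parse. The HTML parser knows the HTML entity set.
    $doc = eval { $parser->parse_html_string($markup) };
    return unless $doc;

    # Assumption: the fragment ends up under /html/body, so take the
    # first element child of body instead of the <html> root.
    my ($node) = $doc->findnodes('/html/body/*[1]');
    return ($node, 'html') if $node;
    return;
}

my ($node, $mode) = parse_fragment('<div>a&nbsp;b</div>');
print "parsed via $mode parser: ", $node->nodeName, "\n" if $node;
```

Nodes produced by the HTML parser carry no XHTML namespace, so a real fix would probably still need to put the result into the XHTML namespace before appending it to the Atom content element; that step is left out here.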