
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 71805
Status: new
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: ambrus [...] math.bme.hu
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Patch: Attributes with invalid name omitted from XML output
Date: Thu, 20 Oct 2011 12:47:30 +0200
To: bug-HTML-Tree [...] rt.cpan.org
From: Zsbán Ambrus <ambrus [...] math.bme.hu>
Dear maintainers of HTML-Tree,

In HTML-Tree 4.2, if you call the as_XML method of an HTML::Element and the HTML contains attributes with invalid names, the method dies. I attach a patch that changes this behavior: instead of dying, the method omits those attributes from the output, so you get well-formed XML. A test case is included in the patch.

Back story. The current behavior was introduced in response to bug report #23439. However, I think that producing some valid XML output is better than dying. I don't really care how the invalid attributes are represented in that output.

I met this issue when I was trying to load some malformed HTML with XML::Twig (which uses HTML::TreeBuilder as its backend). The invalid attributes (which result from missing quotes around a value in the HTML source) actually occur in a different part of the HTML than the part I want to extract data from. I could just use the strict_names option of HTML::Parser in this case, but that's not an ideal long-term solution, because it turns the entire element into text, which is not how browsers interpret invalid attributes like this. Thus, I am submitting this patch to be able to parse such documents.

I am using HTML-Tree version 4.2 (the patch is based on that), HTML-Parser version 3.69, and perl 5.14.2 vanilla for x86_64-linux.

Ambrus

Message body is not shown because sender requested not to inline it.
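
For readers without access to the attachment, here is a minimal sketch of the behavior the report asks for. It is not the attached patch and does not use HTML::Element: xml_start_tag is a hypothetical helper, and $XML_NAME is a simplified, ASCII-only approximation of the XML Name production (the real production also admits many non-ASCII characters). It only illustrates skipping attributes whose names are not valid XML names, rather than dying.

```perl
use strict;
use warnings;

# Assumption: simplified, ASCII-only stand-in for the XML "Name" production.
my $XML_NAME = qr/\A[A-Za-z_:][A-Za-z0-9_:.\-]*\z/;

# Hypothetical helper (not part of HTML-Tree): serialize a start tag,
# silently omitting attributes whose names are not valid XML names.
sub xml_start_tag {
    my ($tag, %attr) = @_;
    my $out = "<$tag";
    for my $name (sort keys %attr) {
        next unless $name =~ $XML_NAME;    # omit the attribute instead of dying
        my $value = $attr{$name};
        $value =~ s/&/&amp;/g;             # escape '&' first
        $value =~ s/</&lt;/g;
        $value =~ s/"/&quot;/g;
        $out .= qq{ $name="$value"};
    }
    return "$out>";
}

# An attribute name such as 'class=foo"' can arise from a missing quote
# in the HTML source; it is dropped, and the valid attribute is kept.
print xml_start_tag('a', href => '/x?a=1&b=2', 'class=foo"' => 'bar'), "\n";
```

With the input shown, only the href attribute survives, so the resulting start tag is well-formed.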