Subject: Patch: Attributes with invalid name omitted from XML output
Date: Thu, 20 Oct 2011 12:47:30 +0200
To: bug-HTML-Tree [...] rt.cpan.org
From: Zsbán Ambrus <ambrus [...] math.bme.hu>
Dear maintainers of HTML-Tree,
In HTML-Tree 4.2, if you call the as_XML method of an HTML::Element and
there are attributes with invalid names in the HTML, the method dies.
I attach a patch that changes the behavior of this method so that,
instead of dying, it omits those attributes from the output (so you
get well-formed XML).
A test case is included in the patch.
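
To illustrate the intended behavior, here is a minimal standalone sketch in plain Perl. It is not the code from the attached patch and does not use HTML-Tree at all: the helper names is_valid_xml_name and start_tag_as_xml are hypothetical, and the name check is only an ASCII approximation of the XML 1.0 Name production.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML-Tree): true if $name looks like a
# valid XML attribute name. ASCII-only approximation of the XML Name rule.
sub is_valid_xml_name {
    my ($name) = @_;
    return $name =~ /\A[A-Za-z_:][A-Za-z0-9_:.\-]*\z/ ? 1 : 0;
}

# Hypothetical helper: serialize a start tag, skipping attributes with
# invalid names instead of dying on them.
sub start_tag_as_xml {
    my ($tag, %attr) = @_;
    my $out = "<$tag";
    for my $name (sort keys %attr) {
        next unless is_valid_xml_name($name);    # omit rather than die
        my $value = $attr{$name};
        $value =~ s/&/&amp;/g;
        $value =~ s/</&lt;/g;
        $value =~ s/"/&quot;/g;
        $out .= qq{ $name="$value"};
    }
    return "$out>";
}

# An attribute named 'y"' is the kind of name a missing quote can produce.
print start_tag_as_xml('a', href => 'x.html', 'y"' => ''), "\n";
# prints: <a href="x.html">
```

The invalid attribute is dropped without any error, so the output stays well-formed.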
Back story. The current behavior was introduced in response to bug
report #23439. However, I think instead of dying it's better to
produce some valid XML output. How the invalid attributes are
represented in this output I don't really care.
I ran into this issue when I was trying to load some malformed HTML with
XML::Twig (which uses HTML::TreeBuilder as its backend). These
invalid attributes (resulting from missing quotes around the value in
the HTML source) actually occur in a different part of the HTML than
the part I want to extract data from. I could just use the
strict_names option of HTML::Parser in this case, but that's not an
ideal solution in the long term, as that turns the entire element into
text, which is not how browsers interpret invalid attributes like
this. Hence this patch, which lets me parse such documents.
I am using HTML-Tree version 4.2 (this patch is based on that),
HTML-Parser version 3.69, and perl 5.14.2 vanilla for x86_64-linux.
Ambrus