Subject: Parsing documents with head-only tags in the body
Date: Tue, 26 Feb 2013 6:45:54 -0800
To: bug-html-tree [...] rt.cpan.org
From: Joe Seaton <js-bugtraq [...] sigmanetwork.org>
Hello,
While working with HTML::TreeBuilder, I recently came across a particularly unpleasant HTML document that did not parse as I would have liked, because of a <link> tag in the body.
Given that this document is clearly invalid, I'm not sure whether this should be considered a bug, but it seemed worth reporting.
A minimal document is as follows:
<html><head><title>Title</title></head><body>
<form>
<p>Before</p>
<link>
<div>After</div>
</form>
<span>Outside</span>
</body>
</html>
This results in the following parse tree:
<html> @0
  <head> @0.0
    <title> @0.0.0
      "Title"
    <link /> @0.0.1
  <body> @0.1
    <form> @0.1.0
      <p> @0.1.0.0
        "Before"
    <div> @0.1.1
      "After"
    <span> @0.1.2
      "Outside"
Notably, the div following the link tag is treated as a child of the body rather than of the form.
For my purposes I only care about the contents of the form, so I would prefer this div to remain inside the form.
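The misplacement can also be checked programmatically rather than by reading the dump. The following is only a minimal sketch, assuming HTML::TreeBuilder is installed from CPAN; the variable names are illustrative. It parses the minimal document with default settings and prints the ancestry of the div. Going by the tree above, on v5.03 this would be expected to report html/body rather than html/body/form.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The minimal document from this report, inlined for convenience.
my $html = <<'HTML';
<html><head><title>Title</title></head><body>
<form>
<p>Before</p>
<link>
<div>After</div>
</form>
<span>Outside</span>
</body>
</html>
HTML

my $tree = HTML::TreeBuilder->new;   # default settings, implicit tags enabled
$tree->parse($html);
$tree->eof;

# Find the first div and list its ancestors, root first.
my $div = $tree->look_down(_tag => 'div')
    or die "No div found in parsed tree\n";
my $path = join '/', reverse $div->lineage_tag_names;
print "div is under: $path\n";

$tree->delete;
```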
The relevant part of the trace is:
Proposing a new LINK under html/body/form.
* head element LINK found inside BODY!
(Attaching link under head)
(Current lineage of pos: LINK under html.)
Proposing a new text node (\x0a ) under html/head.
(Attaching text node (\x0a ) under head).
Proposing a new DIV under html/head.
* body-element DIV minimizes HEAD, makes implicit BODY.
(Attaching div under body)
This seems to be due to line 679 (in v5.03):
$self->{'_pos'} = $self->{'_head'} || die "Where'd my head go?";
The code to reproduce this is fairly trivial:
use strict;
use warnings;
use HTML::TreeBuilder;

# Parse the file named on the command line and print the resulting tree.
my $tree = HTML::TreeBuilder->new;
$tree->parse_file($ARGV[0]);
$tree->dump;
Disabling implicit tags causes the document to be parsed as follows. This keeps the div inside the form, at the expense of leaving an extraneous link tag there.
<html> @0 (IMPLICIT)
  <html> @0.0
    <head> @0.0.0
      <title> @0.0.0.0
        "Title"
    <body> @0.0.1
      <form> @0.0.1.0
        <p> @0.0.1.0.0
          "Before"
        <link /> @0.0.1.0.1
        <div> @0.0.1.0.2
          "After"
      <span> @0.0.1.1
        "Outside"
I hope this is of some interest to you all.
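For anyone wanting to reproduce the parse with implicit tags disabled, here is a minimal sketch. It assumes HTML::TreeBuilder is installed and uses its implicit_tags accessor, which has to be set before any input is parsed; the variable names are illustrative. It lists the element children of the form, which per the tree above should include the div.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The minimal document from this report, inlined for convenience.
my $html = <<'HTML';
<html><head><title>Title</title></head><body>
<form>
<p>Before</p>
<link>
<div>After</div>
</form>
<span>Outside</span>
</body>
</html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->implicit_tags(0);   # turn off implicit tag handling before parsing
$tree->parse($html);
$tree->eof;

my $form = $tree->look_down(_tag => 'form')
    or die "No form found in parsed tree\n";

# Collect the tag names of the form's children; plain text nodes are labelled.
my @child_tags = map { ref $_ ? $_->tag : '(text)' } $form->content_list;
print "form children: ", join(', ', @child_tags), "\n";

$tree->delete;
```

The trade-off is the one described above: the link element stays inside the form, so code walking the form's children would need to skip it.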
Many thanks,
Joe