This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 83641
Status: new
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: js-bugtraq [...] sigmanetwork.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Parsing documents with head-only tags in the body
Date: Tue, 26 Feb 2013 6:45:54 -0800
To: bug-html-tree [...] rt.cpan.org
From: Joe Seaton <js-bugtraq [...] sigmanetwork.org>
Hello,

While working with HTML::TreeBuilder I recently discovered a particularly unpleasant HTML document that failed to parse as I would have liked, due to the presence of a <link> tag in the body. Given that this document is very invalid, I'm not sure whether this should be considered a bug or not, but it seemed worth reporting.

A minimal document is as follows:

    <html><head><title>Title</title></head><body>
    <form>
    <p>Before</p>
    <link>
    <div>After</div>
    </form>
    <span>Outside</span>
    </body>
    </html>

This results in the following parse tree:

    <html> @0
      <head> @0.0
        <title> @0.0.0
          "Title"
        <link /> @0.0.1
      <body> @0.1
        <form> @0.1.0
          <p> @0.1.0.0
            "Before"
        <div> @0.1.1
          "After"
        <span> @0.1.2
          "Outside"

Notably, the div following the link tag is considered a child of the body rather than of the form. For my purposes I care about the contents of the form and nothing else, so I would prefer this div to remain inside the form.

The relevant part of the trace is:

    Proposing a new LINK under html/body/form.
    * head element LINK found inside BODY!
    (Attaching link under head)
    (Current lineage of pos: LINK under html.)
    Proposing a new text node (\x0a ) under html/head.
    (Attaching text node (\x0a ) under head).
    Proposing a new DIV under html/head.
    * body-element DIV minimizes HEAD, makes implicit BODY.
    (Attaching div under body)

This seems to be due to line 679 (in v5.03):

    $self->{'_pos'} = $self->{'_head'} || die "Where'd my head go?";

The code to reproduce this is fairly trivial:

    use HTML::TreeBuilder;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($ARGV[0]);
    $tree->dump;

Disabling implicit tags causes the document to be parsed as follows, which keeps the div inside the form at the expense of an extraneous link tag:

    <html> @0 (IMPLICIT)
      <html> @0.0
        <head> @0.0.0
          <title> @0.0.0.0
            "Title"
        <body> @0.0.1
          <form> @0.0.1.0
            <p> @0.0.1.0.0
              "Before"
            <link /> @0.0.1.0.1
            <div> @0.0.1.0.2
              "After"
          <span> @0.0.1.1
            "Outside"

I hope this is of some interest to you all.
many thanks, Joe
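
The report mentions disabling implicit tags but does not show the call that was used. A minimal sketch of one way to do it, assuming the `implicit_tags` option of HTML::TreeBuilder is what was meant; the inline document and the `look_down` check are illustrative additions, not part of the original report:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The minimal document from the report, inlined so the script is self-contained.
my $html = <<'HTML';
<html><head><title>Title</title></head><body>
<form>
<p>Before</p>
<link>
<div>After</div>
</form>
<span>Outside</span>
</body>
</html>
HTML

my $tree = HTML::TreeBuilder->new;

# Assumption: "disabling implicit tags" refers to this option.
# It has to be set before any content is parsed.
$tree->implicit_tags(0);

$tree->parse($html);
$tree->eof;

# Illustrative check: is there a div somewhere below the form?
my $form = $tree->look_down(_tag => 'form');
my $div  = $form ? $form->look_down(_tag => 'div') : undef;
print $div ? "div is inside form\n" : "div is not inside form\n";

$tree->dump;
$tree->delete;
```

Inlining the document avoids the dependency on `$ARGV[0]`; to run it against a file as in the original reproduction, `parse_file` can be used in place of `parse` and `eof`.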