This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 99936
Status: open
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: porton [...] narod.ru
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Wrong parsing HTML
Date: Fri, 31 Oct 2014 19:00:35 +0200
To: bug-html-tree [...] rt.cpan.org
From: Victor Porton <porton [...] narod.ru>
File test2.html:
[[[
<html>
<head>
<title>Test</title>
</head>
<body>
<form>
<link></link>
<input name="x" />
</form>
</body>
</html>
]]]

[[[
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file("test2.html");
print $tree->as_HTML, "\n";
]]]

Result:
[[[
<html><head><title>Test</title><link /></head><body><form></form><input name="x" /></body></html>
]]]

It closes the <form> tag in the wrong place, which puts the <input> outside of the form. Also, the <link> tag is placed in the wrong place.

The example is based on (stripped-down) real HTML code from a third-party site. We need to make it work. Yes, the place of the <link> tag is wrong, but we need to make it work anyway.

I will attempt to fix this error in HTML::TreeBuilder but may need your help.

--
Victor Porton - http://portonvictor.org

Subject: Re: [rt.cpan.org #99936] AutoReply: Wrong parsing HTML
Date: Fri, 31 Oct 2014 19:50:10 +0200
To: "bug-HTML-Tree [...] rt.cpan.org" <bug-html-tree [...] rt.cpan.org>
From: Victor Porton <porton [...] narod.ru>
Oh, it is a duplicate of Bug #83641.

Well, in 83641 it is said: "Given that this document is very invalid, I'm not sure whether this should be considered a bug or not, but it seemed worth reporting."

But for our company it is important to fix this bug, because we use third-party HTML documents that are invalid and that we cannot make valid. So we need it to work even with invalid HTML files.

--
Victor Porton - http://portonvictor.org

It should be sufficient to do:

    $HTML::Tagset::isHeadOrBodyElement{link} = 1;
    $HTML::Tagset::isHeadElement{link} = undef;

after loading HTML::Tree but before parsing. If the HTML has other head-only tags in the body, you can do the same for them.

This is messing with global variables, so it'll affect the whole program. You can use 'local' to limit the scope.
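
The 'local' variant of the workaround above could look roughly like the following. This is only a sketch, not a tested fix: the helper name parse_with_link_in_body is hypothetical (it is not part of HTML::Tree), and the HTML from the report is embedded as a string and parsed with parse_content instead of being read from test2.html, so that the script runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;    # also loads HTML::Tagset

# Hypothetical helper (not part of HTML::Tree): parse a string while
# <link> is temporarily treated as allowed in both head and body.
sub parse_with_link_in_body {
    my ($html) = @_;

    # 'local' restores the original table entries when this sub returns,
    # so the rest of the program sees the unmodified HTML::Tagset tables.
    local $HTML::Tagset::isHeadOrBodyElement{link} = 1;
    local $HTML::Tagset::isHeadElement{link}       = undef;

    my $tree = HTML::TreeBuilder->new();
    $tree->parse_content($html);    # parse() followed by eof()
    return $tree;
}

# The document from the report, embedded to keep the sketch self-contained.
my $html = <<'END_HTML';
<html>
<head>
<title>Test</title>
</head>
<body>
<form>
<link></link>
<input name="x" />
</form>
</body>
</html>
END_HTML

my $tree = parse_with_link_in_body($html);
print $tree->as_HTML, "\n";
$tree->delete;    # free the tree explicitly
```

The parse has to finish inside the scope of the 'local' statements, which is why the whole parse is wrapped in one sub. For other head-only tags that appear in the body, the same two lines would be repeated with the tag name changed.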