
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 78256
Status: rejected
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: KAMELKEV [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in:
  • 3.23
  • 4.2
  • 5.02
Fixed in: (no value)



Subject: Closing P tags entirely absent even in simple document parsing
Hi,

I've identified a long-standing (and arguably high severity) bug related to paragraph tags. It's not clear to me how nobody ever noticed this before, but this block of HTML cannot be parsed properly:

    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    <p>This is my image. Click here to go to my website.</p>
    </body>
    </html>

Without any flags whatsoever (see attached script) the "as_HTML" output after parsing becomes:

    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    <p>This is my image. Click here to go to my website.
    </body>
    </html>

Note that the closing paragraph tag has now disappeared. This problem persists even when things become more complex (i.e. adding more tags does not provide hints for the tool to solve the problem). For example:

    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    <p>This is my image. Click here to go to my website.</p>
    <img src="myimage" alt="" />
    <p>This is my image. Click here to go to my website.</p>
    </body>
    </html>

predictably becomes:

    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    <p>This is my image. Click here to go to my website.
    <img src="myimage" alt="" />
    <p>This is my image. Click here to go to my website.
    </body>
    </html>

Basically it appears that the tool does not consider closing P tags to exist at all. This is kind of a problem for me as I need to strictly parse things while performing inlining (I am the author of CSS::Inliner).

Thoughts? Possible workarounds? We're a little confused over here :)

thanks,
Kevin
P.S. See attachment for test script.
Subject: bugtest1.pl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = <<END;
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<title></title>
</head>
<body>
<p>This is my image. Click here to go to my website.</p>
</body>
</html>
END

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
print $tree->as_HTML();
Hi again,

I did some additional checking to see whether closing P tags might not be that important, but in HTML5 they matter more than ever. In HTML5 the closing P tag may be omitted only when the paragraph is followed by one of around 25 specific elements; before any of the remaining (75+) elements the closing tag is required.

Please let me know what I can do on my side, if anything, to help resolve this. For us it's pretty important because we attach CSS properties directly to the paragraph tags (when appropriate), and the lack of a closing tag causes all sorts of cascade issues.

-Kevin
This is not a parsing issue, it's an output issue. An HTML::Element tree never includes end tags (nor does it indicate whether an end tag was present during parsing).

As documented, by default as_HTML does not emit end tags for <p>, <li>, <dt>, or <dd> elements. You can change that by supplying the \%optional_end_tags parameter:

    print $tree->as_HTML(undef, undef, {});

will include end tags for all elements that can have content.
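To make the difference concrete, here is a minimal sketch comparing the default output with the output produced when an empty hash reference is passed as the third argument. The sample markup and the variable names ($default, $strict) are illustrative assumptions rather than part of the original report, and it uses new_from_content as a convenience constructor in place of new() plus parse().

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input (an assumption, not the reporter's exact markup).
my $html = '<html><head><title>My Title</title></head>'
         . '<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>';

# new_from_content parses the string and finalizes the tree in one step.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Default call: end tags for p, li, dt and dd are treated as optional
# and therefore left out of the serialized output.
my $default = $tree->as_HTML();

# Empty hashref as \%optional_end_tags: no end tag is considered optional,
# so every element that can have content gets its closing tag.
my $strict = $tree->as_HTML(undef, undef, {});

print "default: $default\n";
print "strict:  $strict\n";

# Release the tree explicitly once it is no longer needed.
$tree->delete;
```

The first line of output should contain no closing paragraph tags, while the second should contain one for each paragraph. Since the tree itself stores no end tags, the choice is made entirely at serialization time.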
Thanks for the prompt and concise reply, I will update appropriately. -Kevin