Bug #78256 for HTML-Tree: Closing P tags entirely absent even in simple document parsing

Fri Jul 06 18:26:12 2012 KAMELKEV [...] cpan.org - Ticket created

Subject:

Closing P tags entirely absent even in simple document parsing

Hi,

I've identified a long-standing (and arguably high-severity) bug related to paragraph tags. It's not clear to me how nobody ever noticed this before, but this block of HTML cannot be parsed properly:

    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    <p>This is my image. Click here to go to my website.</p>
    </body>
    </html>

Without any flags whatsoever (see attached script), the "as_HTML" output after parsing becomes:

    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    <p>This is my image. Click here to go to my website.
    </body>
    </html>

Note that the closing paragraph tag has now disappeared. This problem persists even when things become more complex (i.e. adding more tags does not provide hints for the tool to solve the problem). For example:

    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    <p>This is my image. Click here to go to my website.</p>
    <img src="myimage" alt="" />
    <p>This is my image. Click here to go to my website.</p>
    </body>
    </html>

predictably becomes:

    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    <p>This is my image. Click here to go to my website.
    <img src="myimage" alt="" />
    <p>This is my image. Click here to go to my website.
    </body>
    </html>

Basically, it appears that the tool does not consider closing P tags to exist at all. This is kind of a problem for me, as I need to parse things strictly while performing inlining (I am the author of CSS::Inliner).

Thoughts? Possible workarounds? We're a little confused over here :)

thanks,
Kevin

ps> see attachment for test script

Subject:

bugtest1.pl

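#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Minimal document containing one paragraph with an explicit closing tag.
my $html = <<END;
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<title></title>
</head>
<body>
<p>This is my image. Click here to go to my website.</p>
</body>
</html>
END

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof();    # signal end of input so the parser finalizes the tree

# Serialize with default options.
print $tree->as_HTML();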

Fri Jul 06 18:36:38 2012 KAMELKEV [...] cpan.org - Correspondence added

Hi again,

I did some additional checking to see if maybe closing P tags really aren't that important... but in HTML5 they are basically more important than ever. In HTML5, the closing P tag may be omitted only when the paragraph is followed by one of around 25 specific elements; before any of the remaining (75+) elements, the closing tag is required.

Please let me know what I can do on my side, if anything, to help resolve this. For us it's pretty important, because we actually attach CSS properties directly to the paragraph tags (when appropriate), and the lack of a closing tag causes all sorts of cascade issues.

-Kevin

Fri Jul 06 18:36:39 2012 KAMELKEV [...] cpan.org - Status changed from 'new' to 'open'

Fri Jul 06 18:40:11 2012 cjm [...] cpan.org - Correspondence added

This is not a parsing issue, it's an output issue. An HTML::Element tree never includes end tags (nor does it indicate whether an end tag was present during parsing).

As documented, by default as_HTML does not emit end tags for <p>, <li>, <dt>, or <dd> elements. You can change that by supplying the \%optional_end_tags parameter:

    print $tree->as_HTML(undef, undef, {});

will include end tags for all elements that can have content.
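The difference between the two calls can be checked side by side with a short script. This is only a minimal sketch, assuming HTML::TreeBuilder is installed; the sample markup and the variable names ($default, $explicit) are illustrative and are not taken from the attached test script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input: a single paragraph with an explicit closing tag.
my $html = '<html><head><title>My Title</title></head>'
         . '<body><p>This is my paragraph.</p></body></html>';

# new_from_content parses the string and finalizes the tree in one step.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Default serialization: end tags for p, li, dt and dd are omitted.
my $default = $tree->as_HTML();

# An empty hash as the third argument marks no end tag as optional,
# so every element that can have content gets its end tag.
my $explicit = $tree->as_HTML(undef, undef, {});

print "default:  $default\n";
print "explicit: $explicit\n";

$tree->delete;    # release the tree's memory
```

The first line printed should lack the closing P tag and the second should include it, even though both are serialized from the same tree. That is consistent with the explanation that the end tag is an output choice rather than something recorded during parsing.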

Fri Jul 06 18:40:11 2012 cjm [...] cpan.org - Status changed from 'open' to 'rejected'

Fri Jul 06 19:34:32 2012 KAMELKEV [...] cpan.org - Correspondence added

Thanks for the prompt and concise reply, I will update appropriately. -Kevin

Fri Jul 06 19:34:33 2012 The RT System itself - Status changed from 'rejected' to 'open'

Fri Jul 06 19:41:33 2012 cjm [...] cpan.org - Status changed from 'open' to 'rejected'