This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 46040
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: dean.karres [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: missing </p> tags
Date: Wed, 13 May 2009 10:14:54 -0500
To: bug-HTML-Tree [...] rt.cpan.org
From: Dean Karres <dean.karres [...] gmail.com>
Hi,

I am running HTML-Tree-3.23 on a RHEL 5.3 server. I am using the Template Toolkit but that happens later in the process.

I have an html file:

##########################################
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>ITG</title>
<link href="index.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="itg-about">
  <p>The primary mission of ITG is to provide state-of-the-art imaging facilities for researchers at the Institute for Advanced Science and Technology . This service mission is accomplished through two facilities: the Microscopy Suite and the Visualization Laboratory.</p>
  <p>A secondary mission of the ITG is to develop advanced imaging technologies with an emphasis on projects in remote instrument control and scientific visualization.</p>
</div>
<div class="itg-column-1">
  <div id="itg-iotw">
    [% PERL %]
    print `/old-www/www/exhibits/iotw/new-iotw.cgi`;
    [% END %]
  </div>
  <div id="itg-forum">
    [% PERL %]
    print `/old-www/www/publications/forums/last-Forum.cgi`;
    [% END %]
  </div>
</div>
<div class="itg-column-2">
  <div id="itg-announcement">
    [% PERL %]
    print `/old-www/www/publications/announcements/announcements.cgi`;
    [% END %]
  </div>
  <div id="itg-news">
    [% PERL %]
    print `/old-www/www/publications/news/new-News.cgi`;
    [% END %]
  </div>
</div>
</body>
</html>
##########################################

I have a script that reads this file and harvests the <BODY> text:

#########################################
#!/usr/bin/perl -w

use strict;

select(STDOUT); $|++;

use HTML::TreeBuilder;

my $stdinFile = "";
my $tree = HTML::TreeBuilder->new;
$tree->p_strict(1);
$tree->warn(1);
$tree->implicit_tags(1);
$tree->store_comments(1);
my $body = "";
my $tmp = "";

if ($#ARGV < 0) {
    $ARGV[0] = "/www/www/Index.html";
}

if ($ARGV[0] !~ /\.(htm|html|shm|shtml)(#.*)?$/) {
    die "Malformed query string: \"$#ARGV\"\n"
}

die "Not a file\n" if (!-f $ARGV[0] || -z $ARGV[0]);

$tree->parse_file("$ARGV[0]");

#
# harvest the first H1 tag and any sub-H2 tags
#
eval {
    $body = $tree->look_down('_tag', 'body');
};
die __LINE__ . ": " . $@ if $@;
die "$ARGV[0] is missing a BODY tag\n" if (! $body);

$tmp = $body->as_HTML;
$tmp =~ s/<body>//i;
$tmp =~ s/<\/body>//i;
print STDOUT $tmp;

$tree->delete();

exit(0);
#########################################

The result of running the script on the html is:

#########################################
<div id="itg-about"><p>The primary mission of the ITG is to provide state-of-the-art imaging facilities for researchers at the Institute for Advanced Science and Technology. This service mission is accomplished through two facilities: the Microscopy Suite and the Visualization Laboratory.<p>A secondary mission of the ITG is to develop advanced imaging technologies with an emphasis on projects in remote instrument control and scientific visualization.</div><div class="itg-column-1"><div id="itg-iotw"> [% PERL %] print `/old-www/www/exhibits/iotw/new-iotw.cgi`; [% END %] </div><div id="itg-forum"> [% PERL %] print `/old-www/www/publications/forums/last-Forum.cgi`; [% END %] </div></div><div class="itg-column-2"><div id="itg-announcement"> [% PERL %] print `/old-www/www/publications/announcements/announcements.cgi`; [% END %] </div><div id="itg-news"> [% PERL %] print `/old-www/www/publications/news/new-News.cgi`; [% END %] </div></div>
#########################################

You may note that not quite half-way in is the string: "Laboratory.<p>A secondary". The "</p>" tag is missing in the result.

I may have misconfigured the script but I thought:

    $tree->p_strict(1);
    $tree->implicit_tags(1);

would do the trick. What am I missing?

--
Dean Karres
Subject: Re: [rt.cpan.org #46040] AutoReply: missing </p> tags
Date: Wed, 13 May 2009 14:11:57 -0500
To: bug-HTML-Tree [...] rt.cpan.org
From: Dean Karres <dean.karres [...] gmail.com>
Sigh, never mind. Why do I find solutions only after I submit bug reports...

The answer is in the as_HTML method: several closing tags are optional by default, so they are left out of the output. Giving as_HTML an empty set of optional end tags clears this issue right up.

Sorry for the noise.
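For anyone who lands on this ticket later, the fix described above can be sketched roughly as follows. This is a minimal illustration, not the requestor's actual script: the sample HTML string and the variable names are made up for the example. It relies on the documented third argument of as_HTML, a hash reference naming the tags whose end tags may be omitted; passing an empty hash reference means no end tag is treated as optional.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, much smaller than the page in the report.
my $html = '<html><body><div><p>First paragraph.</p>'
         . '<p>Second paragraph.</p></div></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

my $body = $tree->look_down('_tag', 'body');

# Default call: end tags that HTML::Tagset lists as optional
# (for example </p>) are left out of the dump.
my $default_html = $body->as_HTML;

# Third argument is an empty hashref: no end tag is optional,
# so every </p> is written out. The first two arguments
# (entities to encode, indent string) are left at their defaults.
my $strict_html = $body->as_HTML(undef, undef, {});

print "default: $default_html\n";
print "strict:  $strict_html\n";

$tree->delete;
```

In the script from the report, the equivalent change would be to replace `$body->as_HTML` with `$body->as_HTML(undef, undef, {})`.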
Resolved per requestor.