Subject: missing </p> tags
Date: Wed, 13 May 2009 10:14:54 -0500
To: bug-HTML-Tree [...] rt.cpan.org
From: Dean Karres <dean.karres [...] gmail.com>
Hi,
I am running HTML-Tree-3.23 on a RHEL 5.3 server. I am using the
Template Toolkit but that happens later in the process.
I have an HTML file:
##########################################
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>ITG</title>
<link href="index.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="itg-about">
<p>The primary mission of
ITG is to provide state-of-the-art imaging facilities for
researchers at the Institute for Advanced Science and
Technology . This service mission is
accomplished through two facilities: the Microscopy Suite and
the Visualization Laboratory.</p>
<p>A secondary mission of the ITG is to develop advanced imaging
technologies with an emphasis on projects in remote instrument
control and scientific visualization.</p>
</div>
<div class="itg-column-1">
<div id="itg-iotw">
[% PERL %]
print `/old-www/www/exhibits/iotw/new-iotw.cgi`;
[% END %]
</div>
<div id="itg-forum">
[% PERL %]
print `/old-www/www/publications/forums/last-Forum.cgi`;
[% END %]
</div>
</div>
<div class="itg-column-2">
<div id="itg-announcement">
[% PERL %]
print `/old-www/www/publications/announcements/announcements.cgi`;
[% END %]
</div>
<div id="itg-news">
[% PERL %]
print `/old-www/www/publications/news/new-News.cgi`;
[% END %]
</div>
</div>
</body>
</html>
##########################################
I have a script that reads this file and harvests the <BODY> text:
#########################################
#!/usr/bin/perl -w
use strict;
select(STDOUT);
$|++;
use HTML::TreeBuilder;
my $stdinFile = "";
my $tree = HTML::TreeBuilder->new;
$tree->p_strict(1);
$tree->warn(1);
$tree->implicit_tags(1);
$tree->store_comments(1);
my $body = "";
my $tmp = "";
if ($#ARGV < 0)
{
$ARGV[0] = "/www/www/Index.html";
}
if ($ARGV[0] !~ /\.(htm|html|shm|shtml)(#.*)?$/)
{
# report the offending argument itself, not the last index of @ARGV
die "Malformed query string: \"$ARGV[0]\"\n";
}
die "Not a file\n" if (!-f $ARGV[0] || -z $ARGV[0]);
$tree->parse_file("$ARGV[0]");
#
# harvest the BODY element and its contents
#
eval { $body = $tree->look_down('_tag', 'body'); };
die __LINE__ . ": " . $@ if $@;
die "$ARGV[0] is missing a BODY tag\n" if (! $body);
$tmp = $body->as_HTML;
$tmp =~ s/<body>//i;
$tmp =~ s/<\/body>//i;
print STDOUT $tmp;
$tree->delete();
exit(0);
#########################################
The result of running the script on the HTML file is:
#########################################
<div id="itg-about"><p>The primary mission of the ITG is to provide
state-of-the-art imaging facilities for researchers at the Institute
for Advanced Science and Technology. This service mission is
accomplished through two facilities: the Microscopy Suite and the
Visualization Laboratory.<p>A secondary mission of the ITG is to
develop advanced imaging technologies with an emphasis on projects in
remote instrument control and scientific visualization.</div><div
class="itg-column-1"><div id="itg-iotw"> [% PERL %] print
`/old-www/www/exhibits/iotw/new-iotw.cgi`; [% END %] </div><div
id="itg-forum"> [% PERL %] print
`/old-www/www/publications/forums/last-Forum.cgi`; [% END %]
</div></div><div class="itg-column-2"><div id="itg-announcement"> [%
PERL %] print `/old-www/www/publications/announcements/announcements.cgi`;
[% END %] </div><div id="itg-news"> [% PERL %] print
`/old-www/www/publications/news/new-News.cgi`; [% END %] </div></div>
#########################################
Note that a little less than halfway through the output is the string
"Laboratory.<p>A secondary": the closing "</p>" tag is missing from the result.
I may have misconfigured the script, but I thought:
$tree->p_strict(1);
$tree->implicit_tags(1);
would do the trick.
What am I missing?
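One possibility, offered as a sketch and not as a confirmed diagnosis: p_strict() and implicit_tags() are parser settings, while the choice of which end tags get printed is made by as_HTML(). As far as I understand the HTML::Element documentation, as_HTML() takes an optional third argument, a hash reference of tags whose end tags may be omitted, and an empty hash means every end tag is written out. The helper name body_inner_html and the sample markup below are hypothetical, for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse an HTML string and return the markup
# inside <body>, with every end tag written out.
sub body_inner_html {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->p_strict(1);
    $tree->implicit_tags(1);
    $tree->parse($html);
    $tree->eof;

    my $body = $tree->look_down('_tag', 'body')
        or die "missing a BODY tag\n";

    # Assumption: the empty hash ref as the third argument tells
    # as_HTML that no end tags are optional, so </p> is emitted.
    my $out = $body->as_HTML(undef, undef, {});
    $out =~ s/<body>//i;
    $out =~ s/<\/body>//i;

    $tree->delete;
    return $out;
}

# Sample input with an unclosed first paragraph.
print body_inner_html(
    '<html><body><div><p>One.<p>Two.</div></body></html>'
), "\n";
```

In the script above, the equivalent change would be replacing $body->as_HTML with $body->as_HTML(undef, undef, {}).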
--
Dean Karres