Subject: | Closing P tags entirely absent even in simple document parsing |
Hi,
I've identified a long-standing (and arguably high-severity) bug
related to paragraph tags.
It's not clear to me how nobody noticed this before, but this
block of HTML does not come back intact after parsing:
<html>
<head>
<title>My Title</title>
</head>
<body>
<p>This is my image. Click here to go to my website.</p>
</body>
</html>
Without any flags whatsoever (see attached script), the "as_HTML" output
after parsing is:
<html>
<head>
<title>My Title</title>
</head>
<body>
<p>This is my image. Click here to go to my website.
</body>
</html>
Note that the closing paragraph tag has now disappeared.
This problem persists even when the document becomes more complex (that
is, adding more tags does not give the tool any hints that help). For
example:
<html>
<head>
<title>My Title</title>
</head>
<body>
<p>This is my image. Click here to go to my website.</p>
<img src="myimage" alt="" />
<p>This is my image. Click here to go to my website.</p>
</body>
</html>
predictably becomes:
<html>
<head>
<title>My Title</title>
</head>
<body>
<p>This is my image. Click here to go to my website.
<img src="myimage" alt="" />
<p>This is my image. Click here to go to my website.
</body>
</html>
Basically, it appears that the tool does not consider closing P tags to
exist at all. This is a real problem for me, as I need to parse strictly
while performing inlining (I am the author of CSS::Inliner).
Thoughts? Possible workarounds? We're a little confused over here :)
thanks,
Kevin
P.S. See the attachment for the test script.
Subject: | bugtest1.pl |
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = <<END;
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<title></title>
</head>
<body>
<p>This is my image. Click here to go to my website.</p>
</body>
</html>
END
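# Build the tree; eof() tells the parser that the document is complete.
my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof();
# Serialize with default arguments (no flags).
print $tree->as_HTML();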
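
A possible workaround, sketched under the assumption that the closing tags are dropped at serialization time rather than lost during parsing: as_HTML accepts a third argument, a hashref naming the tags whose end tags may be omitted, and an empty hashref should mean that no end tag is treated as optional. The variable name $out is illustrative only, and the first two arguments (entities and indent character) are simply left at undef here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Same document as in the report; single-quoted heredoc avoids interpolation.
my $html = <<'END';
<html>
<head>
<title>My Title</title>
</head>
<body>
<p>This is my image. Click here to go to my website.</p>
</body>
</html>
END

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof();    # signal end of document to the parser

# Third argument: hashref of tags whose end tags are optional.
# An empty hashref asks as_HTML to emit every end tag.
my $out = $tree->as_HTML(undef, undef, {});
print $out, "\n";

$tree->delete();    # free the tree explicitly
```

If this prints the closing </p>, the parse tree itself was intact and only the default output settings were hiding the tag.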