Subject: HTML::TableParser and <br> don't work as expected
Date: Fri, 4 Dec 2009 14:36:49 -0800
To: bug-HTML-TableParser [...] rt.cpan.org, djerius [...] cpan.org
From: perlbug [...] simons-clan.com
Diab Jerius,
I'm using HTML::TableParser to extract data from HTML tables.
I've hit a problem: <br> tags are removed without any
whitespace being put in their place, so the words on either
side of each tag run together and the output is incorrect.
Do you have any suggestions on how to fix this?
Given an input table like:
===
<html>
<head></head>
<body>
<table>
<tr>
<td>foo<br>bar</TD>
</tr>
<tr>
<td> </td>
<td>z<br>a<br>p<br></td>
</tr>
</table>
</body>
</html>
===
I get the following output:
./test.pl test.input
===
using version 0.38
start id = 1
columns [foobar]
columns [][zap]
===
Because there are <br> tags, I expect to get:
===
using version 0.38
start id = 1
columns [foo bar]
columns [][z a p]
===
Here is the test.pl script:
===
#! /usr/bin/perl -w
$| = 1;
use HTML::TableParser;
use strict;

my @reqs = ({
    id    => qr/./,    # all tables
    start => \&start,  # start callback
    row   => \&row,    # row callback
});

# function callbacks
sub start {
    my ( $id, $line, $udata ) = @_;
    print "start id = $id\n";
}

sub row {
    my ( $id, $line, $cols, $udata ) = @_;
    printf "columns [%s]\n", join '][', @{$cols};
}

# create parser object
my $p = HTML::TableParser->new(\@reqs,
    { Decode => 1, DecodeNBSP => 1, Trim => 1, Chomp => 1 });

printf "using version %s\n", $HTML::TableParser::VERSION;

foreach my $file (@ARGV) {
    $p->parse_file($file);
}
===
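
A possible interim workaround, offered only as a sketch and not as a fix in the module itself: replace each <br> tag with a space before the HTML reaches the parser. The helper name br_to_space below is hypothetical and is not part of HTML::TableParser. It is a plain regex substitution, so it would also rewrite a <br> that appears inside a comment or a <pre> block.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableParser): replace every
# <br>, <br/>, or <BR /> tag with a single space so that the text on
# either side of the tag no longer runs together.
sub br_to_space {
    my ($html) = @_;
    $html =~ s{<br\b[^>]*>}{ }gi;
    return $html;
}

# The second cell from the sample table above.
print br_to_space('<td>z<br>a<br>p<br></td>'), "\n";
# prints: <td>z a p </td>
```

To use it with test.pl, the file would have to be read into a string, passed through br_to_space, and then handed to the parser as a string rather than through parse_file. This assumes HTML::TableParser also exposes the parse and eof methods of HTML::Parser, the same way it exposes parse_file. With Trim => 1 set as in the script, the trailing space left by the final <br> should be removed, which would give the expected [z a p].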