
This queue is for tickets about the HTML-TableParser CPAN distribution.

Report information
The Basics
Id: 52445
Status: open
Priority: 0/
Queue: HTML-TableParser

People
Owner: Nobody in particular
Requestors: perlbug [...] simons-clan.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: HTML::TableParser and <br> don't work as expected
Date: Fri, 4 Dec 2009 14:36:49 -0800
To: bug-HTML-TableParser [...] rt.cpan.org, djerius [...] cpan.org
From: perlbug [...] simons-clan.com
Diab Jerius,

I'm using HTML::TableParser to extract data from HTML tables.

I've hit a problem with <br> tags being removed without any space being put in their place. The result is words and letters run together, making the output incorrect.

- Do you have any suggestions on how to fix this?

Given an input table like:
===
<html>
<head></head>
<body>
<table>
<tr>
<td>foo<br>bar</TD>
</tr>
<tr>
<td>&nbsp</td>
<td>z<br>a<br>p<br></td>
</tr>
</table>
</body>
</html>
===

I get the following output:

./test.pl test.input
===
using version 0.38
start id = 1
columns [foobar]
columns [][zap]
===

Because there are <br> tags, I expect to get:
===
using version 0.38
start id = 1
columns [foo bar]
columns [][z a p]
===

Here is the test.pl script:
===
#! /usr/bin/perl -w
$| = 1;
use HTML::TableParser;
use strict;

my @reqs = ({
    id    => qr/./,      # all tables
    start => \&start,    # start callback
    row   => \&row,      # row callback
});

# function callbacks
sub start {
    my ( $id, $line, $udata ) = @_;
    print "start id = $id\n";
}
sub row {
    my ( $id, $line, $cols, $udata ) = @_;
    printf "columns [%s]\n", join '][', @{$cols};
}

# create parser object
my $p = HTML::TableParser->new(\@reqs, { Decode => 1, DecodeNBSP => 1, Trim => 1, Chomp => 1 });
printf "using version %s\n", $HTML::TableParser::VERSION;
foreach my $file (@ARGV) {
    $p->parse_file($file);
}
===
On Fri Dec 04 17:37:13 2009, perlbug@simons-clan.com wrote:
> Diab Jerius,
>
> I'm using HTML::TableParser to extract data from HTML tables.
>
> I've hit a problem with <br> tags being removed without any
> space being put in their place. The result is words and
> letters run together making the output incorrect.
The code ignores everything except the table structure elements. To get the code to pay attention to the <br> tags requires that the code be able to accept chunks of text, rather than a single unbroken chunk as it does now. It also needs to know what to do with the <br> (or other tags).

I don't have any time right now to implement this, but I'm willing to work on it with you if you'd like.

The public API would be augmented to include an extra callback which would be handed an element name and would return a replacement string. If the callback was a hash instead of a function or method call, it would be used as a simple lookup table to replace the element. The user would register which elements they were interested in. Or, I suppose, one could turn on all elements while parsing a column.

Internally the HTML::TableParser::Table::text method needs to append rather than replace text in the internal buffer. One would have to make sure that the text buffer is initialized appropriately at the beginning or end of each column.

An HTML::TableParser::Table::catchall method needs to be created which would be invoked by HTML::Parser when one of the user-requested elements (e.g. <br>) is seen, and which then calls the user callback method and uses the text() method to append the result to the internal text buffer. The catchall method should only be registered with HTML::Parser while within a column, as that's the only time that the extra elements should be parsed.

There might be a couple of other issues which pop up after the above is implemented.

Diab
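
To make the proposal above concrete, here is a minimal stand-alone sketch of the two ideas it describes: a replacement handler that may be either a callback or a hash used as a lookup table, and a per-column text buffer that is appended to rather than overwritten. Nothing here is HTML::TableParser code. The names element_replacement and column_text, and the event list that stands in for what HTML::Parser would deliver, are hypothetical and exist only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableParser): map an element
# name to replacement text via either a code ref or a hash ref.
sub element_replacement {
    my ( $handler, $element ) = @_;
    my $text =
        ref $handler eq 'CODE' ? $handler->($element)
      : ref $handler eq 'HASH' ? $handler->{$element}
      :                          undef;
    return defined $text ? $text : '';    # unknown elements add nothing
}

# Hypothetical stand-in for one column's parse events: plain strings are
# text chunks, array refs name an element seen inside the column.
sub column_text {
    my ( $handler, @events ) = @_;
    my $buffer = '';                      # initialized once per column
    for my $event (@events) {
        # Append to the buffer instead of replacing its contents.
        $buffer .= ref $event eq 'ARRAY'
            ? element_replacement( $handler, $event->[0] )
            : $event;
    }
    $buffer =~ s/^\s+|\s+$//g;            # imitates the effect of Trim => 1
    return $buffer;
}

# The second cell of the second row from the report: z<br>a<br>p<br>
my @cell = ( 'z', ['br'], 'a', ['br'], 'p', ['br'] );

print column_text( { br => ' ' }, @cell ), "\n";                       # hash form
print column_text( sub { $_[0] eq 'br' ? ' ' : '' }, @cell ), "\n";    # callback form
```

Both calls print "z a p", which is the text the reporter expected for that cell. In a real implementation the event list would come from HTML::Parser handlers registered only while inside a column, as described above.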