This queue is for tickets about the HTML-TableExtract CPAN distribution.

Report information
The Basics
Id: 27372
Status: open
Priority: 0
Queue: HTML-TableExtract

People
Owner: Nobody in particular
Requestors: Marcin.Kasperski [...] mekk.waw.pl
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 1.10
Fixed in: (no value)



Subject: Access to row/cell attributes
It would be nice if TableExtract (when asked via some parameter) allowed access to the information present in the attributes of <tr> and <td> tags.

Motivation? I am parsing a table in which I need to extract a URL from a construct similar to <tr onclick="window.location.href='<valuable url here>'">.

The solution of my dreams? To be able to define (alongside the normal columns) a pseudo-column 'tr/onclick' and get whatever is in the onclick attribute of the tr. A similar problem sometimes happens with <td>; there too I have faced cases where a valuable URL must be dug out of an attribute.
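Once the onclick value is available as a string, getting the URL out of it is a small regex job. A minimal sketch in plain Perl, assuming the value has the shape window.location.href='...' (single or double quotes); the helper name url_from_onclick and the sample URL are made up for illustration and are not part of HTML::TableExtract:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): pull the URL out of
# an onclick value of the form  location.href='...'  or  location.href="...".
sub url_from_onclick {
    my ($onclick) = @_;
    return unless defined $onclick;

    # Capture the opening quote, then everything up to the same quote again.
    if ( $onclick =~ /location\.href\s*=\s*(['"])(.*?)\1/ ) {
        return $2;
    }
    return;    # no recognisable assignment found
}

# Invented sample value, shaped like the construct described above.
my $onclick = q{window.location.href='http://example.com/item?id=7'};
my $url     = url_from_onclick($onclick);
print defined $url ? "$url\n" : "no URL found\n";
```

The helper returns nothing when the attribute is missing or has a different shape, so callers can test the result with defined.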
From: davidrw [...] cpan.org
On Fri Jun 01 13:35:06 2007, Mekk wrote:
> It would be nice, if TableExtract (when asked by some parameter) allowed
> one to access information present in the attributes of <tr> and <td> tags.
Attached is a patch (including a POD update) and a test file. The patch is against HTML::TableExtract-2.10, and the test suite passes before and after (v5.6.1; Linux 2.4.21-32.0.1.EL i686 unknown).
#!/usr/bin/perl

use strict;
use warnings;
use Test::More tests => 52;
use HTML::TableExtract;

my $te = HTML::TableExtract->new( );
my $html = do{ local $/ = undef; <DATA> };
ok($te->parse($html), "parse_file");

my @t = $te->tables;
is(@t, 2, "extract count");

{
  my $ts = $t[1];
  ok($ts, "===outer table===");
  is(join(',',$ts->coords),'0,0','coords');
  my @rows = $ts->rows;
  my $R = scalar @rows;
  is($R,5,'rows');
  my $C = scalar @{$rows[0]};
  is($C,3,'cols');
  foreach my $r ( 0 .. 3 ){
    is( $ts->cell_attr($r)->{foo}, "row$r", "($r) attribs" );
    foreach my $c ( 0 .. 2 ){
      is( $ts->cell($r,$c), "cell$r-$c", "($r,$c) contents" );
      is( $ts->cell_attr($r,$c)->{foo}, "cell$r,$c", "($r,$c) attribs" );
    }
  }
}

{
  my $ts = $t[0];
  ok($ts, "===inner table===");
  is(join(',',$ts->coords),'1,0','coords');
  my @rows = $ts->rows;
  my $R = scalar @rows;
  is($R,2,'rows');
  my $C = scalar @{$rows[0]};
  is($C,3,'cols');
  foreach my $r ( 0 .. 1 ){
    is( $ts->cell_attr($r)->{foo}, "t2row$r", "t2($r) attribs" );
    foreach my $c ( 0 .. 2 ){
      is( $ts->cell($r,$c), "t2cell$r-$c", "t2($r,$c) contents" );
      is( $ts->cell_attr($r,$c)->{foo}, "t2cell$r,$c", "t2($r,$c) attribs" );
    }
  }
}

__DATA__
<html>
<head><title>TableExtract Test HTML</title></head>
<body>
<table>
  <tr foo="row0">
    <th foo="cell0,0">cell0-0</th>
    <th foo="cell0,1">cell0-1</th>
    <th foo="cell0,2">cell0-2</th>
  </tr>
  <tr foo="row1">
    <td foo="cell1,0">cell1-0</td>
    <td foo="cell1,1">cell1-1</td>
    <td foo="cell1,2">cell1-2</td>
  </tr>
  <tr foo="row2">
    <td foo="cell2,0">cell2-0</td>
    <td foo="cell2,1">cell2-1</td>
    <td foo="cell2,2">cell2-2</td>
  </tr>
  <tr foo="row3">
    <td foo="cell3,0">cell3-0</td>
    <td foo="cell3,1">cell3-1</td>
    <td foo="cell3,2">cell3-2</td>
  </tr>
  <tr foo="row4">
    <td foo="cell4,0" colspan=3>
      <table>
        <tr foo="t2row0">
          <th foo="t2cell0,0">t2cell0-0</th>
          <th foo="t2cell0,1">t2cell0-1</th>
          <th foo="t2cell0,2">t2cell0-2</th>
        </tr>
        <tr foo="t2row1">
          <td foo="t2cell1,0">t2cell1-0</td>
          <td foo="t2cell1,1">t2cell1-1</td>
          <td foo="t2cell1,2">t2cell1-2</td>
        </tr>
      </table>
    </td>
  </tr>
</table>
</body>
</html>
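The test file calls cell_attr with a row index alone to get <tr> attributes, and with a row and a column to get <td>/<th> attributes. As a stand-alone sketch of that lookup rule, the snippet below models the nested-hash layout the attached patch appears to use (row index first, then either the key 'tr' or a column index). The sub name lookup_attr and the sample data are invented for illustration; this is not the module's code and needs no patched HTML::TableExtract to run.

```perl
use strict;
use warnings;

# Invented sample store: row index => { 'tr' => row attributes,
#                                       column index => cell attributes }.
my %cell_attribs = (
    0 => {
        'tr' => { onclick => "window.location.href='/a'" },
        1    => { class   => 'num' },
    },
);

# Hypothetical stand-in for the lookup: one index asks for the <tr>
# attributes, two indexes ask for the <td>/<th> attributes.
sub lookup_attr {
    my ( $store, $r, $c ) = @_;
    $c = 'tr' unless defined $c;

    # Check the row first so a miss does not autovivify an empty row entry.
    return unless exists $store->{$r};
    return $store->{$r}{$c};    # undef when nothing was recorded
}

my $row  = lookup_attr( \%cell_attribs, 0 );
my $cell = lookup_attr( \%cell_attribs, 0, 1 );
print "row onclick: $row->{onclick}\n";
print "cell class:  $cell->{class}\n";
```

Because a missing row or cell yields undef rather than an empty hashref, callers should check the result before dereferencing it.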
*** ../HTML-TableExtract-2.10/lib/HTML/TableExtract.pm Sat Jul 15 19:52:34 2006
--- lib/HTML/TableExtract.pm Sat Jan 12 19:05:33 2008
***************
*** 125,135 ****
      my $skiptag = 0;
      if ($_[0] eq 'tr') {
        $ts->_enter_row;
        ++$skiptag;
      }
      elsif ($_[0] eq 'td' || $_[0] eq 'th') {
        $ts->_enter_cell(@_);
!       my %attrs = ref $_[1] ? %{$_[1]} : {};
        my $rspan = $attrs{rowspan} || 1;
        my $cspan = $attrs{colspan} || 1;
        $ts->_rasterizer->($ts->row_count, $rspan, $cspan);
--- 125,138 ----
      my $skiptag = 0;
      if ($_[0] eq 'tr') {
        $ts->_enter_row;
+       my %attrs = ref $_[1] ? %{$_[1]} : ();
+       $ts->{cell_attribs}->{ $ts->{rc} }->{tr} = \%attrs if scalar keys %attrs;
        ++$skiptag;
      }
      elsif ($_[0] eq 'td' || $_[0] eq 'th') {
        $ts->_enter_cell(@_);
!       my %attrs = ref $_[1] ? %{$_[1]} : ();
!       $ts->{cell_attribs}->{ $ts->{rc} }->{ $ts->{cc} } = \%attrs if scalar keys %attrs;
        my $rspan = $attrs{rowspan} || 1;
        my $cspan = $attrs{colspan} || 1;
        $ts->_rasterizer->($ts->row_count, $rspan, $cspan);
***************
*** 454,459 ****
--- 457,463 ----
      children => [],
      captured => 0,
      debug => 0,
+     cell_attribs => {},
    };

    $self->{_rastamon} = HTML::TableExtract::Rasterize->make_rasterizer();
***************
*** 740,746 ****
    }
    ++$self->{cc};
    ++$self->{in_cell};
!   my %attrs = ref $_[1] ? %{$_[1]} : {};
    my $rspan = $attrs{rowspan} || 1;
    my $cspan = $attrs{colspan} || 1;
  }
--- 744,750 ----
    }
    ++$self->{cc};
    ++$self->{in_cell};
!   my %attrs = ref $_[1] ? %{$_[1]} : ();
    my $rspan = $attrs{rowspan} || 1;
    my $cspan = $attrs{colspan} || 1;
  }
***************
*** 911,916 ****
--- 915,929 ----
    $self->_cell_to_content($row->[$c]);
  }

+ sub cell_attr {
+   my $self = shift;
+   my($r, $c) = @_;
+   $c = 'tr' unless defined $c;
+   return unless exists $self->{cell_attribs}->{$r};
+   return unless exists $self->{cell_attribs}->{$r}->{$c};
+   return $self->{cell_attribs}->{$r}->{$c};
+ }
+
  sub _cell_to_content {
    my $self = shift;
    @_ or croak "cell item required\n";
***************
*** 1691,1696 ****
--- 1704,1719 ----
  covered due to rowspan or colspan issues, in which case the content of
  the covering cell is returned rather than undef.

+ =item cell_attr($row,$col)
+
+ Return a hashref of HTML attributes for the TD/TH element.
+ Returns undef if no attributes.
+
+ =item cell_attr($row)
+
+ Return a hashref of HTML attributes for the TR element.
+ Returns undef if no attributes.
+
  =item depth()

  Return the depth at which this table was found.
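Besides adding cell_attr, the patch changes the fallback in the %attrs assignment from {} to (). A minimal sketch of why that matters in plain Perl: assigning a lone hashref to a hash uses the stringified reference as a key, so the hash is not empty. Here $attr_arg is only a stand-in for $_[1] in the case where no attribute hashref is passed, and the variable names are invented for the demonstration.

```perl
use strict;
use warnings;

# Stand-in for $_[1] when no attribute hashref is supplied (an assumption
# made for this demonstration).
my $attr_arg;

# Old fallback: a lone hashref is assigned to the hash, so the stringified
# reference becomes a key with an undef value.
my %old;
{
    # Silence "Reference found where even-sized list expected".
    no warnings 'misc';
    %old = ref $attr_arg ? %{$attr_arg} : {};
}

# New fallback: an empty list leaves the hash empty.
my %new = ref $attr_arg ? %{$attr_arg} : ();

printf "with {}: %d key(s)\n", scalar keys %old;
printf "with (): %d key(s)\n", scalar keys %new;
```

This prints one key for the {} form and zero for the () form. With the old fallback, the patch's "if scalar keys %attrs" guard would treat an attribute-less tag as though it had attributes, which is presumably why the two changes travel together.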