Bug #7319 for HTML-TableExtract: Feature request: select tables by number of columns

Wed Aug 11 05:30:19 2004 Guest - Ticket created

Subject: Feature request: select tables by number of columns

Feature request: allow table selection by the number of columns the table has.
Example:
new HTML::TableExtract(columns => 4);
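Until such an option exists, the same selection can be approximated by filtering after extraction. The following is a minimal sketch, not part of HTML::TableExtract: `tables_with_columns` is a hypothetical helper, and each table is assumed to be an array reference of rows (each row an array reference of cells), with the first row taken as the column count.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): keep only the
# tables whose first row has exactly $want cells. Each table is assumed
# to be an array reference of rows, each row an array reference of cells.
sub tables_with_columns {
    my ($want, @tables) = @_;
    return grep {
        my $first_row = $_->[0];
        defined $first_row && @$first_row == $want;
    } @tables;
}

# Illustrative data only: three tables with 3, 4 and 2 columns.
my @tables = (
    [ [qw(head1 head2 head3)],       [qw(a b c)]   ],
    [ [qw(head1 head2 head3 head4)], [qw(a b c d)] ],
    [ [qw(x y)],                     [qw(1 2)]     ],
);

my @four = tables_with_columns(4, @tables);
print scalar(@four), " table(s) with 4 columns\n";
# prints "1 table(s) with 4 columns"
```

Measuring only the first row is a simplification; tables with ragged rows or colspan cells would need a rule for which row defines the width.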

Thu Feb 24 22:59:37 2005 MSISK [...] cpan.org - Taken

Sat Jan 12 20:16:48 2008 davidrw [...] cpan.org - Correspondence added

Subject:	[PATCH] Feature request: select tables by number of columns
From:	davidrw [...] cpan.org

On Wed Aug 11 05:30:19 2004, guest wrote:

> Feature request: allow table selection by the number of columns the
> table has.
> Example:
> new HTML::TableExtract(columns => 4);

Attached is a patch (including POD update) and a test file.

*** ../HTML-TableExtract-2.10/lib/HTML/TableExtract.pm Sat Jul 15 19:52:34 2006
--- lib/HTML/TableExtract.pm Sat Jan 12 19:07:17 2008
***************
*** 52,57 ****
--- 52,58 ----
      depth => undef,
      count => undef,
      attribs => undef,
+     columns => undef,
      subtables => undef,
      gridmap => 1,
      decode => 1,
***************
*** 317,328 ****
        keep_html => $self->{keep_html},
        strip_html_on_match => $self->{strip_html_on_match},
        parent_table => $pts,
      );

    # Target constraints. There is no point in passing any of these along
    # if we are under an umbrella. Notice that with table states, "depth"
    # and "count" are absolute coordinates recording where this table was
!   # created, whereas "tdepth" and "tcount" are the target constraints.
    # Headers have "absolute" meaning, therefore are passed by the
    # same name.
    if (!$umbrella) {
--- 318,330 ----
        keep_html => $self->{keep_html},
        strip_html_on_match => $self->{strip_html_on_match},
        parent_table => $pts,
+       tcolumns => $self->{columns},
      );

    # Target constraints. There is no point in passing any of these along
    # if we are under an umbrella. Notice that with table states, "depth"
    # and "count" are absolute coordinates recording where this table was
!   # created, whereas "tdepth", "tcount", and "tcolumns" are the target constraints
    # Headers have "absolute" meaning, therefore are passed by the
    # same name.
    if (!$umbrella) {
***************
*** 356,361 ****
--- 358,365 ----
      $ts->_exit_row;
    }

+   $ts->{columns} = scalar @{ $ts->{grid}->[0] };
+
    # transform from tree to grid using our rasterized template
    $ts->_grid_map();

***************
*** 438,444 ****
      my $class = ref($that) || $that;
      # Note:
      # - 'depth' and 'count' are where this table were found.
!     # - 'tdepth' and 'tcount' are target constraints on which to trigger.
      # - 'headers' represent a target constraint, location independent.
      # - 'attribs' represent target table tag constraints
      my $self = {
--- 442,449 ----
      my $class = ref($that) || $that;
      # Note:
      # - 'depth' and 'count' are where this table were found.
!     # - 'columns' is the number of columns in this table.
!     # - 'tdepth', 'tcount', and 'tcolumns' are target constraints on which to trigger.
      # - 'headers' represent a target constraint, location independent.
      # - 'attribs' represent target table tag constraints
      my $self = {
***************
*** 447,452 ****
--- 452,458 ----
        in_cell => 0,
        rc => -1,
        cc => -1,
+       columns => 0,
        grid => [],
        translation => [],
        hrow => [],
***************
*** 569,581 ****
--- 575,596 ----
  sub _check_dtrigger {
    # depth
    my $self = shift;
+   return 1 if $self->{umbrella};
    return 1 unless defined $self->{tdepth};
    $self->{tdepth} == $self->{depth} ? 1 : 0;
  }

+ sub _check_columns_trigger {
+   # depth
+   my $self = shift;
+   return 1 unless defined $self->{tcolumns};
+   $self->{tcolumns} == $self->{columns} ? 1 : 0;
+ }
+
  sub _check_ctrigger {
    # count
    my $self = shift;
+   return 1 if $self->{umbrella};
    return 1 unless defined $self->{tcount};
    return 1 if (exists $self->{counts}[$self->{depth}] &&
                 $self->{tcount} == $self->{counts}[$self->{depth}]);
***************
*** 585,590 ****
--- 600,606 ----
  sub _check_atrigger {
    # attributes
    my $self = shift;
+   return 1 if $self->{umbrella};
    return 1 unless scalar keys %{$self->{tattribs}};
    return 0 unless scalar keys %{$self->{attribs}};
    my $a_hit = 1;
***************
*** 690,700 ****

  sub _check_triggers {
    my $self = shift;
!   return 1 if $self->{umbrella};
!   $self->_check_dtrigger &&
!   $self->_check_ctrigger &&
!   $self->_check_atrigger &&
!   $self->_check_htrigger;
  }

  ### Maintain table context
--- 706,718 ----

  sub _check_triggers {
    my $self = shift;
!   return
!     $self->_check_dtrigger
!     && $self->_check_ctrigger
!     && $self->_check_atrigger
!     && $self->_check_columns_trigger
!     && $self->_check_htrigger
!   ;
  }

  ### Maintain table context
***************
*** 1327,1335 ****
  objects. Tables can be extracted as text, HTML, or HTML::ElementTable
  structures (for in-place editing or manipulation).

! There are currently four constraints available to specify which tables
  you would like to extract from a document: I<Headers>, I<Depth>,
! I<Count>, and I<Attributes>.

  I<Headers>, the most flexible and adaptive of the techniques, involves
  specifying text in an array that you expect to appear above the data in
--- 1345,1353 ----
  objects. Tables can be extracted as text, HTML, or HTML::ElementTable
  structures (for in-place editing or manipulation).

! There are currently five constraints available to specify which tables
  you would like to extract from a document: I<Headers>, I<Depth>,
! I<Count>, I<Columns>, and I<Attributes>.

  I<Headers>, the most flexible and adaptive of the techniques, involves
  specifying text in an array that you expect to appear above the data in
***************
*** 1357,1366 ****
  starting with 0. Providing both a I<depth> and a I<count> will uniquely
  specify a table within a document.

  I<Attributes> match based on the attributes of the html E<lt>tableE<gt>
  tag, for example, boder widths or background color.

! Each of the I<Headers>, I<Depth>, I<Count>, and I<Attributes>
  specifications are cumulative in their effect on the overall extraction.
  For instance, if you specify only a I<Depth>, then you get all tables at
  that depth (note that these could very well reside in separate higher-
--- 1375,1386 ----
  starting with 0. Providing both a I<depth> and a I<count> will uniquely
  specify a table within a document.

+ I<Columns> matches on tables with exactly N columns.
+
  I<Attributes> match based on the attributes of the html E<lt>tableE<gt>
  tag, for example, boder widths or background color.

! Each of the I<Headers>, I<Depth>, I<Count>, I<Columns>, and I<Attributes>
  specifications are cumulative in their effect on the overall extraction.
  For instance, if you specify only a I<Depth>, then you get all tables at
  that depth (note that these could very well reside in separate higher-
***************
*** 1369,1379 ****
  all depths are returned (i.e., the I<n>th occurrence of a table at each
  depth). If you only specify I<Headers>, then you get all tables in the
  document containing those column headers. If you have specified multiple
! constraints of I<Headers>, I<Depth>, I<Count>, and I<Attributes>, then
  each constraint has veto power over whether a particular table is
  extracted.

! If no I<Headers>, I<Depth>, I<Count>, or I<Attributes> are specified,
  then all tables match.

  When extracting only text from tables, the text is decoded with
--- 1389,1399 ----
  all depths are returned (i.e., the I<n>th occurrence of a table at each
  depth). If you only specify I<Headers>, then you get all tables in the
  document containing those column headers. If you have specified multiple
! constraints of I<Headers>, I<Depth>, I<Count>, I<Columns>, and I<Attributes>, then
  each constraint has veto power over whether a particular table is
  extracted.

! If no I<Headers>, I<Depth>, I<Count>, I<Columns>, or I<Attributes> are specified,
  then all tables match.

  When extracting only text from tables, the text is decoded with
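The trigger logic in the patch can be modeled in isolation. The sketch below is a simplified stand-in, not the module's code: the function names are hypothetical, plain hashes replace the table-state objects, and only the depth and columns checks are reproduced. As in the patch, an unset target constraint never vetoes, and the umbrella shortcut sits inside the depth check, so the column check still runs for tables under an umbrella.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the patched trigger methods. The hash keys
# mirror the patch: depth/columns describe the table as found, while
# tdepth/tcolumns are the target constraints.
sub check_depth {
    my ($t) = @_;
    return 1 if $t->{umbrella};            # shortcut lives in the depth check
    return 1 unless defined $t->{tdepth};  # unset constraint never vetoes
    return $t->{tdepth} == $t->{depth} ? 1 : 0;
}

sub check_columns {
    my ($t) = @_;
    return 1 unless defined $t->{tcolumns};
    return $t->{tcolumns} == $t->{columns} ? 1 : 0;
}

# Every constraint has veto power: all checks must pass for a match.
sub check_triggers {
    my ($t) = @_;
    return check_depth($t) && check_columns($t) ? 1 : 0;
}

# Illustrative table state: found at depth 0 with 3 columns.
my %state = (depth => 0, columns => 3);
for my $want (3, 4) {
    my $hit = check_triggers({ %state, tcolumns => $want });
    printf "columns => %d: %s\n", $want, $hit ? "match" : "veto";
}
# prints "columns => 3: match" then "columns => 4: veto"
```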

#!/usr/bin/perl

use strict;
use warnings;
use Test::More tests => 24;
use HTML::TableExtract;

my $html = do{ local $/ = undef; <DATA> };
my $te;

sub do_parse {
  my $label = shift;
  my $expected_table_ct = shift;
  my $te = HTML::TableExtract->new( @_ );
  isa_ok($te, 'HTML::TableExtract', "[$label] got obj");
  ok($te->parse($html), "[$label] parse_file");
  my @t = $te->tables;
  is(@t, $expected_table_ct, "[$label] extract count");
  return $te;
}

my @t;
do_parse("no options", 4 );
do_parse("col=2", 1, columns => 2 );
do_parse("col=3", 2, columns => 3 );
do_parse("col=4", 1, columns => 4 );
do_parse("col=6", 0, columns => 6 );
do_parse("head1,2", 2, headers => [qw/head1 head2/] );
do_parse("head1,2;col=3", 1, headers => [qw/head1 head2/], columns => 3 );
do_parse("head1,2;col=4", 1, headers => [qw/head1 head2/], columns => 4 );

__DATA__
<html>
<head><title>TableExtract Test HTML</title></head>
<body>

<table>
  <tr><th>head1</th><th>head2</th><th>head3</th></tr>
  <tr><td>col1</td><td>col2</td><td>col3</td></tr>
  <tr><td>col1</td><td>col2</td><td>col3</td></tr>
</table>

<table>
  <tr><th>head1</th><th>head2</th><th>head3</th><th>head4</th></tr>
  <tr><td>col1</td><td>col2</td><td>col3</td><td>col4</td></tr>
  <tr><td>col1</td><td>col2</td><td>col3</td><td>col4</td></tr>
</table>

<table>
  <tr foo="row0">
    <th foo="cell0,0">cell0-0</th>
    <th foo="cell0,1">cell0-1</th>
    <th foo="cell0,2">cell0-2</th>
  </tr>
  <tr foo="row1">
    <td foo="cell1,0">cell1-0</td>
    <td foo="cell1,1">cell1-1</td>
    <td foo="cell1,2">cell1-2</td>
  </tr>
  <tr foo="row2">
    <td foo="cell2,0">cell2-0</td>
    <td foo="cell2,1">cell2-1</td>
    <td foo="cell2,2">cell2-2</td>
  </tr>
  <tr foo="row3">
    <td foo="cell3,0">cell3-0</td>
    <td foo="cell3,1">cell3-1</td>
    <td foo="cell3,2">cell3-2</td>
  </tr>
  <tr foo="row4">
    <td foo="cell4,0" colspan=3>
      <table>
        <tr foo="t2row0">
          <th foo="t2cell0,0">t2cell0-0</th>
          <th foo="t2cell0,1">t2cell0-1</th>
        </tr>
        <tr foo="t2row1">
          <td foo="t2cell1,0">t2cell1-0</td>
          <td foo="t2cell1,1">t2cell1-1</td>
        </tr>
      </table>
    </td>
  </tr>
</table>

</body>
</html>

Sat Jan 12 20:17:03 2008 The RT System itself - Status changed from 'new' to 'open'

Sat Jan 12 20:21:47 2008 davidrw [...] cpan.org - Correspondence added

On Sat Jan 12 20:16:48 2008, DAVIDRW wrote:

> Attached is a patch (including POD update) and a test file.

The patch is against HTML::TableExtract-2.10, and the test suite passes both before and after applying it (Perl v5.6.1; Linux 2.4.21-32.0.1.EL i686 unknown).