Subject: | More flexible header-based table selection |
Distribution: TableExtract-2.06
Perl: 5.8.3
OS: HP-UX 11.23 (Itanium)
The "headers" argument to select a table appears to cause a match based
on a subset of a table's columns, i.e.
$t = HTML::TableExtract->new( headers => [ '^h1$', '^h2$' ] );
will succeed for tables that merely _include_ headers h1 and h2 anywhere
in their initial rows.
I have just encountered a situation where the table I wanted had
columns h1 and h2, but another table in the same HTML document had
columns h1, h2 and h3. As expected, the extractor matched them both.
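The subset behaviour can be reproduced without the module. The following is only a standalone sketch of the matching rule as described above; subset_match is a hypothetical helper, not a function of HTML::TableExtract, and the header rows are hard-coded rather than parsed from HTML.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): true if every
# pattern matches at least one header cell, whatever else the row holds.
sub subset_match {
    my ($patterns, $cells) = @_;
    for my $pat (@$patterns) {
        return 0 unless grep { /$pat/ } @$cells;
    }
    return 1;
}

my @patterns = ('^h1$', '^h2$');

# Both header rows satisfy the subset rule, so both tables are selected.
for my $row ([qw(h1 h2)], [qw(h1 h2 h3)]) {
    printf "%-10s %s\n", "@$row", subset_match(\@patterns, $row) ? 'match' : 'no match';
}
```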
Just for fun (!) I modified TableExtract.pm (mostly in the
_check_htrigger function) to act on an additional boolean argument
"strict": if this is enabled, the table is rejected on the first failed
column match. By default it is disabled to retain current behaviour.
In the attached module, the modified lines are tagged "# HDB".
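For readers without the attachment, here is a minimal standalone sketch of one possible reading of "strict": every header cell must match some pattern, and every pattern must be matched. The helper name strict_match is hypothetical, and the actual changes to _check_htrigger in the attached module may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating one reading of "strict"; this is not
# the code from the attached module.
sub strict_match {
    my ($patterns, $cells) = @_;
    my %seen;
    for my $cell (@$cells) {
        # Index of the first pattern that matches this header cell, if any.
        my ($hit) = grep { $cell =~ /$patterns->[$_]/ } 0 .. $#$patterns;
        return 0 unless defined $hit;    # reject on the first failed column
        $seen{$hit} = 1;
    }
    # Accept only if every pattern was matched by some column.
    return keys(%seen) == @$patterns ? 1 : 0;
}

my @patterns = ('^h1$', '^h2$');
for my $row ([qw(h1 h2)], [qw(h1 h2 h3)]) {
    printf "%-10s %s\n", "@$row", strict_match(\@patterns, $row) ? 'match' : 'no match';
}
```

Under this reading the (h1, h2) table is selected and the (h1, h2, h3) table is rejected at column h3.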
I suggest the following enhancements:
(1) Selection of tables based on a "strict" match of the headers.
(2) A further permutation might be to select tables based on the
_sequence_ of headers as passed by the "headers" argument, e.g. the form
$t = HTML::TableExtract->new( headers => [ '^h1$', '^h2$' ] );
would select tables with headers like (x, h1, y, h2) but reject headers
like (h2, x, h1).
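Suggestion (2) amounts to an ordered subsequence test. A minimal sketch, assuming unrelated columns may appear between the wanted headers; ordered_match is a hypothetical name and nothing here is taken from the module:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the patterns match header cells in the
# order given, with any number of other columns in between.
sub ordered_match {
    my ($patterns, $cells) = @_;
    my $next = 0;    # index of the next pattern still to be found
    for my $cell (@$cells) {
        last if $next > $#$patterns;    # all patterns already found
        $next++ if $cell =~ /$patterns->[$next]/;
    }
    return $next == @$patterns ? 1 : 0;
}

my @patterns = ('^h1$', '^h2$');
for my $row ([qw(x h1 y h2)], [qw(h2 x h1)]) {
    printf "%-12s %s\n", "@$row", ordered_match(\@patterns, $row) ? 'match' : 'no match';
}
```

With these inputs (x, h1, y, h2) is accepted and (h2, x, h1) is rejected, as in the example above.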
Subject: | TableExtract.pm |
Message body is not shown because it is too large.