This queue is for tickets about the HTML-TableExtract CPAN distribution.

Report information
The Basics
Id: 17449
Status: new
Priority: 0/
Queue: HTML-TableExtract

People
Owner: Nobody in particular
Requestors: hbryden [...] emergency.qld.gov.au
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: (no value)



Subject: More flexible header-based table selection
Distribution: TableExtract-2.06
Perl: 5.8.3
OS: HP-UX 11.23 (Itanium)

The "headers" argument used to select a table appears to match on a subset of a table's columns, i.e.

    $t = HTML::TableExtract->new( headers => [ '^h1$', '^h2$' ] );

will succeed for tables that merely _include_ headers h1 and h2 anywhere in their initial rows.

I've just encountered a situation where the table I wanted had columns h1 and h2, but another one in the HTML document had columns h1, h2 and h3. Of course new() matched them both.

Just for fun (!) I modified TableExtract.pm (mostly in the _check_htrigger function) to act on an additional boolean argument "strict": if this is enabled, the table is rejected on the first failed column match. It is disabled by default to retain the current behaviour. In the attached module, the modified lines are tagged "# HDB".

I suggest the following enhancements:

(1) Selection of tables based on a "strict" match of the headers.

(2) A further permutation might be to select tables based on the _sequence_ of headers as passed by the "headers" argument, e.g. the form

    $t = HTML::TableExtract->new( headers => [ '^h1$', '^h2$' ] );

would select tables with headers like (x, h1, y, h2) but reject headers like (h2, x, h1).
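To make the three matching behaviours discussed above concrete, here is a rough standalone sketch in plain Perl. It does not use HTML::TableExtract and is not the attached patch: headers_match() and its mode names ('subset', 'strict', 'sequence') are hypothetical, and the reading of "strict" as "every header cell must match one of the patterns" is an assumption based on the description in this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of HTML::TableExtract.
#   $patterns : array ref of regex strings (as passed to "headers")
#   $cells    : array ref of header cell text from one table row
#   $mode     : 'subset' (default), 'strict' or 'sequence'
sub headers_match {
    my ($patterns, $cells, $mode) = @_;
    $mode = 'subset' unless defined $mode;
    my @re = map { qr/$_/ } @$patterns;

    if ($mode eq 'sequence') {
        # Patterns must be found left to right; other cells may sit between.
        my $next = 0;    # index of the next pattern still to be found
        for my $cell (@$cells) {
            last if $next >= @re;
            $next++ if $cell =~ $re[$next];
        }
        return $next == @re ? 1 : 0;
    }

    # 'subset' and 'strict': each pattern may be claimed by one cell only.
    my @used = (0) x @re;
    for my $cell (@$cells) {
        my ($hit) = grep { !$used[$_] && $cell =~ $re[$_] } 0 .. $#re;
        if (defined $hit) {
            $used[$hit] = 1;
        }
        elsif ($mode eq 'strict') {
            return 0;    # assumed meaning: reject on the first unmatched cell
        }
    }

    # Every pattern must have matched some cell.
    return (grep { !$_ } @used) ? 0 : 1;
}

# Small demonstration with the headers from this report.
my @want = ('^h1$', '^h2$');
for my $row ([qw(h1 h2)], [qw(h1 h2 h3)], [qw(x h1 y h2)], [qw(h2 x h1)]) {
    printf "%-12s subset=%d strict=%d sequence=%d\n",
        join(',', @$row),
        headers_match(\@want, $row, 'subset'),
        headers_match(\@want, $row, 'strict'),
        headers_match(\@want, $row, 'sequence');
}
```

Under these assumptions, 'strict' is what separates the (h1, h2) table from the (h1, h2, h3) table, while 'sequence' only constrains the order and still tolerates extra columns.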
Subject: TableExtract.pm

Message body is not shown because it is too large.