Subject: Inconsistent treatment of whitespace for header matching - string vs regexp
When matching a table header against a string, HTML::TableExtract strips
leading and trailing whitespace from the header before comparing it.
But when matching against a regexp, it does not. So if you convert code
from string matching to regexp matching, a regexp can fail to match the
HTML document even though it matches the literal string you used before,
and that literal string does match the document.
This should make it a bit clearer what I mean:
#!/usr/bin/perl
use warnings;
use strict;
use 5.014;
use HTML::TableExtract;
my $html = <<'END';
<html>
<body>
<table>
<tr><th> a </th><th> b </th></tr>
<tr><td>x</td><td>y</td></tr>
</table>
</body>
</html>
END

# Try the same two headers as plain strings, unanchored regexps,
# and anchored regexps.
foreach my $ref ([ 'a', 'b' ], [ qr/a/, qr/b/ ], [ qr/\Aa\z/, qr/\Ab\z/ ]) {
    my @headers = @$ref;
    my $te = HTML::TableExtract->new(headers => \@headers);
    $te->parse($html);
    my $found = scalar $te->tables;
    say "@headers: $found";
}
Here, although 'a', 'b' extracts the table, the anchored regexps
qr/\Aa\z/, qr/\Ab\z/, which match exactly those same strings, do not.
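In the meantime, one possible workaround is to allow for the surrounding
whitespace in the regexp itself. The sketch below uses plain Perl only and
assumes the header text reaches the regexp untrimmed, as ' a '.
padded_header is a hypothetical helper name, not part of HTML::TableExtract.

```perl
use strict;
use warnings;

# Hypothetical helper: build an anchored regexp for a literal header that
# also accepts leading and trailing whitespace.
sub padded_header {
    my ($literal) = @_;
    return qr/\A\s*\Q$literal\E\s*\z/;
}

my $raw = ' a ';    # header text as written in the <th> cell of the example

# The strictly anchored regexp fails on the untrimmed text ...
print "strict: ", ($raw =~ qr/\Aa\z/ ? 'match' : 'no match'), "\n";
# ... while the whitespace-tolerant one succeeds.
print "padded: ", ($raw =~ padded_header('a') ? 'match' : 'no match'), "\n";
```

Something like headers => [ padded_header('a'), padded_header('b') ] could
then stand in for the anchored regexps, at the cost of every caller having
to remember to do it.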
I can see that it's just possible this could be deliberate: somebody
might really want to match a table based on leading or trailing
whitespace in headers, and regexps let them do that. But that seems so
far-fetched that I think it is better to be consistent. I suggest you
strip leading and trailing whitespace from the header before doing a
regexp match, just as is done before a string match.
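To illustrate the suggestion, here is a minimal sketch of trimming once
and then applying the same trimmed text to either kind of header.
header_matches is a hypothetical function written for this report, not the
module's actual code, and the string branch uses plain equality purely for
illustration; the module's real string comparison may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not HTML::TableExtract internals: strip leading and
# trailing whitespace first, then match a string or a regexp against the
# same trimmed text.
sub header_matches {
    my ($cell_text, $pattern) = @_;
    (my $trimmed = $cell_text) =~ s/\A\s+|\s+\z//g;
    return ref($pattern) eq 'Regexp'
        ? scalar($trimmed =~ $pattern)
        : $trimmed eq $pattern;    # plain equality, for illustration only
}

# With trimming applied up front, all three forms agree on ' a '.
for my $pattern ('a', qr/a/, qr/\Aa\z/) {
    my $result = header_matches(' a ', $pattern) ? 'match' : 'no match';
    print "$pattern: $result\n";
}
```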