This queue is for tickets about the HTML-TableExtract CPAN distribution.

Report information
The Basics
Id: 76073
Status: new
Priority: 0/
Queue: HTML-TableExtract

People
Owner: Nobody in particular
Requestors: EDAVIS [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 2.10
Fixed in: (no value)



Subject: Inconsistent treatment of whitespace for header matching - string vs regexp
When matching a table header against a string, HTML::TableExtract strips leading and trailing whitespace from the header before comparing it. But when matching against a regexp it does not. So if you have code that matches using strings and you change it over to using regexps, it can happen that a regexp does not match the HTML document, even though it matches the literal string you used before and that literal string matches the document.

This should make it a bit clearer what I mean:

#!/usr/bin/perl
use warnings;
use strict;
use 5.014;
use HTML::TableExtract;
my $html = <<END
<html>
<body>
<table>
<tr><th> a </th><th> b </th></tr>
<tr><td>x</td><td>y</td></tr>
</table>
</body>
</html>
END
;
foreach my $ref ([ 'a', 'b' ], [ qr/a/, qr/b/ ], [ qr/\Aa\z/, qr/\Ab\z/ ]) {
    my @headers = @$ref;
    my $te = new HTML::TableExtract headers => \@headers;
    $te->parse($html);
    my $found = scalar $te->tables;
    say "@headers: $found";
}

Here although 'a', 'b' works to extract the table, the corresponding regexps to match the same exact strings do not work.

I can see that it's just possible this could be deliberate: somebody might really want to match a table based on leading or trailing whitespace in headers, and regexps let them do that. But that seems so far-fetched that I think it is better to be consistent. I suggest you strip leading and trailing whitespace from the header before doing a regexp match, just as is done before a string match.
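
To illustrate the suggested behaviour, here is a minimal sketch of a matcher that trims the header text once and then applies either kind of pattern to the trimmed value. The helper name header_matches is hypothetical and is not part of HTML::TableExtract, and the use of plain string equality for the non-regexp case is an assumption made for the example; the module's real comparison logic may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not HTML::TableExtract code.
# Trims the header cell text first, then matches it against either a
# plain string or a compiled regexp, so both kinds of pattern see the
# same input.
sub header_matches {
    my ($cell_text, $want) = @_;

    # Strip leading and trailing whitespace from a copy of the cell text.
    (my $trimmed = $cell_text) =~ s/\A\s+|\s+\z//g;

    # Assumption: plain strings are compared with eq; regexps are applied as-is.
    return ref($want) eq 'Regexp'
        ? scalar($trimmed =~ $want)
        : $trimmed eq $want;
}

# The header cell from the test document above contains ' a '.
for my $want ('a', qr/a/, qr/\Aa\z/) {
    printf "%s: %s\n", $want,
        header_matches(' a ', $want) ? 'match' : 'no match';
}
```

With the trimming done before the pattern type is checked, the anchored regexp qr/\Aa\z/ sees 'a' rather than ' a ', so it agrees with the string case.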