Subject: | Boilerplate stripping |
Boilerplate isn't being stripped for Tucows domains.
Try whois myfairpoint.net.
Also, GoDaddy domains are almost fully stripped; a couple of lines are left:
blank
permission of Godaddy.com, LLC. By submitting an inquiry,
blank
in the "registrant" section. In most cases, GoDaddy.com, LLC
Other whois services also produce boilerplate that isn't skipped.
It seems that matching exact text is a game of whack-a-mole.
I wondered if deleting everything from the 'Last update of WHOIS database' line to the end would do better...
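For illustration, the truncation idea on its own might look like the sketch below. The helper name truncate_after_last_update and the sample response are made up for this example; it assumes the marker text appears once, on a single line, and keeps that line.

```perl
use strict;
use warnings;

# Hypothetical helper: keep everything up to and including the
# 'Last update of WHOIS database' line and drop whatever follows.
sub truncate_after_last_update {
    my ($text) = @_;
    # /s lets '.' cross newlines; /i tolerates case differences.
    # The lazy .*? stops at the first marker; [^\n]* keeps the rest of that line.
    if ( $text =~ /\A(.*?Last update of WHOIS database[^\n]*)/si ) {
        return "$1\n";
    }
    # No marker found: return the text unchanged.
    return $text;
}

# Made-up sample response, for illustration only
my $sample = join "\n",
    'Domain Name: EXAMPLE.NET',
    'Registrar: Example Registrar, Inc.',
    '>>> Last update of WHOIS database: 2013-01-01T00:00:00Z <<<',
    'NOTICE: this text stands in for the legal boilerplate.',
    '';

print truncate_after_last_update($sample);
```

This only removes the trailing block; boilerplate that appears before the marker is left alone, which is where the other heuristics come in.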
A couple of other heuristics:
Many boilerplate lines begin with "# " or '% '; they could be removed.
Non-boilerplate lines are of the form <label>:<eol> or <label>:<value><eol>.
Many responses use '>>> Last update of WHOIS database: date <<<' as an end marker.
Given that, this filter seems to work:
sub whois_edit {
    my $text = shift;
    my @lines;

    # Remove comment lines (those starting with '%' or '#')
    $text =~ s/^\s*[%#].*?\r?\n\r?//mg;

    # These two are for HTML output; uncomment to escape angle brackets
    # $text =~ s/</&lt;/g;
    # $text =~ s/>/&gt;/g;

    # Process remaining lines
    @lines = split /\r?\n\r?/, $text;
    $text  = '';
    foreach my $line (@lines) {
        # Stop at the 'last update' line - boilerplate follows
        if ( $line =~ /Last update of WHOIS database:/i ) {
            $text .= "$line\n";
            last;
        }

        # Valid lines are <label>:<value>? ; skip everything else
        next unless $line =~ /^[\w_ -]+:\s*/;
        $text .= "$line\n";
    }

    return $text;
}
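As a quick check of the label heuristic, the sketch below runs the same pattern against the two leftover GoDaddy lines quoted earlier and two label lines. The helper name is_label_line and the 'EXAMPLE.NET' data line are made up for this example.

```perl
use strict;
use warnings;

# Same label pattern as used in whois_edit
my $label_re = qr/^[\w_ -]+:\s*/;

# Hypothetical helper: returns 1 if the line looks like <label>:<value>, else 0
sub is_label_line {
    my ($line) = @_;
    return $line =~ $label_re ? 1 : 0;
}

my @samples = (
    'Domain Name: EXAMPLE.NET',    # made-up data line
    'Registrar:',                  # label with no value
    'permission of Godaddy.com, LLC. By submitting an inquiry,',
    'in the "registrant" section. In most cases, GoDaddy.com, LLC',
);

# Print 'keep' or 'drop' next to each sample line
for my $line (@samples) {
    printf "%-4s %s\n", ( is_label_line($line) ? 'keep' : 'drop' ), $line;
}
```

Note that a boilerplate line such as 'NOTICE: ...' would also pass this pattern, so the filter still depends on such text appearing after the 'Last update' marker.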
The goal is not to encourage misuse of data, but to ease implementation of scripts that report problems.
Of course, maybe one day the new whois will be XML or JSON based... but let's not hold our breath.