This queue is for tickets about the Net-Whois-Raw CPAN distribution.

Report information
The Basics
Id: 91930
Status: resolved
Priority: 0/
Queue: Net-Whois-Raw

People
Owner: Nobody in particular
Requestors: tlhackque [...] yahoo.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 2.48
Fixed in: (no value)



Subject: Boilerplate stripping
Boilerplate isn't being stripped for tucows domains. Try whois myfairpoint.net.

Also, godaddy domains are almost stripped; a couple of lines are left:

    blank
    permission of Godaddy.com, LLC. By submitting an inquiry,
    blank
    in the "registrant" section. In most cases, GoDaddy.com, LLC

Other whois services also produce boilerplate that isn't skipped.

It seems that matching exact text is a game of whack-a-mole. I wondered if deleting from the 'Last update of WHOIS database' line to the end would do better...

A couple of other heuristics:

Many boilerplate lines begin with "# " or '% '; they could be removed.
Non-boilerplate lines are of the form <label>:<eol> or <label>:<value><eol>.
Many responses use '<<< Last update of WHOIS database: date <<<<' as an end marker.

Given that, this filter seems to work:

    sub whois_edit {
        my $text = shift;

        my @lines;

        # Remove comment lines
        $text =~ s/^\s*[%#].*?\r?\n\r?//mg;

        # These two for html output
        # $text =~ s/(<)/&lt;/g;
        # $text =~ s/(>)/&gt;/g;

        # Process remaining lines
        @lines = split /\r?\n\r?/, $text;
        $text = '';
        foreach my $line (@lines) {
            # Stop with 'last update' line - boilerplate follows
            if( $line =~ /Last update of WHOIS database:/i ) {
                $text .= "$line\n";
                last;
            }
            # Valid lines are <tag>:<value>?
            next unless( $line =~ /^[\w_ -]+:\s*/ );
            $text .= "$line\n";
        }
        return $text;
    }

The goal is not to encourage misuse of data, but to ease implementation of scripts that report problems.

Of course, maybe one day the new whois will be XML or JSON based... but let's not hold our breath.
Subject: Boilerplate stripping, revised
From: tlhackque [...] yahoo.com
A bit more tuning; it seems some servers (e.g. arin returning verizon ip data) don't use the tag:<value> format.

Add a heuristic to extract only the tagged lines iff there's at least one with a value. Otherwise, just strip comments and trailer lines.

    sub whois_edit {
        my $text = shift;

        my @lines;

        $text =~ s/^\s*[%#].*?\r?\n\r?//mg;

        #HTML
        # $text =~ s/(<)/&lt;/g;
        # $text =~ s/(>)/&gt;/g;

        my $tagged;
        if( $text =~ /^[\w_ -]+:\s*[\w_-]/m ) {
            $tagged = 1;
        }
        @lines = split /\r?\n\r?/, $text;
        $text = '';
        foreach my $line (@lines) {
            if( $line =~ /Last update of WHOIS database:/i ) {
                $text .= "$line\n";
                last;
            }
            if( $line =~ /^[\w_ -]+:\s*/ || !$tagged ) {
                $text .= "$line\n";
            }
        }
        return $text;
    }
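To make the "at least one line with a value" test concrete, here is a minimal sketch that isolates just that check. The helper name looks_tagged and both sample responses are made up for illustration; only the regexp is taken from the filter above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Net::Whois::Raw): true if at least one
# line looks like "<label>: <value>", using the regexp from the ticket.
sub looks_tagged {
    my ($text) = @_;
    return $text =~ /^[\w_ -]+:\s*[\w_-]/m ? 1 : 0;
}

# Invented sample responses, for illustration only.
my $tagged_sample   = "Domain Name: EXAMPLE.NET\nRegistrar: Example Registrar\n";
my $untagged_sample = "Example Networks EXAMPLE-NET (NET-192-0-2-0-1) 192.0.2.0 - 192.0.2.255\n";

print looks_tagged($tagged_sample)   ? "tagged\n" : "free-form\n";
print looks_tagged($untagged_sample) ? "tagged\n" : "free-form\n";
```

One thing to watch: \s also matches a newline, so a label with an empty value followed by a non-blank line is counted as having a value. Using [ \t]* instead of \s* would keep the test on a single line, if that is the intent.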
Some common boilerplate regexps will be in 2.55:

    our @strip_regexps = (
        qr{
            (.+)
            ^ (?:
                \W* Last \s update \s of \s WHOIS \s database
                |
                Database \s last \s updated
                |
                \W* Whois \s database \s was \s last \s updated \s on
            )
            \b .+ \z
        }xmsi,
    );

This will strip out the comments of Tucows, GoDaddy and many other whois servers.

Stripping "key: value" or other non-comment blocks is not a trivial thing. It's much easier and more reliable to use well-known comment signatures for that.
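The ticket doesn't show how these regexps get applied. A minimal sketch, assuming each one is used in a substitution that keeps the captured group $1 (everything before the "last update" line) and drops the rest; strip_trailer and the sample response are hypothetical names and data, not the module's actual code.

```perl
use strict;
use warnings;

# Same pattern as quoted above.
my @strip_regexps = (
    qr{
        (.+)
        ^ (?:
            \W* Last \s update \s of \s WHOIS \s database
            |
            Database \s last \s updated
            |
            \W* Whois \s database \s was \s last \s updated \s on
        )
        \b .+ \z
    }xmsi,
);

# Hypothetical helper: keep only what precedes the trailer marker.
# If no regexp matches, the text is returned unchanged.
sub strip_trailer {
    my ($text) = @_;
    for my $re (@strip_regexps) {
        $text =~ s/$re/$1/;
    }
    return $text;
}

# Invented sample response, for illustration only.
my $sample = join "\n",
    'Domain Name: EXAMPLE.NET',
    'Registrar: Example Registrar, Inc.',
    '>>> Last update of WHOIS database: 2014-01-07T00:00:00Z <<<',
    '',
    'NOTICE: The data in this record is provided for information purposes only.',
    '';

print strip_trailer($sample);
```

Because (.+) is greedy and /s lets the dot cross newlines, the match anchors on the last marker line in the response. Unlike the whois_edit filter earlier in the ticket, the marker line itself is removed along with everything after it.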