Subject: Incorrect tokenization in HTML::Parser
Date: Sat, 23 Feb 2013 17:42:30 +0000
To: "bug-HTML-Parser [...] rt.cpan.org" <bug-HTML-Parser [...] rt.cpan.org>
From: Carl Eklof <ceklof [...] thanxmedia.com>
Hi Gisle,
First, thank you for all of your huge contributions to Perl over the years!
I've discovered a site (http://www.scotts.com/) whose HTML is not tokenized correctly by HTML::Parser.
Environments (tried on two machines, same results):
* HTML::Parser 3.65 and 3.69
* Perl 5.14.2 and 5.10.1
* 'full_uname' => 'Linux 449876-app3.blosm.com 2.6.18-238.37.1.el5 #1 SMP Fri Apr 6 13:47:10 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux',
* 'os_distro' => 'Red Hat Enterprise Linux Server release 5.9 (Tikanga) Kernel \\r on an \\m',
* 'full_uname' => 'Linux idx02 2.6.43.5-2.fc15.x86_64 #1 SMP Tue May 8 11:09:22 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux',
* 'os_distro' => 'Fedora release 15',
I'm attaching a representative page. The page came from:
http://www.scotts.com/smg/templates/index.jsp?pageUrl=orthoLanding
The problem seems to occur around the HTML:
<noscript>
<iframe height="0" width="0" style="display:none; visibility:hidden;"
src="//www.googletagmanager.com/ns.html?id=GTM-PVLS"
/>
</noscript>
<script>
I've added some debugging to the HTML::TokeParser::get_tag sub so it looks like:
use Data::Dumper;
sub get_tag
{
    my $self = shift;
    my $token;
    while (1) {
        $token = $self->get_token || return undef;
        warn "Checking token: [".Dumper($token)."]";
        my $type = shift @$token;
        next unless $type eq "S" || $type eq "E";
        substr($token->[0], 0, 0) = "/" if $type eq "E";
        return $token unless @_;
        for (@_) {
            return $token if $token->[0] eq $_;
        }
    }
}
I've tried both versions 3.65 and 3.69 of HTML::Parser, and they produce the same results, shown in the "output" attachment. You can see on line 290 of the output that almost the entire page after the iframe is tokenized as one big text blob.
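A minimal sketch of a standalone reproduction, assuming the markup quoted above is enough on its own to trigger the behaviour. The trailing <p> element and the token_summary helper are made up for illustration; the report itself is based on the full attached page and the patched get_tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# The markup quoted above, followed by a made-up paragraph (an assumption,
# not taken from the real page) so there is something to tokenize afterwards.
my $html = <<'HTML';
<noscript>
<iframe height="0" width="0" style="display:none; visibility:hidden;"
src="//www.googletagmanager.com/ns.html?id=GTM-PVLS"
/>
</noscript>
<p>Hypothetical content after the iframe</p>
HTML

# Hypothetical helper: returns one [type, detail] pair per token.
sub token_summary {
    my ($doc) = @_;
    # A reference to a scalar is parsed as the literal document text.
    my $p = HTML::TokeParser->new(\$doc) or die "Cannot create parser";
    my @summary;
    while (my $token = $p->get_token) {
        my $type = $token->[0];
        # For T tokens element 1 is the text, so report only its length;
        # for S and E tokens element 1 is the tag name.
        my $detail = $type eq 'T'
            ? length($token->[1]) . ' chars of text'
            : $token->[1];
        push @summary, [$type, $detail];
    }
    return @summary;
}

printf "%s: %s\n", @$_ for token_summary($html);
```

If a single T token turns out to cover the trailing <p> markup, then the snippet alone shows the problem without needing the full page.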
Thanks again,
-Carl
Carl Eklof
CTO @ Blosm Inc.
http://blosm.com/
424.888.4BEE