Sun Jun 10 09:59:12 2007 ivacklin [...] cs.helsinki.fi - Ticket created
Subject: HTML::HeadParser doesn't grok some broken xhtml
Date: Sun, 10 Jun 2007 16:58:34 +0300
To: bug-HTML-Parser [...] rt.cpan.org
From: T Ilmari Vacklin <ivacklin [...] cs.helsinki.fi>
See <http://code-libre.org>. The XHTML has an initial bogus <option>
which is probably why headparser fails to extract any headers.
Wed Nov 05 16:57:07 2008 diberri [...] cpan.org - Correspondence added
This also occurs with variations on the <title> tag, such as:
<head>
<title>
some title</title>
</head>
"some title" is essentially ignored. I discovered this using WWW::Mechanize:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get('http://www.umm.edu/patiented/articles/what_other_drugs_used_parkinsons_disease_000051_8.htm');
print $mech->title, "\n";
The expected result is to print "Parkinson's disease", but nothing is
printed at all.
Cheers,
Dave
Wed Nov 05 16:57:09 2008 The RT System itself - Status changed from 'new' to 'open'
Mon Nov 17 04:24:04 2008 GAAS [...] cpan.org - Correspondence added
On Wed Nov 05 16:57:07 2008, DIBERRI wrote:
> This also occurs with variations on the <title> tag, such as:
>
> <head>
> <title>
> some title</title>
> </head>
>
> "some title" is essentially ignored.
The problem here was that HTML::HeadParser did not ignore the Unicode BOM in decoded
form. I have committed a change that will fix this (in 3.58).
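
For releases older than 3.58, one possible workaround is to drop the decoded BOM (the single character U+FEFF) before handing the text to the parser. The following is only a minimal sketch under that assumption: strip_decoded_bom is a hypothetical helper, not part of HTML::Parser, and the sample markup is taken from the report above with a UTF-8 BOM prepended for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::HeadParser;

# Hypothetical helper (not part of HTML::Parser): remove one leading
# U+FEFF from a string that has already been decoded to characters.
sub strip_decoded_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}

# UTF-8 bytes for a document that starts with a BOM (EF BB BF).
my $bytes = "\xEF\xBB\xBF<head>\n<title>\nsome title</title>\n</head>\n";

# Decoding as UTF-8 keeps the BOM, now as the single character U+FEFF.
my $html = decode('UTF-8', $bytes);

# Parse the cleaned text and read the title from the collected headers.
my $parser = HTML::HeadParser->new;
$parser->parse(strip_decoded_bom($html));

my $title = $parser->header('Title');
print defined $title ? "title: $title\n" : "no title found\n";
```

With 3.58 or later the helper should be unnecessary, since the parser itself skips the BOM.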
Mon Nov 17 04:36:43 2008 GAAS [...] cpan.org - Status changed from 'open' to 'resolved'