Subject: Bug in HTML::TreeBuilder - <link> inside <ol> not parsed
Date: Fri, 22 Jan 2010 22:38:33 +0000 (GMT)
To: bug-html-tree [...] rt.cpan.org
From: charles woodcock <charles_woodc [...] yahoo.co.uk>
Hi,
Firstly, thanks for writing and maintaining HTML::Tree as open-source code. My choice of Perl as a language depended mainly on the presence of that module, so your effort is really appreciated.
Using HTML::Tree v3.23.
I have noticed that some Google search pages include a <link rel=prefetch> element inside the <ol> element containing the search results. I think this HTML only gets sent to Firefox users (Firefox being the only browser that supports this sort of element), and even then, not on every page.
I don't know if this is actually valid HTML5 because I am not familiar with the specification, but I suppose your philosophy would be to accept the mark-up in the same way that Firefox renders it, i.e. by rendering what Google intended.
In the meantime, I may as well note that a possible workaround for parsing Google search results is to send a non-Firefox User-Agent header, although I haven't tested it yet.
E.g. the following URL contains the particular mark-up:
http://www.google.nl/search?hl=nl&q=unive&start=0
Reproducible test HTML:
----------------
<!doctype html>
<head><title>unive - Google zoeken</title></head>
<body id=gsr topmargin=3 marginheight=3>
<ol><link rel=prefetch href="http://www.unive.nl/">
<li class=g>
</ol>
----------------
(N.B. page on Google omits closing <body> tag)
I would expect HTML::TreeBuilder to parse the mark-up so that the <link> element appears inside the <ol> element. What it actually does (I think) is assume that the <link> element marks the end of the <ol> element, i.e.
<ol></ol><link rel=prefetch href="http://www.unive.nl/">
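A minimal script along the following lines should show where the <link> element ends up in the parsed tree. It is only a sketch: it feeds the test HTML above to HTML::TreeBuilder with default options, and the variable names ($html, $tree, $link, $parent_tag) are just illustrative. I make no claim about the exact output, since that is the thing in question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Test HTML copied from the report above.
my $html = <<'END_HTML';
<!doctype html>
<head><title>unive - Google zoeken</title></head>
<body id=gsr topmargin=3 marginheight=3>
<ol><link rel=prefetch href="http://www.unive.nl/">
<li class=g>
</ol>
END_HTML

# Parse with default TreeBuilder settings.
my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Locate the <link> element and record the tag name of its parent.
my $link = $tree->look_down(_tag => 'link');
my $parent_tag;
if ($link && $link->parent) {
    $parent_tag = $link->parent->tag;
}

if (defined $parent_tag) {
    print "<link> parent: <$parent_tag>\n";
} else {
    print "<link> element not found, or it has no parent\n";
}

# Print the whole tree structure to STDOUT for inspection.
$tree->dump;

# Free the tree explicitly.
$tree->delete;
```

If the parent reported is anything other than <ol>, that confirms the behaviour described above; the dump output shows where the element was actually placed.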
Yours sincerely,
Charles Woodcock