This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 53926
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Jeff.Fearn [...] gmail.com
Requestors: charles_woodc [...] yahoo.co.uk
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Bug in HTML::TreeBuilder - <link> inside <ol> not parsed
Date: Fri, 22 Jan 2010 22:38:33 +0000 (GMT)
To: bug-html-tree [...] rt.cpan.org
From: charles woodcock <charles_woodc [...] yahoo.co.uk>
Hi,

Firstly, thanks for writing and maintaining HTML::Tree as open-source code. My choice of Perl as a language depended mainly on the presence of that module, so your effort is really appreciated.

Using HTML::Tree v3.23.

I have noticed that some Google search pages include a <link rel=prefetch> element inside the <ol> element containing the search results. I think this HTML only gets sent to Firefox users (as the only browser that supports this sort of element), and even then, not on every page.

I don't know if this is actually valid HTML5 because I am not familiar with the specification, but I suppose your philosophy would be to accept the mark-up in the same way that Firefox renders it, i.e. by rendering what Google intended.

In the meantime, I may as well note that a workaround is just to send a non-Firefox User-Agent header in order to parse Google search results. (Although I haven't yet tested it.)

E.g. the following URL contains the particular mark-up:
http://www.google.nl/search?hl=nl&q=unive&start=0

Reproducible test HTML:
----------------
<!doctype html>
<head><title>unive - Google zoeken</title></head>
<body id=gsr topmargin=3 marginheight=3>
<ol><link rel=prefetch href="http://www.unive.nl/">
<li class=g>
</ol>
----------------
(N.B. page on Google omits closing <body> tag)

I would expect HTML::TreeBuilder to parse the mark-up to allow the <link> element to appear inside an <ol> element. What it actually does (I think) is to assume the <link> element marks the end of the <ol> element, i.e.

<ol></ol><link rel=prefetch href="http://www.unive.nl/">

Yours sincerely,
Charles Woodcock

Hi,

The link is being parsed; it's also being moved into the HEAD, since this is the only place where link is legal. You would need to turn off implicit_tags to stop it "correcting" the output.

E.g. default, with implicit_tags:

$ perl -e 'use HTML::TreeBuilder;my $tree = HTML::TreeBuilder->new(); $tree->parse(qq|<ol><link rel=prefetch href="http://www.unive.nl/"><li class=g></ol>|); print("\n\n",$tree->as_HTML, "\n");'

<html><head><link href="http://www.unive.nl/" rel="prefetch" /></head><body><ol></ol><ul><li class="g"></ul></body></html>

E.g. without implicit_tags:

$ perl -e 'use HTML::TreeBuilder;my $tree = HTML::TreeBuilder->new(implicit_tags => 0); $tree->parse(qq|<ol><link rel=prefetch href="http://www.unive.nl/"><li class=g></ol>|); print("\n\n",$tree->as_HTML, "\n");'

<html><head></head><body></body><ol><link href="http://www.unive.nl/" rel="prefetch" /><li class="g"></ol></html>

You really can't run this on a full HTML page, since without implicit_tags the head and body get duplicated. There is a ticket, https://rt.cpan.org/Ticket/Display.html?id=33063, about making that optional.

Cheers, Jeff.

Nothing to be done on this ticket.