
This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 17962
Status: resolved
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: LGODDARD [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 3.19
Fixed in: (no value)



Subject: Mis-represents data.
Please see the Data::Dumper output of an HTML::TokeParser token below: compare $VAR1->[1]->{href} with $VAR1->[3]. The latter is correct. This is with the latest binary for Win32 ActivePerl, which I admit is an old version. I have no VC++ here, so I can't say whether this is still a bug in the current release.

$VAR1 = [
  'a',
  {
    'href' => '/index.php?currpage=2&days=1&jobtype=0&keywords=PERL〈=en&orderby=4&task=JobSearch&xc=0'
  },
  [
    'href'
  ],
  '<a href="/index.php?currpage=2&days=1&jobtype=0&keywords=PERL&lang=en&orderby=4&task=JobSearch&xc=0">'
];
I can't tell whether anything is wrong without a test case that includes the HTML you are parsing. Please provide a minimal program that demonstrates the bug.
From: lgoddard [...] cpan.org
On Sun Mar 12 17:34:10 2006, GAAS wrote:
> Can't tell if there is anything wrong without a test case that includes
> the HTML that you parse. Please provide a minimal program that
> demonstrates the bug.
I've attached a full example with perl code, raw HTML data, and the URI of the (dynamic) data source. Hope that helps. lee

Message body is not shown because it is too large.

The reason "&lang" is expanded is that it's an official HTML entity name; see http://www.w3.org/TR/REC-html40/sgml/entities.html#h-24.3.1

Browsers have traditionally expanded entities even if the trailing ";" is missing, but there seems to be an exception for the non-Latin-1 entities. I tested this piece of HTML in Firefox and Konqueror:

  <html>
  <body>
  <a href="foo?a=1&eth=1&times=3&lang=4&Gamma=5&lang;=6">foo &lang;&lang=</a>
  </body>
  </html>

Both expand "&eth", "&times" and "&lang;" into the corresponding character but leave "&lang" and "&Gamma" alone. Strangely enough, Firefox expands "&lang" outside of the attribute, so it actually plays by even more rules.

HTML is such a mess!
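The browser behaviour described above can be reproduced outside a browser. The following is only a sketch, assuming Python's standard `html.unescape` as a stand-in: it follows the HTML5 named-reference table, not HTML::Parser's own rules, so it illustrates the "Latin-1 names work without a semicolon, others do not" distinction rather than what HTML-Parser 3.19 does. The helper name `expand` is made up for this example.

```python
import html


def expand(fragment):
    """Decode character references in one fragment, HTML5-style."""
    return html.unescape(fragment)


# The individual query-string pieces from the test page above.
fragments = ["&eth=1", "&times=3", "&lang=4", "&Gamma=5", "&lang;=6"]

# Map each fragment to its decoded form so the two can be compared.
results = {fragment: expand(fragment) for fragment in fragments}

for fragment, decoded in results.items():
    status = "expanded" if decoded != fragment else "left alone"
    print(f"{fragment!r:12} -> {decoded!r:12} ({status})")
```

Under these rules "&eth" and "&times" are expanded without a ";", "&lang" and "&Gamma" are left alone, and "&lang;" is expanded. Unlike the browsers tested above, this function makes no distinction between attribute values and ordinary text.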
Subject: Re: [rt.cpan.org #17962] Mis-represents data.
Date: Tue, 21 Mar 2006 15:38:06 +0100
To: bug-HTML-Parser [...] rt.cpan.org
From: Lee Goddard <lee [...] leegoddard.net>
Gisle_Aas via RT wrote:
> <URL: http://rt.cpan.org/Ticket/Display.html?id=17962 >
>
> The reason "&lang" is expanded is that its an official HTML entity
> name; see http://www.w3.org/TR/REC-html40/sgml/entities.html#h-24.3.1
>
> Browsers has used to expand entities even if the trailing ";" is
> missing, but there seems to be an exception for the non-Latin1
> entities out-there. I tested this piece of HTML in Firefox/Konqeror:
>
>   <html>
>   <body>
>   <a href="foo?a=1&eth=1&times=3&lang=4&Gamma=5&lang;=6">foo &lang;&lang=</a>
>   </body>
>   </html>
>
> and they both expand "&eth", "&times" and "&lang;" into the
> corresponding char but leaves "&lang" and "&Gamma" alone. Strangely
> enough Firefox expands "&lang" outside of the attribute so it actually
> plays by even more rules.
>
> HTML is such a mess!
HTML: it's getting better all the time (couldn't get much worse), to coin a phrase... If only everyone would agree with the standard.

I don't have the energy to track down the URI spec today, but logically (HTML/logic: ha!): the semicolon in &lang; above ought to be URI-encoded, right? Otherwise it might be interpreted as a new-style delimiter, just as the ampersand was the old-style delimiter. What should happen when those two appear together, I dunno. Ho hum.

Any thoughts on how you might deal with the mess? My vote is to not look for entities in URIs...

Cheers
lee
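One way a page author can sidestep the ambiguity discussed in this thread is to percent-encode the semicolon in the query string and to write every literal ampersand in the attribute as "&amp;". The following is a rough sketch using Python's standard library, not anything taken from HTML::Parser; the helper name `href_attribute` and the parameter values are made up for illustration.

```python
import html
from urllib.parse import urlencode


def href_attribute(path, params):
    """Return (url, attribute_value) for a link to path with params."""
    # urlencode percent-encodes reserved characters, so ";" becomes "%3B".
    url = path + "?" + urlencode(params)
    # html.escape turns each "&" into "&amp;", so no parameter name can be
    # mistaken for an entity name when the attribute is parsed.
    return url, html.escape(url, quote=True)


# Hypothetical parameters, modelled on the test page in this ticket.
url, attr = href_attribute("foo", {"a": "1", "lang": "4", "lang;": "6"})

print('<a href="%s">foo</a>' % attr)

# Decoding the attribute value gives back the intended URL.
assert html.unescape(attr) == url
```

With the ampersands escaped, a parser that does expand entities in attribute values and one that does not both end up with the same URL, provided the latter still decodes "&amp;".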