Bug #84526 for HTML-Tree: HTML5 Parsing

Tue Apr 09 08:54:58 2013 cafe01 [...] gmail.com - Ticket created

Subject:	HTML5 Parsing
Date:	Tue, 9 Apr 2013 09:54:21 -0300
To:	bug-html-tree [...] rt.cpan.org
From:	Cafe Avila Gratz <cafe01 [...] gmail.com>

First of all, thank you for this great module. Now the issue.

I'm using HTML::TreeBuilder (version 5.03) to parse this HTML snippet:

    <header><h1>foo</h1><p>bar</p></header>

And the dump() of it is:

    <html> @0 (IMPLICIT)
      <head> @0.0 (IMPLICIT)
      <body> @0.1 (IMPLICIT)
        <h1> @0.1.0
          "foo"
        <p> @0.1.1
          "bar"
      <header> @0.2

$tree->guts->as_HTML() is:

    <div><h1>foo</h1><p>bar<header></header></div>

instead of

    <div><header><h1>foo</h1><p>bar</header></div>

Tested with this code:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new( );
    $tree->ignore_unknown(0);
    $tree->parse_content('<header><h1>foo</h1><p>bar</p></header>');
    $tree->dump;
    printf "HTML:\n%s\n", $tree->guts->as_HTML;

Thank you.
Carlos Fernando Avila Gratz.

Sun Apr 14 16:29:24 2013 jeffrey.lerman [...] gmail.com - Ticket #84632: Ticket created

Subject:	<section> tags are ignored by HTML::TreeBuilder unless ignore_unknown is set
Date:	Sun, 14 Apr 2013 13:29:05 -0700
To:	bug-HTML-Tree [...] rt.cpan.org
From:	Jeffrey Lerman <jeffrey.lerman [...] gmail.com>

I am using:

    HTML-Tree-5.03 <http://search.cpan.org/%7Ecjm/HTML-Tree-5.03/>
    Perl 5.10.1
    Debian Linux stable

I found that HTML::TreeBuilder seems to ignore <section> tags unless ignore_unknown is set to false. With the default value (true), section tags are omitted in as_HTML output; when it is turned off, they appear properly.

Thanks,
--Jeff Lerman

Sat Aug 13 20:07:38 2016 david.storrs [...] gmail.com - Ticket #116940: Ticket created

Subject:	HTML::Element deletes 'article' tags
Date:	Sat, 13 Aug 2016 17:07:23 -0700
To:	bug-HTML-Tree [...] rt.cpan.org
From:	David Storrs <david.storrs [...] gmail.com>

HTML::Element will remove 'article' tags from a page.

    #!/usr/bin/env perl

    use warnings;
    use strict;

    use LWP::UserAgent;
    use HTML::TreeBuilder 5 -weak;

    my $root = HTML::TreeBuilder->new_from_content(
        LWP::UserAgent->new(
            'User-Agent' =>
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:47.0) Gecko/20100101 Firefox/47.0'
        )->get("https://slashdot.org")->decoded_content
    );

    # No output
    print "Article tags: ", $_->as_HTML . "\n\n" for $root->look_down(
        _tag => "article",
    );
    print "$_\n" for $root->as_HTML =~ m|(</article>)|g;    # Also no output

    # Now go View Source in your browser on https://slashdot.org. Note
    # that there are multiple <article> tags, one for each story.

Sat Aug 13 20:12:04 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Ticket #116940: Reference by ticket #116941 added

Sat Aug 13 20:12:52 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Ticket #116940: Correspondence added

On Sat Aug 13 20:07:38 2016, david.storrs@gmail.com wrote:

> HTML::Element will remove 'article' tags from a page.
>
> #!/usr/bin/env perl
>
> use warnings;
> use strict;
>
> use LWP::UserAgent;
> use HTML::TreeBuilder 5 -weak;
>
> my $root = HTML::TreeBuilder->new_from_content(
>     LWP::UserAgent->new(
>         'User-Agent' =>
>             'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:47.0) Gecko/20100101 Firefox/47.0'
>     )->get("https://slashdot.org")->decoded_content
> );
>
> # No output
> print "Article tags: ", $_->as_HTML . "\n\n" for $root->look_down(
>     _tag => "article",
> );
> print "$_\n" for $root->as_HTML =~ m|(</article>)|g;    # Also no output
>
> # Now go View Source in your browser on https://slashdot.org. Note
> # that there are multiple <article> tags, one for each story.

It applies to any tag type it does not know about. It deletes <abc> tags, too.
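
The behaviour described here can be checked with a short script. This is only a sketch, assuming HTML::TreeBuilder 5.x with its default settings; the sample markup and the variable names are illustrative and do not come from the ticket.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative markup: <abc> is not an element HTML::TreeBuilder knows about.
my $html = '<p>before <abc>inside</abc> after</p>';

my $tree = HTML::TreeBuilder->new;    # ignore_unknown is left at its default
$tree->parse_content($html);

# Count the <abc> elements that made it into the parse tree.
my @abc = $tree->look_down( _tag => 'abc' );
printf "abc elements in tree: %d\n", scalar @abc;

# Serialise the tree to see what survived parsing.
print $tree->as_HTML, "\n";
```

If the reports in this ticket hold, the count should be zero, because the unknown start and end tags are skipped rather than turned into elements.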

Sat Aug 13 20:12:52 2016 The RT System itself - Ticket #116940: Status changed from 'new' to 'open'

Sat Aug 13 23:23:21 2016 david.storrs [...] gmail.com - Ticket #116940: Correspondence added

Subject:	Re: [rt.cpan.org #116940] HTML::Element deletes 'article' tags
Date:	Sat, 13 Aug 2016 20:23:04 -0700
To:	bug-HTML-Tree [...] rt.cpan.org
From:	David Storrs <david.storrs [...] gmail.com>

<article> is a legitimate HTML5 tag. Can HTML::Element not handle HTML5 web pages? Here's a list of the new tags:

http://www.w3schools.com/html/html5_new_elements.asp

For that matter, why does H::E get a vote on what tags are legit and what are not? As long as the HTML is syntactically valid, it should just give it to me. Tools should not take positive action to make the job harder.

On Sat, Aug 13, 2016 at 5:12 PM, Father Chrysostomos via RT <bug-HTML-Tree@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=116940 >
>
> On Sat Aug 13 20:07:38 2016, david.storrs@gmail.com wrote:
> > HTML::Element will remove 'article' tags from a page.
> >
> > #!/usr/bin/env perl
> >
> > use warnings;
> > use strict;
> >
> > use LWP::UserAgent;
> > use HTML::TreeBuilder 5 -weak;
> >
> > my $root = HTML::TreeBuilder->new_from_content(
> >     LWP::UserAgent->new(
> >         'User-Agent' =>
> >             'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:47.0) Gecko/20100101 Firefox/47.0'
> >     )->get("https://slashdot.org")->decoded_content
> > );
> >
> > # No output
> > print "Article tags: ", $_->as_HTML . "\n\n" for $root->look_down(
> >     _tag => "article",
> > );
> > print "$_\n" for $root->as_HTML =~ m|(</article>)|g;    # Also no output
> >
> > # Now go View Source in your browser on https://slashdot.org. Note
> > # that there are multiple <article> tags, one for each story.
>
> It applies to any tag type it does not know about. It deletes <abc> tags,
> too.

Tue Aug 16 19:47:35 2016 Jeff.Fearn [...] gmail.com - Ticket #116940: Merged into ticket #84526

Tue Aug 16 19:47:35 2016 Jeff.Fearn [...] gmail.com - Ticket #84632: Merged into ticket #84526

Tue Aug 16 19:48:13 2016 Jeff.Fearn [...] gmail.com - Ticket #84632: Merged into ticket #84526

Tue Aug 16 19:48:13 2016 Jeff.Fearn [...] gmail.com - Merged into ticket #84526

Tue Aug 16 19:52:22 2016 Jeff.Fearn [...] gmail.com - Correspondence added

HTML::Tagset needs to support HTML5 before this module can, specifically these functions:

    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isKnown = \%HTML::Tagset::isKnown;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::canTighten = \%HTML::Tagset::canTighten;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isHeadElement = \%HTML::Tagset::isHeadElement;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isBodyElement = \%HTML::Tagset::isBodyElement;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isPhraseMarkup = \%HTML::Tagset::isPhraseMarkup;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isHeadOrBodyElement = \%HTML::Tagset::isHeadOrBodyElement;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isList = \%HTML::Tagset::isList;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isTableElement = \%HTML::Tagset::isTableElement;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isFormElement = \%HTML::Tagset::isFormElement;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::p_closure_barriers = \@HTML::Tagset::p_closure_barriers;
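
The aliases listed above point at lookup tables in HTML::Tagset, so one way to see why a tag is treated as unknown is to query %HTML::Tagset::isKnown directly. The following is a rough sketch; the tag list is just the three elements named in the merged tickets plus <p> as a control, and the result depends on the installed HTML::Tagset version.

```perl
use strict;
use warnings;
use HTML::Tagset;

# Tags reported in the merged tickets, plus <p> as a known-good control.
my @tags = qw(p header section article);

# Record whether each tag appears in the table TreeBuilder aliases as isKnown.
my %status;
for my $tag (@tags) {
    $status{$tag} = $HTML::Tagset::isKnown{$tag} ? 'known' : 'unknown';
    printf "%-8s %s\n", $tag, $status{$tag};
}
```

Any tag reported as unknown here is one that HTML::TreeBuilder would skip under its default ignore_unknown setting.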

Tue Aug 16 19:52:22 2016 The RT System itself - Status changed from 'new' to 'open'

Wed Aug 17 09:34:01 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Tue Aug 16 19:52:22 2016, jfearn wrote:

> HTML::Tagset needs to support HTML5 before this module can,
> specifically these functions:
>
> lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isKnown =
> \%HTML::Tagset::isKnown;

A more future-compatible approach would be to treat any unknown elements the same way as, say, <a>, so it will not matter if newer HTML versions add new elements. Things will Just Work.

Mon Aug 22 06:30:33 2016 jefffearn [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #84526] HTML5 Parsing
Date:	Mon, 22 Aug 2016 20:30:10 +1000
To:	bug-HTML-Tree [...] rt.cpan.org
From:	Jeff Fearn <jefffearn [...] gmail.com>

On 17/08/2016 11:34 PM, Father Chrysostomos via RT wrote:

> Queue: HTML-Tree
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=84526 >
>
> On Tue Aug 16 19:52:22 2016, jfearn wrote:
>> HTML::Tagset needs to support HTML5 before this module can,
>> specifically these functions:
>>
>> lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isKnown =
>> \%HTML::Tagset::isKnown;
>
> A more future-compatible approach would be to treat any unknown elements the same way as, say, <a>, so it will not matter if newer HTML versions add new elements. Things will Just Work.

This would invalidate the assumption that by default we output valid HTML.

I don't think using ignore_unknown for stuff we don't support yet is a huge burden on the user.

Cheers, Jeff.
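
For reference, the ignore_unknown workaround mentioned above might look like the following. This is a minimal sketch assuming HTML::TreeBuilder 5.x; the sample markup is illustrative. As the original report in this ticket shows, the element is kept in the tree but may not end up where the source markup put it, so the dump is worth inspecting.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->ignore_unknown(0);    # keep elements that HTML::Tagset does not list
$tree->parse_content('<section><p>hello</p></section>');

# The unknown element should now be represented in the parse tree.
my @sections = $tree->look_down( _tag => 'section' );
printf "section elements in tree: %d\n", scalar @sections;

# Show where the parser actually placed it.
$tree->dump;
```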

Mon Aug 22 06:39:18 2016 Jeff.Fearn [...] gmail.com - Dependency on ticket #67299 added

Mon Aug 22 16:59:21 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Mon Aug 22 06:30:33 2016, jefffearn@gmail.com wrote:

> On 17/08/2016 11:34 PM, Father Chrysostomos via RT wrote:
> > Queue: HTML-Tree
> > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=84526 >
> >
> > On Tue Aug 16 19:52:22 2016, jfearn wrote:
> >> HTML::Tagset needs to support HTML5 before this module can,
> >> specifically these functions:
> >>
> >> lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isKnown =
> >> \%HTML::Tagset::isKnown;
> >
> > A more future-compatible approach would be to treat any unknown
> > elements the same way as, say, <a>, so it will not matter if newer
> > HTML versions add new elements. Things will Just Work.
>
> This would invalidate the assumption that by default we output valid
> HTML.
>
> I don't think using ignore_unknown for stuff we don't support yet is a
> huge burden on the user.
>
> Cheers, Jeff.

Just as a data point, web browsers allow custom tag names. Try putting this snippet in an HTML file:

    <script>
    var x = document.createElement("div");
    x.innerHTML="<fwack><fwomp squink=squonk></fwomp></fwack>";
    alert(x.innerHTML);
    alert(x.firstChild.tagName)
    </script>

Mon Aug 22 19:35:11 2016 jefffearn [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #84526] HTML5 Parsing
Date:	Tue, 23 Aug 2016 09:34:57 +1000
To:	bug-HTML-Tree [...] rt.cpan.org
From:	Jeff Fearn <jefffearn [...] gmail.com>

On 23/08/2016 6:59 AM, Father Chrysostomos via RT wrote:

> Queue: HTML-Tree
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=84526 >
>
> On Mon Aug 22 06:30:33 2016, jefffearn@gmail.com wrote:
>> On 17/08/2016 11:34 PM, Father Chrysostomos via RT wrote:
>>> Queue: HTML-Tree
>>> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=84526 >
>>>
>>> On Tue Aug 16 19:52:22 2016, jfearn wrote:
>>>> HTML::Tagset needs to support HTML5 before this module can,
>>>> specifically these functions:
>>>>
>>>> lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isKnown =
>>>> \%HTML::Tagset::isKnown;
>>>
>>> A more future-compatible approach would be to treat any unknown
>>> elements the same way as, say, <a>, so it will not matter if newer
>>> HTML versions add new elements. Things will Just Work.
>>
>> This would invalidate the assumption that by default we output valid
>> HTML.
>>
>> I don't think using ignore_unknown for stuff we don't support yet is a
>> huge burden on the user.
>>
>> Cheers, Jeff.
>
> Just as a data point, web browsers allow custom tag names. Try putting this snippet in an HTML file:
>
> <script>
> var x = document.createElement("div");
> x.innerHTML="<fwack><fwomp squink=squonk></fwomp></fwack>";
> alert(x.innerHTML);
> alert(x.firstChild.tagName)
> </script>

Yeah, HTML5 has the notion of "custom elements", which are not validatable and require custom JavaScript to handle.

https://html.spec.whatwg.org/multipage/scripting.html#custom-elements

Note I've contacted Andy via RT to see if he has any input on updating HTML::Tagset; if so, I might give it a crack as I find the problem interesting.

Cheers, Jeff.

Wed Oct 18 07:54:43 2017 DRAXIL [...] cpan.org - Correspondence added

> Note I've contacted Andy via RT to see if he has any input on updating
> HTML::Tagset

Did you get any update on this? HTML::Tagset doesn't seem to have had a CPAN release since 2008 which doesn't bode well for this issue.

Wed Oct 18 15:39:26 2017 KENTNL [...] cpan.org - Correspondence added

On 2017-10-19 00:54:43, DRAXIL wrote:

> > Note I've contacted Andy via RT to see if he has any input on
> > updating
> > HTML::Tagset
>
> Did you get any update on this? HTML::Tagset doesn't seem to have had
> a CPAN release since 2008 which doesn't bode well for this issue.

The discussion of most relevance is this one:

https://rt.cpan.org/Ticket/Display.html?id=67299

--
- CPAN kentnl@cpan.org
- Gentoo Perl Maintainer kentnl@gentoo.org ( perl@gentoo.org )