
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 84526
Status: open
Priority: 0
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: cafe01 [...] gmail.com
david.storrs [...] gmail.com
jeffrey.lerman [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: HTML5 Parsing
Date: Tue, 9 Apr 2013 09:54:21 -0300
To: bug-html-tree [...] rt.cpan.org
From: Cafe Avila Gratz <cafe01 [...] gmail.com>
First of all, thank you for this great module. Now the issue.

I'm using HTML::TreeBuilder (version 5.03) to parse this HTML snippet:

    <header><h1>foo</h1><p>bar</p></header>

And the dump() of it is:

    <html> @0 (IMPLICIT)
      <head> @0.0 (IMPLICIT)
      <body> @0.1 (IMPLICIT)
        <h1> @0.1.0
          "foo"
        <p> @0.1.1
          "bar"
      <header> @0.2

$tree->guts->as_HTML() is:

    <div><h1>foo</h1><p>bar<header></header></div>

instead of

    <div><header><h1>foo</h1><p>bar</header></div>

Tested with this code:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new();
    $tree->ignore_unknown(0);
    $tree->parse_content('<header><h1>foo</h1><p>bar</p></header>');
    $tree->dump;
    printf "HTML:\n%s\n", $tree->guts->as_HTML;

Thank you.
Carlos Fernando Avila Gratz.
Subject: <section> tags are ignored by HTML::TreeBuilder unless ignore_unknown is set
Date: Sun, 14 Apr 2013 13:29:05 -0700
To: bug-HTML-Tree [...] rt.cpan.org
From: Jeffrey Lerman <jeffrey.lerman [...] gmail.com>
I am using:

    HTML-Tree-5.03 <http://search.cpan.org/%7Ecjm/HTML-Tree-5.03/>
    Perl 5.10.1
    Debian Linux stable

I found that HTML::TreeBuilder seems to ignore <section> tags unless ignore_unknown is set to false. With the default value (true), section tags are omitted in as_HTML output; when it is turned off, they appear properly.

Thanks,
--Jeff Lerman
Subject: HTML::Element deletes 'article' tags
Date: Sat, 13 Aug 2016 17:07:23 -0700
To: bug-HTML-Tree [...] rt.cpan.org
From: David Storrs <david.storrs [...] gmail.com>
HTML::Element will remove 'article' tags from a page.

    #!/usr/bin/env perl

    use warnings;
    use strict;

    use LWP::UserAgent;
    use HTML::TreeBuilder 5 -weak;

    my $root = HTML::TreeBuilder->new_from_content(
        LWP::UserAgent->new(
            'User-Agent' =>
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:47.0) Gecko/20100101 Firefox/47.0'
        )->get("https://slashdot.org")->decoded_content
    );

    # No output
    print "Article tags: ", $_->as_HTML . "\n\n" for $root->look_down(
        _tag => "article",
    );
    print "$_\n" for $root->as_HTML =~ m|(</article>)|g;    # Also no output

    # Now go View Source in your browser on https://slashdot.org. Note
    # that there are multiple <article> tags, one for each story.
On Sat Aug 13 20:07:38 2016, david.storrs@gmail.com wrote:
> HTML::Element will remove 'article' tags from a page.
It applies to any tag type it does not know about. It deletes <abc> tags, too.
Subject: Re: [rt.cpan.org #116940] HTML::Element deletes 'article' tags
Date: Sat, 13 Aug 2016 20:23:04 -0700
To: bug-HTML-Tree [...] rt.cpan.org
From: David Storrs <david.storrs [...] gmail.com>
<article> is a legitimate HTML5 tag. Can HTML::Element not handle HTML5 web pages? Here's a list of the new tags: http://www.w3schools.com/html/html5_new_elements.asp

For that matter, why does H::E get a vote on what tags are legit and what are not? As long as the HTML is syntactically valid, it should just give it to me. Tools should not take positive action to make the job harder.

On Sat, Aug 13, 2016 at 5:12 PM, Father Chrysostomos via RT <bug-HTML-Tree@rt.cpan.org> wrote:
> It applies to any tag type it does not know about. It deletes <abc> tags,
> too.
HTML::Tagset needs to support HTML5 before this module can, specifically these functions:

    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isKnown = \%HTML::Tagset::isKnown;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::canTighten = \%HTML::Tagset::canTighten;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isHeadElement = \%HTML::Tagset::isHeadElement;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isBodyElement = \%HTML::Tagset::isBodyElement;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isPhraseMarkup = \%HTML::Tagset::isPhraseMarkup;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isHeadOrBodyElement = \%HTML::Tagset::isHeadOrBodyElement;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isList = \%HTML::Tagset::isList;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isTableElement = \%HTML::Tagset::isTableElement;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isFormElement = \%HTML::Tagset::isFormElement;
    lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::p_closure_barriers = \@HTML::Tagset::p_closure_barriers;
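The lines listed above are typeglob assignments: each one makes a name in HTML::TreeBuilder an alias for a table that lives in HTML::Tagset, rather than a copy of it. A minimal sketch of that mechanism follows. It uses core Perl only, and the packages My::Tagset and My::Builder and the helper knows() are hypothetical stand-ins for illustration, not the real modules. It is meant to show why a tag missing from the source table is also missing on the aliased side, and says nothing about how the real parser would place such a tag in the tree.

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in for HTML::Tagset: a table of known tag names.
    package My::Tagset;
    our %isKnown = map { $_ => 1 } qw(html head body div p a h1);
}

{
    # Hypothetical stand-in for HTML::TreeBuilder.
    package My::Builder;
    our %isKnown;

    # Alias (do not copy) the table, in the same style as the lines above.
    *My::Builder::isKnown = \%My::Tagset::isKnown;

    # Hypothetical helper: report whether a tag name is in the table.
    sub knows {
        my ($tag) = @_;
        return exists $isKnown{ lc $tag } ? 1 : 0;
    }
}

# The alias sees only what the source table contains.
print "article known before: ", My::Builder::knows('article'), "\n";

# Adding an entry to the source table is visible through the alias.
$My::Tagset::isKnown{article} = 1;
print "article known after:  ", My::Builder::knows('article'), "\n";
```

Because the two names share one hash, any change to the source table is seen through the alias without touching the aliasing side, which is consistent with the point that the tag tables are where HTML5 support would have to start.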
On Tue Aug 16 19:52:22 2016, jfearn wrote:
> HTML::Tagset needs to support HTML5 before this module can,
> specifically these functions:
>
> lib/HTML/TreeBuilder.pm:*HTML::TreeBuilder::isKnown =
> \%HTML::Tagset::isKnown;
A more future-compatible approach would be to treat any unknown elements the same way as, say, <a>, so it will not matter if newer HTML versions add new elements. Things will Just Work.
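The fallback idea suggested above can be sketched as a table lookup that returns the entry for <a> whenever a tag is not listed. This is only an illustration in core Perl: the table %tag_props, its "inline" and "empty" fields, and the function props_for() are all hypothetical and do not reflect how HTML::Tagset or HTML::TreeBuilder actually store tag information.

```perl
use strict;
use warnings;

# Hypothetical per-tag properties, invented for this sketch.
my %tag_props = (
    a  => { inline => 1, empty => 0 },
    p  => { inline => 0, empty => 0 },
    br => { inline => 1, empty => 1 },
);

# Return the properties for a tag, falling back to those of <a>
# when the tag is not in the table.
sub props_for {
    my ($tag) = @_;
    my $key = lc $tag;
    return exists $tag_props{$key} ? $tag_props{$key} : $tag_props{a};
}

# Known tags use their own entry; unknown ones share the <a> entry.
for my $tag (qw(p article fwack)) {
    my $props = props_for($tag);
    printf "%-8s inline=%d empty=%d\n", $tag, $props->{inline}, $props->{empty};
}
```

With a fallback of this kind, a tag added by a later HTML version would be kept and handled like <a> instead of being dropped, at the cost of no longer being able to tell a new standard element from a typo.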
Subject: Re: [rt.cpan.org #84526] HTML5 Parsing
Date: Mon, 22 Aug 2016 20:30:10 +1000
To: bug-HTML-Tree [...] rt.cpan.org
From: Jeff Fearn <jefffearn [...] gmail.com>
On 17/08/2016 11:34 PM, Father Chrysostomos via RT wrote:
> A more future-compatible approach would be to treat any unknown
> elements the same way as, say, <a>, so it will not matter if newer
> HTML versions add new elements. Things will Just Work.
This would invalidate the assumption that by default we output valid HTML.

I don't think using ignore_unknown for stuff we don't support yet is a huge burden on the user.

Cheers, Jeff.
On Mon Aug 22 06:30:33 2016, jefffearn@gmail.com wrote:
> This would invalidate the assumption that by default we output valid
> HTML.
>
> I don't think using ignore_unknown for stuff we don't support yet is a
> huge burden on the user.
>
> Cheers, Jeff.
Just as a data point, web browsers allow custom tag names. Try putting this snippet in an HTML file:

    <script>
    var x = document.createElement("div");
    x.innerHTML="<fwack><fwomp squink=squonk></fwomp></fwack>";
    alert(x.innerHTML);
    alert(x.firstChild.tagName)
    </script>
Subject: Re: [rt.cpan.org #84526] HTML5 Parsing
Date: Tue, 23 Aug 2016 09:34:57 +1000
To: bug-HTML-Tree [...] rt.cpan.org
From: Jeff Fearn <jefffearn [...] gmail.com>
On 23/08/2016 6:59 AM, Father Chrysostomos via RT wrote:
> Just as a data point, web browsers allow custom tag names. Try putting this snippet in an HTML file:
>
> <script>
> var x = document.createElement("div");
> x.innerHTML="<fwack><fwomp squink=squonk></fwomp></fwack>";
> alert(x.innerHTML);
> alert(x.firstChild.tagName)
> </script>
Yeah, HTML5 has the notion of "custom elements", which are not validatable and require custom JavaScript to handle.

https://html.spec.whatwg.org/multipage/scripting.html#custom-elements

Note I've contacted Andy via RT to see if he has any input on updating HTML::Tagset. If so, I might give it a crack, as I find the problem interesting.

Cheers, Jeff.
> Note I've contacted Andy via RT to see if he has any input on updating
> HTML::Tagset
Did you get any update on this? HTML::Tagset doesn't seem to have had a CPAN release since 2008, which doesn't bode well for this issue.
On 2017-10-19 00:54:43, DRAXIL wrote:
>
> > Note I've contacted Andy via RT to see if he has any input on
> > updating
> > HTML::Tagset
>
> Did you get any update on this? HTML::Tagset doesn't seem to have had
> a CPAN release since 2008 which doesn't bode well for this issue.

The discussion of most relevance is this one:

https://rt.cpan.org/Ticket/Display.html?id=67299


-- 
- CPAN kentnl@cpan.org
- Gentoo Perl Maintainer kentnl@gentoo.org ( perl@gentoo.org )