
This queue is for tickets about the HTML-Tagset CPAN distribution.

Report information
The Basics
Id: 67299
Status: open
Priority: 0/
Queue: HTML-Tagset

People
Owner: Nobody in particular
Requestors: KENTNL [...] cpan.org
MIROD [...] cpan.org
Cc: JEB [...] cpan.org
AdminCc:

Bug Information
Severity: Important
Broken in: 3.20
Fixed in: (no value)



CC: JEB [...] cpan.org
Subject: Need an HTML5 update
I believe HTML::Tagset needs to be updated for HTML5. That includes new attributes (there is already a ticket on that) but also new elements. As long as they are not listed in HTML::Tagset the new elements are silently discarded.

If you want a patch, let me know (I don't need this myself, but I'd put the time into it if it's useful and if the patch gets applied).

An example of the problem is shown by the following code:

  perl -MHTML::TreeBuilder -e'print HTML::TreeBuilder->new_from_content( "<html><head></head><body><section><p>foo</p></body></html>")->as_HTML'
  <html><head></head><body><p>foo</body></html>

If 'section' is added as a tag in %isBodyElement (and in @p_closure_barriers just to be on the safe side), then the result is better:

  <html><head></head><body><section><p>foo</section></body></html>

__
mirod
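For readers who cannot patch Tagset.pm, the same idea can be tried at run time. This is only a sketch: it assumes that HTML::TreeBuilder reads the HTML::Tagset package variables when it parses, so that changing them after loading has an effect. The ticket itself only describes editing the module source, and the %isKnown entry is an extra precaution that the ticket does not mention.

```perl
use strict;
use warnings;
use HTML::Tagset;
use HTML::TreeBuilder;

# Assumption: modifying these package variables after HTML::Tagset is
# loaded is enough for HTML::TreeBuilder to treat the tag as a body
# element. Check the output against your installed versions.
for my $tag (qw(section)) {
    $HTML::Tagset::isBodyElement{$tag} = 1;
    $HTML::Tagset::isKnown{$tag}       = 1;    # precaution, not from the ticket
    push @HTML::Tagset::p_closure_barriers, $tag;
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head></head><body><section><p>foo</p></section></body></html>'
);
my $html = $tree->as_HTML;
print $html, "\n";
$tree->delete;    # free the tree once it is no longer needed
```

The change is global to the running process, so any other code that relies on the HTML4-only tables will see the extra tag as well.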
Any chance of getting HEADER and SECTION added as body elements? It would fix tools like Web::Scraper that rely on this module. If this were on github I'd happily submit a pull request.
If we're going to handle HTML5, it's not going to be through a "Can't you just do X". Adding elements at random and pushing out something isn't a good long-term solution.
On Thu Nov 03 23:13:47 2011, PETDANCE wrote:
> If we're going to handle HTML5, it's not going to be through a "Can't
> you just do X". Adding elements at random and pushing out something
> isn't a good long-term solution.
Thanks!
On Fri Nov 04 03:13:47 2011, PETDANCE wrote:
> If we're going to handle HTML5, it's not going to be through a "Can't
> you just do X". Adding elements at random and pushing out something
> isn't a good long-term solution.
Andy, do you know what the status is of the patch I sent to the libwww mailing list on May 20 with support for HTML 5 elements? I saw no response from you or anyone else. Here 'tis again:

--- 3.20/Tagset.pm 2011-05-20 11:50:41.987911127 +0800
+++ new/Tagset.pm 2011-05-20 12:24:38.519497374 +0800
@@ -95,6 +95,7 @@
  'a' => ['href'],
  'applet' => ['archive', 'codebase', 'code'],
  'area' => ['href'],
+ 'audio' => ['src'],
  'base' => ['href'],
  'bgsound' => ['src'],
  'blockquote' => ['cite'],
@@ -115,10 +116,13 @@
  'object' => ['classid', 'codebase', 'data', 'archive', 'usemap'],
  'q' => ['cite'],
  'script' => ['src', 'for'],
+ 'source' => ['src'],
  'table' => ['background'],
  'td' => ['background'],
  'th' => ['background'],
  'tr' => ['background'],
+ 'track' => ['src'],
+ 'video' => ['poster'],
  'xmp' => ['href'],
 );

@@ -185,6 +189,7 @@
   wbr nobr blink font basefont bdo
   spacer embed noembed
+  time mark ruby rp rt bdi bdo
 );  # had: center, hr, table
@@ -253,7 +258,7 @@
 =cut

 %isFormElement = map {; $_ => 1 }
-  qw(input select option optgroup textarea button label);
+  qw(input select option optgroup textarea button label keygen output progress meter );

 =head2 hashset %HTML::Tagset::isBodyElement
@@ -285,6 +290,10 @@
   table
   center
   form
+
+  section nav article aside hgroup figure
+  param video audio source track canvas
+  details summary command menu
  ),
  keys %isFormElement,
  keys %isPhraseMarkup, # And everything phrasal
@@ -313,7 +322,7 @@
 %isKnown = (%isHeadElement, %isBodyElement,
   map{; $_=>1 }
    qw( head body html
-       frame frameset noframes
+       frame frameset noframes figcaption
        ~comment ~pi ~directive ~literal
 ));
 # that should be all known tags ever ever
I'm not going to "just" add some tags. HTML::Tagset needs a better thought-out plan than just adding tags, because then we're not being HTML4.
I'm glad that won't be an excuse to stop updating this module after "HTML5" is done. http://blog.whatwg.org/html-is-the-new-html5
On Fri Nov 04 00:43:34 2011, LEEDO wrote:
> I'm glad that won't be an excuse to stop updating this module after
> "HTML5" is done.
I'm not sure what your point is, but if you want to discuss a solution without sarcasm, I'm glad to do it.
Subject: [rt.cpan.org #67299]
Date: Fri, 10 Oct 2014 20:54:26 +0100
To: bug-HTML-Tagset [...] rt.cpan.org
From: redneb [...] gmx.com
I recently released a Haskell library that provides the functionality of %HTML::Tagset::linkElements. It includes support for HTML5. Trying to find a good source with a list of HTML5 link elements/attributes I discovered that there is a proprietary program called XMLmind XML Editor 6.0.0 whose Evaluation Edition [1] contains an XML Schema file for HTML 5 which is BSD licensed. You can grab a copy of that file from [2]. Additionally, I wrote a small Haskell program [3] that extracts all link elements from that file. If you don't want to run it yourself, here's its output:

a href
area href
audio src
base href
blockquote cite
button formaction
command icon
del cite
embed src
form action
html manifest
iframe src
img src
input formaction
input src
ins cite
link href
object data
q cite
script src
source src
track src
video poster
video src

This is the complete list of tags/attributes whose XML Schema type is xs:anyURI.

[1] http://www.xmlmind.com/xmleditor/download.shtml
[2] https://github.com/redneb/islink/blob/master/scripts/data/xhtml5.xsd
[3] https://github.com/redneb/islink/blob/master/scripts/from_xsd.hs
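As an illustration only, the pairs listed above could be folded into a table shaped like %HTML::Tagset::linkElements (tag name mapped to a list of attribute names). The sketch below uses a small stand-in hash, %link_elements, seeded with three entries that appear in the 3.20 patch context earlier in this ticket. The helper name merge_link_pairs is made up for this example and is not part of HTML::Tagset.

```perl
use strict;
use warnings;

# Tag/attribute pairs copied from the list in this comment.
my @pairs = qw(
    a href            area href         audio src       base href
    blockquote cite   button formaction command icon    del cite
    embed src         form action       html manifest   iframe src
    img src           input formaction  input src       ins cite
    link href         object data       q cite          script src
    source src        track src         video poster    video src
);

# Stand-in for a few existing entries (hypothetical variable; the values
# come from the 3.20 patch context quoted earlier in the ticket).
my %link_elements = (
    'a'      => ['href'],
    'object' => ['classid', 'codebase', 'data', 'archive', 'usemap'],
    'script' => ['src', 'for'],
);

# Add each pair unless that attribute is already listed for the tag.
sub merge_link_pairs {
    my ($table, @list) = @_;
    while (my ($tag, $attr) = splice @list, 0, 2) {
        my $attrs = $table->{$tag} ||= [];
        push @$attrs, $attr unless grep { $_ eq $attr } @$attrs;
    }
    return $table;
}

merge_link_pairs(\%link_elements, @pairs);

for my $tag (sort keys %link_elements) {
    print "$tag: @{ $link_elements{$tag} }\n";
}
```

Existing attributes are kept and nothing is removed, so HTML4-only entries survive the merge. With the real module loaded, the same helper could presumably be given \%HTML::Tagset::linkElements instead of the stand-in, at the cost of changing the table for every user in the process.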
Hey Andy, did you have anything in mind for a better way for doing this for HTML5? I have a couple of related bugs opened for HTML::TreeBuilder and thought I might give it a crack. Cheers, Jeff.
I hate to say "me too" - but me too.

Yes, HTML has become a moving target, but some things are stable; despite the "deprecated" and "obsolete" status of some elements, the fact is that browsers don't remove them. So it agglomerates stuff.

I don't quite understand the comment from a few years back "we need a plan" - "can't just add tags because then we're not HTML4". Things kept being added to HTML4 browsers when HTML5 was being, er, evolved. And HTML5 is following the same track - it doesn't end, it just evolves. I guess it's now "WHATWG HTML Living Standard"...

So, what's wrong with adding the tags that exist in the wild? It's not perfect, but then the ticket has been open for almost 7 years; HTML is out there...

My interest is as a user of HTML::TreeBuilder. I may be missing why sticking with the "HTML" set of tags is advantageous. Is purity in some sense trumping the practical problem of keeping up with today's (and tomorrow's) content?
Subject: Re: [rt.cpan.org #67299] Need an HTML5 update
Date: Wed, 10 May 2017 16:13:04 -0500
To: bug-HTML-Tagset [...] rt.cpan.org
From: Andy Lester <andy [...] petdance.com>
> So, what's wrong with adding the tags that exist in the wild?
Because people who expect HTML::Tagset to be HTML4 will have it changed out from under them.
I've just spent a while hunting for "how to parse HTML5 with TreeBuilder", and came across this thread.

Ideas for Tagset to support both HTML4 and HTML5:

1) HTML::Tagset::v4, HTML::Tagset::v5 modules (with ::Tagset defaulting to v4, if we need to not confuse users assuming 4), load v5 via setting/variable.

2) One module, that internally outputs v4 or v5 lists based on a setting/variable

3) Fork the whole thing for v5 and leave existing code as is

4) ... ?

I'm also volunteering to help out, the various HTML parsing modules have quite a number of reported bugs that could do with tidying.
> 1) HTML::Tagset::v4, HTML::Tagset::v5 modules (with ::Tagset
> defaulting to v4, if we need to not confuse users assuming 4), load v5
> via setting/variable.
>
> 2) One module, that internally outputs v4 or v5 lists based on a
> setting/variable
>
> 3) Fork the whole thing for v5 and leave existing code as is
>
> 4) ... ?
>
> I'm also volunteering to help out, the various HTML parsing modules
> have quite a number of reported bugs that could do with tidying.
Those all sound like fine possibilities. I'd love to see someone take action to get it updated, since it's clearly something people want.
On Sun Sep 08 01:03:10 2019, PETDANCE wrote:
> > 1) HTML::Tagset::v4, HTML::Tagset::v5 modules (with ::Tagset
> > defaulting to v4, if we need to not confuse users assuming 4), load
> > v5 via setting/variable.
> >
> > 2) One module, that internally outputs v4 or v5 lists based on a
> > setting/variable
> >
> > 3) Fork the whole thing for v5 and leave existing code as is
> >
> > 4) ... ?
> >
> > I'm also volunteering to help out, the various HTML parsing modules
> > have quite a number of reported bugs that could do with tidying.
>
> Those all sound like fine possibilities. I'd love to see someone take
> action to get it updated, since it's clearly something people want.
Thanks for the comment - is there a repo for this distro? I couldn't see one on your github account.
Subject: Re: [rt.cpan.org #67299] Need an HTML5 update
Date: Mon, 9 Sep 2019 08:44:19 -0500
To: bug-HTML-Tagset [...] rt.cpan.org
From: Andy Lester <andy [...] petdance.com>
> On Sep 9, 2019, at 3:42 AM, Jess Robinson via RT <bug-HTML-Tagset@rt.cpan.org> wrote:
>
> Thanks for the comment - is there a repo for this distro? I couldn't see one on your github account.
As far as I know, there is no repo.
On Mon Sep 09 09:44:42 2019, andy@petdance.com wrote:
> > On Sep 9, 2019, at 3:42 AM, Jess Robinson via RT
> > <bug-HTML-Tagset@rt.cpan.org> wrote:
> >
> > Thanks for the comment - is there a repo for this distro? I couldn't
> > see one on your github account.
>
> As far as I know, there is no repo.
I took the gitpan one. This is my initial attempt at v5 support https://github.com/castaway/HTML-Tagset/tree/feature/support_v5
Are you suggesting that it should get released as a new HTML::Tagset? What's different about it? Where's the changelog? What will the effects on current users be if they upgrade? I can't see a diff of the repo because you rearranged files.
On Sat Nov 16 14:51:36 2019, PETDANCE wrote:
> Are you suggesting that it should get released as a new HTML::Tagset?
> What's different about it? Where's the changelog? What will the
> effects on current users be if they upgrade? I can't see a diff of the
> repo because you rearranged files.
I was assuming folk would read the code (it's not huge). I still have to write doc bits like the changelog (not claiming it's finished at all); I was hoping folk following this ticket would look and give some initial opinions.

Specifically what's different: HTML::Tagset itself loads either HTML::Tagset::v4 or HTML::Tagset::v5, defaulting to v4, loading v5 if $HTML::Tagset::HTML_VERSION is set to 'v5' - see new test for how.
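To make the v4-by-default idea concrete, here is a rough, self-contained sketch of that kind of switch. It is not the code from the branch: the tables are tiny made-up subsets, $HTML_VERSION here is a plain stand-in for $HTML::Tagset::HTML_VERSION, is_body_element is a hypothetical helper, and the version is looked up on every call rather than at module load time.

```perl
use strict;
use warnings;

# Tiny illustrative tables; the real per-version modules would carry the
# full element lists.
my %body_v4 = map { $_ => 1 } qw(p div table form);
my %body_v5 = ( %body_v4, map { $_ => 1 } qw(section nav article aside) );

my %BODY_ELEMENTS = ( v4 => \%body_v4, v5 => \%body_v5 );

# Stand-in for $HTML::Tagset::HTML_VERSION: unset means "behave as HTML4".
our $HTML_VERSION;

# Hypothetical helper: consult the table for the currently selected version.
sub is_body_element {
    my ($tag) = @_;
    my $version = ( defined $HTML_VERSION && $HTML_VERSION eq 'v5' ) ? 'v5' : 'v4';
    return $BODY_ELEMENTS{$version}{$tag} ? 1 : 0;
}

print "default: section is ", ( is_body_element('section') ? '' : 'not ' ), "a body element\n";
{
    local $HTML_VERSION = 'v5';    # opt in for this scope only
    print "v5: section is ", ( is_body_element('section') ? '' : 'not ' ), "a body element\n";
}
```

Defaulting to v4 means existing callers see no change unless they opt in, which speaks to the earlier concern about HTML4 users having the tables changed out from under them.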
Right now I'm interested in diffs vs. the original. The explanation of the changes is also crucial. I will go and put the code in a repo on github so that we can easily diff it and make our way through the changes. Also, I think we will need to give the module a new version number to indicate that there's been a change in the API.
There's now a repo at https://github.com/petdance/html-tagset

Can you please make your changes against the dev branch there so that we can easily see what the diffs are?

Thanks,
Andy
On Sun Nov 17 16:08:24 2019, PETDANCE wrote:
> There's now a repo at https://github.com/petdance/html-tagset
>
> Can you please make your changes against the dev branch there so that
> we can easily see what the diffs are?
>
> Thanks,
> Andy
That was fun, hope I didn't mess it up. Now sent as a PR on /dev as requested - NB it still needs some work; the PR is to allow ease of comparison etc.
RT-Send-CC: redneb [...] gmx.com, JEB [...] cpan.org
For anyone following this ticket, we're discussing the changes for a v5.0.0 at https://github.com/petdance/html-tagset/issues/2