Bug #18904 for HTML-Parser: HTML::TokeParser confused by self-closing tag without internal space

Tue Apr 25 00:45:21 2006 Guest - Ticket created

Subject: HTML::TokeParser confused by self-closing tag without internal space

Hello,

I enjoy HTML::TokeParser but just today noticed a flaw. When parsing self-closing tags like <br /> (which is technically XHTML), the parser fails to properly identify the tag when there is no internal space.

So, for <br /> the tag is correctly identified as "br" in the second element of the token array returned by ->get_token. But the tag <br/> (no internal space) is identified as "br/" in the second element of the token array returned by ->get_token.

Note that self-closing tags are not required to have an internal space in the XHTML spec, see

http://www.w3.org/TR/xhtml1/#h-4.6

Here is a test case which demonstrates the problem:

    use strict;
    use HTML::TokeParser;

    my $htmlf = "line 1 is here <br> Now line 2 <br /> Now line 3 <br/> Now line 4";

    my $parsed = HTML::TokeParser->new(\$htmlf);

    while (my $token = $parsed->get_token) {
        if ($token->[0] eq 'S') {
            # start-tag tokens are ["S", $tag, $attr, $attrseq, $text]
            print "start tag: " . $token->[1] . "(full text: '" . $token->[4] . "')\n";
        }
        elsif ($token->[0] eq 'E') {
            # end-tag tokens are ["E", $tag, $text]
            print "end tag: " . $token->[1] . "(full text: '" . $token->[2] . "')\n";
        }
    }

This outputs:

    start tag: br(full text: '<br>')
    start tag: br(full text: '<br />')
    start tag: br/(full text: '<br/>')

This mis-identification of the tag name can cause problems when I'm trying to filter for certain "allowed tags", for example in a message board post, and I have named "br" as an allowed tag. Now I must also identify "br/" as an allowed tag.

Hope this makes sense!

-Ryan Tate
ryantate@ryantate.com

Wed Apr 26 04:08:14 2006 GAAS [...] cpan.org - Status changed from 'new' to 'resolved'

Wed Apr 26 04:31:37 2006 Guest - Correspondence added

I've now uploaded 3.52 with some documentation tweaks that recommend enabling empty_element_tags for TokeParser. The reason it's not the default is that this isn't a backwards-compatible change. I tried to make it the default in 3.47, but this broke LWP's test suite, so I'm sure it has the potential to break other code as well.
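For readers hitting the same problem, here is a minimal sketch of what enabling that option might look like. It assumes the attribute is spelled empty_element_tags, as in the HTML::Parser documentation, and that it can be set on a TokeParser object before the first token is read. The helper name start_tag_names and the sample string are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample input reusing the three <br> spellings discussed in this ticket.
my $html = "line 1<br>line 2<br />line 3<br/>line 4";

# Hypothetical helper: collect the start-tag names found in $doc.
# $recognize_empty is passed to the empty_element_tags attribute
# that HTML::TokeParser inherits from HTML::Parser.
sub start_tag_names {
    my ($doc, $recognize_empty) = @_;
    my $p = HTML::TokeParser->new(\$doc)
        or die "cannot create parser";
    $p->empty_element_tags($recognize_empty);
    my @names;
    while (my $token = $p->get_token) {
        # Start-tag tokens are ["S", $tag, $attr, $attrseq, $text].
        push @names, $token->[1] if $token->[0] eq 'S';
    }
    return @names;
}

print "option off: ", join(",", start_tag_names($html, 0)), "\n";
print "option on:  ", join(",", start_tag_names($html, 1)), "\n";
```

With the option off, the last name should come back as "br/", matching the report above; with it on, all three should be reported as "br". Note that a recognized empty element tag also produces an artificial end-tag token, so code that counts or balances end tags may need adjusting.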

Wed Apr 26 04:31:37 2006 The RT System itself - Status changed from 'resolved' to 'open'

Wed Apr 26 04:34:02 2006 GAAS [...] cpan.org - Status changed from 'open' to 'resolved'

Tue Jul 04 10:00:37 2006 Guest - Correspondence added

From: Graham Purnell

In contradiction of your posting, if you look at the link you quote (http://www.w3.org/TR/xhtml1/#h-4.6) you will notice that the W3C XHTML specification most definitely DOES require a space before the trailing slash of empty elements "for authors who wish their XHTML documents to render on existing HTML user agents". This information is contained in appendix C.2.

I realise this information is of no help when parsing incorrectly structured XHTML documents; I'm just clarifying the assertion.

Tue Jul 04 10:00:39 2006 The RT System itself - Status changed from 'resolved' to 'open'

Tue Jul 04 20:52:32 2006 Guest - Correspondence added

On Tue Jul 04 10:00:37 2006, guest wrote:

> In contradiction of your posting, if you look at the link you quote
> (http://www.w3.org/TR/xhtml1/#h-4.6) you will notice that the W3C XHTML
> specification most definitely DOES require a space before the trailing
> slash of empty elements "for authors who wish their XHTML documents to
> render on existing HTML user agents".

Wrong. It's a guideline for browser compatibility. A guideline is not a requirement. I read this prior to my original post.

Tue Jul 04 20:53:04 2006 Guest - Correspondence added

> Wrong. It's a guideline for browser compatibility. A guideline is not a
> requirement. I read this prior to my original post.

-This was from me, ryantate@ryantate.com

Tue Jul 04 21:05:57 2006 Guest - Correspondence added

From: Ryan Tate <ryantate [...] ryantate.com>

On Tue Jul 04 10:00:37 2006, guest wrote:

> the W3C XHTML
> specification most definitely DOES require a space before the trailing
> slash of empty elements "for authors who wish their XHTML documents to
> render on existing HTML user agents".

To expand on my prior reply, there are several clear hints that Appendix C exists solely to provide information -- not rules -- to parties interested in being compatible with existing Web browsers.

One is that it is called a set of "Guidelines," as mentioned previously. Another is the bold text at the very start of the appendix stating "this appendix is informative." Contrast this with other sections of the spec which lead with "this section is normative."

Another hint is that the spec itself uses the br and hr tags self-closed with zero spaces under the all-caps header "CORRECT:". This is in section 4.6, and is why I linked to it.

> I realise this information is of no help when parsing incorrectly
> structured XHTML documents;

That's not what I'm trying to do. At all. Happy Independence Day. RT, in USA

Fri Jan 12 05:26:56 2007 GAAS [...] cpan.org - Status changed from 'open' to 'resolved'