
This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 18904
Status: resolved
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: ryantate [...] ryantate.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 3.51
Fixed in: (no value)



Subject: HTML::TokeParser confused by self-closing tag without internal space
Hello,

I enjoy HTML::TokeParser but just today noticed a flaw. When parsing self-closing tags like <br /> (which is technically XHTML), the parser fails to properly identify the tag when there is no internal space.

So, for <br /> the tag is correctly identified as "br" in the second element of the token array returned by ->get_token. But the <br/> tag (no internal space) is identified as "br/" in the second element of the token array returned by ->get_token.

Note that self-closing tags are not required to have an internal space in the XHTML spec, see

http://www.w3.org/TR/xhtml1/#h-4.6

Here is a test case which demonstrates the problem:

    use strict;
    use HTML::TokeParser;

    my $htmlf = "line 1 is here <br> Now line 2 <br /> Now line 3 <br/> Now line 4";

    my $parsed = HTML::TokeParser->new(\$htmlf);

    while (my $token = $parsed->get_token) {
        if ($token->[0] eq 'S') {
            # start-tag tokens are ["S", $tag, $attr, $attrseq, $text]
            print "start tag: " . $token->[1] . "(full text: '" . $token->[4] . "')\n";
        }
        elsif ($token->[0] eq 'E') {
            # end-tag tokens are ["E", $tag, $text], so the text is element 2
            print "end tag: " . $token->[1] . "(full text: '" . $token->[2] . "')\n";
        }
    }

This outputs:

    start tag: br(full text: '<br>')
    start tag: br(full text: '<br />')
    start tag: br/(full text: '<br/>')

This mis-identification of the tag name can cause problems when I'm trying to filter for certain "allowed tags", for example in a message board post, and I have named "br" as an allowed tag. Now I must also identify "br/" as an allowed tag.

Hope this makes sense!

-Ryan Tate
ryantate@ryantate.com
I've now uploaded 3.52 with some documentation tweaks that recommend enabling empty_element_tags for TokeParser. It isn't the default because that would not be a backwards compatible change. I tried to make it the default in 3.47, but this broke LWP's test suite, so I'm sure it has the potential to break other code as well.
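For readers landing on this ticket, here is a minimal sketch of the recommended setting applied to the reporter's test string. It assumes an HTML::Parser release that provides the empty_element_tags attribute; the variable names ($html, $p, @start_tags) are illustrative only. With the attribute enabled, all three variants should be reported with the tag name "br".

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Same three <br> variants as in the original report.
my $html = "line 1 is here <br> Now line 2 <br /> Now line 3 <br/> Now line 4";

my $p = HTML::TokeParser->new(\$html) or die "Can't create parser";

# Recognize XML-style empty element tags such as <br/>.
# This is off by default for backwards compatibility.
$p->empty_element_tags(1);

my @start_tags;
while (my $token = $p->get_token) {
    # Start-tag tokens are ["S", $tag, $attr, $attrseq, $text].
    push @start_tags, $token->[1] if $token->[0] eq 'S';
}

print "start tags: @start_tags\n";
```

Note that when this attribute is enabled, a recognized empty element tag also produces an artificial end-tag token with empty text, so code that counts or balances end tags may need adjusting.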
From: Graham Purnell
In contradiction of your posting, if you look at the link you quote (http://www.w3.org/TR/xhtml1/#h-4.6) you will notice that the W3C XHTML specification most definitely DOES require a space before the trailing slash of empty elements "for authors who wish their XHTML documents to render on existing HTML user agents". This information is contained in Appendix C.2.

I realise this information is of no help when parsing incorrectly structured XHTML documents; I'm just clarifying the assertion.
On Tue Jul 04 10:00:37 2006, guest wrote:
> In contradiction of your posting, If you look at the link you quote
> (http://www.w3.org/TR/xhtml1/#h-4.6) you will notice that the W3C XHTML
> specification most definitely DOES require a space before the trailing
> slash of empty elements "for authors who wish their XHTML documents to
> render on existing HTML user agents".
Wrong. It's a guideline for browser compatibility. A guideline is not a requirement. I read this prior to my original post.
-This was from me, ryantate@ryantate.com
From: Ryan Tate <ryantate [...] ryantate.com>
On Tue Jul 04 10:00:37 2006, guest wrote:
> the W3C XHTML
> specification most definitely DOES require a space before the trailing
> slash of empty elements "for authors who wish their XHTML documents to
> render on existing HTML user agents".
To expand on my prior reply, there are several clear hints that Appendix C exists solely to provide information -- not rules -- to parties interested in being compatible with existing Web browsers. One is that it is called a set of "Guidelines," as mentioned previously. Another is the bold text at the very start of the appendix stating "this appendix is informative." Contrast this with other sections of the spec which lead with "this section is normative." Another hint is that the spec itself uses the br and hr tags self-closed with zero spaces under the all-caps header "CORRECT:". This is in section 4.6, and is why I linked to it.
> I realise this information is of no help when parsing incorrectly
> structured XHTML documents;
That's not what I'm trying to do. At all.

Happy Independence Day.

RT, in USA