Subject: HTML::TokeParser confused by self-closing tag without internal space
Hello,
I enjoy HTML::TokeParser, but today I noticed a flaw. When parsing
self-closing tags such as <br /> (which is technically XHTML), the parser
fails to identify the tag name properly when there is no internal space.
For <br />, the tag is correctly identified as "br" in the second
element of the token array returned by ->get_token. But for <br/> (no
internal space), the tag is identified as "br/" in that same element.
Note that the XHTML spec does not require self-closing tags to have an
internal space; see
http://www.w3.org/TR/xhtml1/#h-4.6
Here is a test case which demonstrates the problem:
use strict;
use warnings;
use HTML::TokeParser;

my $htmlf = "line 1 is here <br> Now line 2 <br /> Now line 3 <br/> Now line 4";
my $parsed = HTML::TokeParser->new(\$htmlf);
while (my $token = $parsed->get_token) {
    if ($token->[0] eq 'S') {
        # Start token layout: ["S", $tag, $attr, $attrseq, $text]
        print "start tag: " . $token->[1] . "(full text: '" . $token->[4] . "')\n";
    }
    elsif ($token->[0] eq 'E') {
        # End token layout: ["E", $tag, $text], so the source text is at index 2
        print "end tag: " . $token->[1] . "(full text: '" . $token->[2] . "')\n";
    }
}
This outputs:
start tag: br(full text: '<br>')
start tag: br(full text: '<br />')
start tag: br/(full text: '<br/>')
This mis-identification of the tag name can cause problems when I'm
trying to filter for certain "allowed tags", for example in a message
board post, and I have named "br" as an allowed tag. Now I must also
identify "br/" as an allowed tag.
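A possible workaround in the meantime is to strip a trailing slash from the
tag name before checking it against the allowed list. The following is only
a minimal sketch: normalize_tag_name and %allowed are hypothetical names made
up for illustration, not part of HTML::TokeParser, and the raw names in the
loop stand in for the values seen in $token->[1].

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TokeParser): remove a single
# trailing "/" that ends up on the tag name for input such as <br/>.
sub normalize_tag_name {
    my ($name) = @_;
    (my $clean = $name) =~ s{/\z}{};
    return $clean;
}

# Hypothetical allow-list for a message board filter.
my %allowed = map { $_ => 1 } qw(br b i);

# Stand-ins for the tag names returned in $token->[1].
for my $raw ('br', 'br/', 'script') {
    my $tag    = normalize_tag_name($raw);
    my $status = $allowed{$tag} ? 'allowed' : 'rejected';
    print "$raw -> $tag ($status)\n";
}
```

With this in place, only "br" needs to appear in the allowed list, and the
same check covers both <br /> and <br/>.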
Hope this makes sense!
-Ryan Tate
ryantate@ryantate.com