Subject: Handle <unclosed </tags
The other day, I received a spam e-mail with a text/html body part like
this:
==============================================================
blah blah<br><br
<a href=http://domain/path.html target=_blank>Go!</a><br><p>blah
==============================================================
My spam filter failed to parse the href URL from the message body due to
the unclosed "<br" tag. Closing it causes HTML::Parser to correctly
parse the URL.
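
For illustration, a pre-processing step along the following lines could
close such tags before the body is handed to the parser. This is only a
sketch, written in Python for brevity: the helper name
close_unclosed_tags and the regular expression are assumptions of this
example and are not part of HTML::Parser. It treats any tag that runs
into the next "<" without a ">" in between as unclosed, so it may
mangle bodies that contain a literal "<" followed by a letter in running
text, and it does not fix an unclosed tag at the very end of the input.

```python
import re

# Hypothetical heuristic: a tag opener ("<" or "</" plus a letter) followed by
# any characters other than "<" and ">", directly before the next "<".
UNCLOSED_TAG = re.compile(r"(</?[A-Za-z][^<>]*)(?=<)")


def close_unclosed_tags(html: str) -> str:
    """Insert the missing ">" just before the "<" that cuts a tag short."""
    return UNCLOSED_TAG.sub(r"\1>", html)


if __name__ == "__main__":
    body = (
        "blah blah<br><br\n"
        "<a href=http://domain/path.html target=_blank>Go!</a><br><p>blah"
    )
    # The repaired body can then be passed to the HTML parser as usual.
    print(close_unclosed_tags(body))
```

The lookahead leaves the following "<" in place, so the next tag is
examined in turn and consecutive unclosed tags are each repaired.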
I noticed that http://search.cpan.org/dist/HTML-Parser/Parser.pm#BUGS says:
«Unclosed start or end tags, e.g. "<tt<b>...</b</tt>" are not recognized.»
However, I don't understand what this implies. Was it a conscious
decision not to support unclosed tags, or has there simply been no use
case for a fix so far?
I tested how various browsers handle the HTML code from the spam message
above:
At least the following render the link despite the preceding broken
"<br" tag: Firefox 3, Konqueror from KDE 3.5.9, Safari 3 & 4, Mail.app
At least the following do NOT render the link: IE 6, Opera 9.63
I'd appreciate it if an option could be added to HTML::Parser to
recognize unclosed tags.