Subject: HTML whitespace in text
Date: Thu, 8 Sep 2011 12:56:47 +0200
To: <bug-HTML-Parser [...] rt.cpan.org>
From: "Plevier, Camiel" <C.Plevier [...] dutchspace.nl>
Dear Sirs,
HTML markup may be laid out with arbitrary whitespace to keep the
source readable. HTML::Parser does not collapse or remove this
whitespace when it returns text.
Section 9.1 of the HTML 4 specification
(http://www.w3.org/TR/html4/struct/text.html#h-9.1) states: "user agents
should collapse input white space sequences when producing output
inter-word space". It is unclear to me whether leading and trailing
whitespace should be removed entirely, but that is at least what my
browser does.
Please find attached a short Perl script that demonstrates the problem.
Best regards,
Camiel Plevier
C.M. Plevier MEng
Digital System Developer
Sr. Engineer Software
Dutch Space <http://www.dutchspace.nl/Default.asp?LangType=1033>
an EADS Astrium company
<<testHtmlTextWhiteSpace.pl>>
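# Demonstrates a violation of http://www.w3.org/TR/html4/struct/text.html#whitespace
# "user agents should collapse input white space
# sequences when producing output inter-word space"
use strict;
use warnings;
use HTML::Parser;

# Print every text chunk as the parser delivers it (entities decoded).
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { print "text: @_\n" }, "dtext" ],
);
$p->parse("<p>
Bla
bla...
</p>");
$p->eof;    # signal end of document so any buffered text is flushed
# Should have resulted in output "text: Bla bla..."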
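
A possible workaround on the caller's side is to normalise each text chunk before using it. The following is only a minimal sketch of the collapsing rule quoted above, not HTML::Parser behaviour: collapse_ws is a hypothetical helper name, and trimming the leading and trailing space is an assumption based on what a browser displays. It handles only space, tab, CR, LF and form feed, leaving out the zero-width space that the specification also lists.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Parser: collapse every run of
# HTML white space into one space, then trim both ends.
sub collapse_ws {
    my ($text) = @_;
    $text =~ s/[ \t\r\n\f]+/ /g;   # any run of white space becomes one space
    $text =~ s/^ | $//g;           # drop the single space left at either end
    return $text;
}

# The text chunk from the test script above, as the handler receives it.
print "text: ", collapse_ws("\nBla\nbla...\n"), "\n";   # prints "text: Bla bla..."
```

Inside the text handler this could be called as collapse_ws($_[0]). Two caveats apply: text inside <pre> should be left untouched, and trimming every chunk would drop the space between a word and an adjacent inline element, so a real handler would probably need to track context.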
-- ----------------------------------------------------------------------
Dutch Space B.V., Leiden, The Netherlands. Chamber of Commerce number 28086907