This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 70804
Status: rejected
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: C.Plevier [...] dutchspace.nl
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: HTML whitespace in text
Date: Thu, 8 Sep 2011 12:56:47 +0200
To: <bug-HTML-Parser [...] rt.cpan.org>
From: "Plevier, Camiel" <C.Plevier [...] dutchspace.nl>
Dear Sirs,

The HTML markup can be arbitrarily arranged with whitespace to allow readability of the code. HTML::Parser does not remove this when returning text.

http://www.w3.org/TR/html4/struct/text.html#h-9.1 states: "user agents should collapse input white space sequences when producing output inter-word space".

It is unclear to me whether leading and trailing whitespace should be removed entirely. At least that is what my browser does.

Please find attached a piece of perl that demonstrates the problem.

Best regards,
Camiel Plevier

C.M. Plevier MEng
Digital System Developer
Sr. Engineer Software
Dutch Space <http://www.dutchspace.nl/Default.asp?LangType=1033>
an EADS Astrium company
Dutch Space B.V., Leiden, The Netherlands. Chamber of Commerce number 28086907

<<testHtmlTextWhiteSpace.pl>>

# violation of http://www.w3.org/TR/html4/struct/text.html#whitespace
# "user agents should collapse input white space
# sequences when producing output inter-word space"
use HTML::Parser;
my $p = new HTML::Parser(text_h => [ sub { print "text: @_\n"; }, "dtext"]);
$p->parse("<p> Bla bla... </p>");
# Should have resulted in output "text: Bla bla..."

HTML::Parser does not collapse white space. That's intentional. One of its design goals is to be able to filter HTML with minimal edits. To collapse white space correctly, we would have to understand the meaning of all tags. For that, you need to use a higher-level module like HTML-Tree.
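
For callers who only need the simple case from the report, the collapsing can be done in the text handler itself. The following is a minimal sketch, not part of HTML::Parser: `collapse_ws` is a hypothetical helper name, and it assumes every text chunk may be collapsed and trimmed on its own. That assumption is wrong for content such as `<pre>`, and for text that is split by inline tags, which is the reason given above for leaving this to a higher-level module.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not provided by HTML::Parser): collapse each run of
# HTML white space (space, tab, LF, CR, FF) to one space, then trim both ends.
sub collapse_ws {
    my ($text) = @_;
    $text =~ s/[ \t\n\r\f]+/ /g;
    $text =~ s/^ //;
    $text =~ s/ $//;
    return $text;
}

my @chunks;
my $p = HTML::Parser->new(
    api_version => 3,
    # "dtext" passes the text with entities already decoded.
    text_h => [ sub { push @chunks, collapse_ws(shift) }, "dtext" ],
);
# Deliver each run of text as a single chunk instead of in pieces.
$p->unbroken_text(1);

$p->parse("<p>  Bla \n  bla...  </p>");
$p->eof;

# Skip chunks that consisted of white space only.
print "text: $_\n" for grep { length } @chunks;
```

Because the handler sees only text and never the surrounding tags, this approach is reasonable for extracting plain words from simple markup, but it should not be used to rewrite HTML that will be rendered again.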