
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 66498
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: seb [...] taureau.webhop.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in:
  • 3.23
  • 4.1
Fixed in: 5.03



Subject: HTML::Element::as_text() removes space between words
Hi,

See this:

$ echo '<p>eee</p><p>fff</p>' > /tmp/test.html
$ perl -MHTML::TreeBuilder -e '$t = HTML::TreeBuilder->new; $t->parse_file("/tmp/test.html"); $t->eof; print $t->as_text, "\n"'
eeefff

But it should print "eee fff". Because of this bug, $t->as_text =~ /\bfff\b/ will evaluate to false while it should be true.

Same thing with no_space_compacting:

$ perl -MHTML::TreeBuilder -e '$t = HTML::TreeBuilder->new; $t->no_space_compacting(1); $t->parse_file("/tmp/test.html"); $t->eof; print $t->as_text, "\n"'
eeefff

Hi,

as_text does not add white space; it only ever compacts or removes it. There are other modules that offer intelligently formatted text output. HTML::FormatText is based on HTML::TreeBuilder and offers this functionality.

$ echo '<p>eee</p><p>fff</p>' > /tmp/test.html
$ perl -MHTML::TreeBuilder -e '$t = HTML::TreeBuilder->new; $t->ignore_ignorable_whitespace(0); $t->parse_file("/tmp/test.html"); $t->eof; require HTML::FormatText; $f = HTML::FormatText->new(leftmargin => 0, rightmargin => 50); print $f->format($t)'
eee fff

FYI these two work if the space exists in the document:

$ echo '<p>eee</p><p> fff</p>' > /tmp/test.html
$ perl -MHTML::TreeBuilder -e '$t = HTML::TreeBuilder->new; $t->parse_file("/tmp/test.html"); $t->eof; print $t->as_text, "\n"'
eee fff

$ echo '<p>eee</p> <p>fff</p>' > /tmp/test.html
$ perl -MHTML::TreeBuilder -e '$t = HTML::TreeBuilder->new; $t->ignore_ignorable_whitespace(0); $t->parse_file("/tmp/test.html"); $t->eof; print $t->as_text, "\n"'
eee fff

Cheers,
Jeff.

Hi,

Maybe the manpage should clarify this behavior. It really caused me a lot of trouble. A quick Google search shows that I'm not the only one who misunderstood the behavior of as_text():

http://www.issociate.de/board/post/222857/HTML::TreeBuilder_ignore_ignorable_problems.html
http://cpanforum.com/threads/657

When I read the manpage of HTML::Element, it's clear to me that as_text() doesn't provide any kind of formatting, so I would expect it to return semantically correct text. I don't want formatted text, so why would I use a formatting module like HTML::FormatText (which, by the way, seems not to work with Unicode)?

Returning "eee fff" instead of "eeefff" has nothing to do with presentation; it is a matter of content. "eeefff" is a word that does not exist in the HTML page. The "htext" script that comes with HTML::Parser does the right thing: it removes the presentation and gives the content. It displays "eee fff" correctly.

I think that either as_text() or the manpage is wrong. I'm certain that other people will face this issue (and maybe lose considerable amounts of work, like me) if both the function and the manpage are left as is.

Regards,
Sebastien

There is nothing in the man page that says any white space will be added to the input, and what you want requires adding white space. It works as documented: it returns the text segments _exactly_ as they are in the content.

To add white space only in the "right places" we'd need to be able to differentiate in-line from block tags. E.g. <i>Fu</i><b>Bar</b> is a single word, FuBar, with visual markup; it shouldn't become two words. AFAIK none of the modules we use provide this information, and building it would be exhausting. There was some talk a while back about getting HTML::Tagset to provide this information, but I don't know if that effort got anywhere.

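For anyone who lands on this ticket needing "eee fff", the block/in-line distinction described above can be approximated outside of as_text. What follows is only a minimal sketch, not anything HTML-Tree does itself. It assumes that HTML::Tagset's %isPhraseMarkup hashset is a good enough list of in-line elements, which may not hold for every document. The helper names spaced_text and _collect are hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Tagset;

# Hypothetical helper, not part of HTML-Tree: concatenate the text
# segments under $node, padding every element that HTML::Tagset does
# NOT list as phrase-level (in-line) markup with a space on each side.
sub _collect {
    my ($node) = @_;
    return $node unless ref $node;    # plain text segment
    my $text = join '', map { _collect($_) } $node->content_list;
    return $HTML::Tagset::isPhraseMarkup{ $node->tag } ? $text : " $text ";
}

# Hypothetical wrapper: collect the text, then normalize the whitespace.
sub spaced_text {
    my ($tree) = @_;
    my $text = _collect($tree);
    $text =~ s/\s+/ /g;     # collapse padding and any source whitespace
    $text =~ s/^ | $//g;    # trim both ends
    return $text;
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<p>eee</p><p>fff</p><p><i>Fu</i><b>Bar</b></p>');
print spaced_text($tree), "\n";    # eee fff FuBar
$tree->delete;
```

Unlike as_text, this sketch does not skip the contents of script or style elements, and it collapses all whitespace, so it is unsuitable where no_space_compacting matters.
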
I've updated the docs to make it clear that as_text does not add whitespace not found in the original document (just released as 5.03).