
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 53658
Status: open
Priority: 0/
Queue: HTML-Tree

People
Owner: Jeff.Fearn [...] gmail.com
Requestors: garu [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 3.23
Fixed in: 3.23



Subject: HTML::Element::as_text collapses internal whitespace
Hi there! Thanks for this awesome module!

However, it appears the ->as_text method for HTML::Element doesn't work as expected. From the docs:

-----------8<------------
$h->as_text()

Returns a string consisting of only the text parts of the element's descendants.

Text under 'script' or 'style' elements is never included in what's returned. If C<skip_dels> is true, then text content under "del" nodes is not included in what's returned.

$h->as_trimmed_text(...)

This is just like as_text(...) except that leading and trailing whitespace is deleted, and any internal whitespace is collapsed.
----------->8------------

although this is true for leading/trailing spaces, all internal whitespace is collapsed (which should only happen in ->as_trimmed_text, right?).

small proof of concept:
=======================

perl -MHTML::TreeBuilder -E 'say HTML::TreeBuilder->new_from_content(q[<div>foo bar</div>])->as_text'
foo bar

hope this helps!
From: jfearn [...] redhat.com
On Wed Jan 13 12:39:05 2010, GARU wrote:
> Hi there! Thanks for this awesome module!
>
> However, it appears the ->as_text method for HTML::Element doesn't work
> as expected. From the docs:
>
> -----------8<------------
> $h->as_text()
>
> Returns a string consisting of only the text parts of the element's
> descendants.
>
> Text under 'script' or 'style' elements is never included in what's
> returned. If C<skip_dels> is true, then text content under "del"
> nodes is not included in what's returned.
>
> $h->as_trimmed_text(...)
>
> This is just like as_text(...) except that leading and trailing
> whitespace is deleted, and any internal whitespace is collapsed.
>
> ----------->8------------
>
> although this is true for leading/trailing spaces, all internal
> whitespace is collapsed (which should only happen in ->as_trimmed_text,
> right?).
>
> small proof of concept:
> =======================
>
> perl -MHTML::TreeBuilder -E 'say
> HTML::TreeBuilder->new_from_content(q[<div>foo bar</div>])->as_text'
> foo bar
>
> hope this helps!
I think this may be HTML::Parser doing this, so HTML::TreeBuilder never sees the original white space.

perl -MHTML::TreeBuilder -E 'say HTML::TreeBuilder->new_from_content(q[<pre>foo    bar</pre>])->as_text'
foo    bar

Both those have 4 spaces between foo and bar, maybe RT will eat it :}

But for verbatim tags as_text acts correctly. Not sure if that is very helpful :)

Cheers, Jeff.
I'm pretty sure that HTML::Parser is consolidating white space when parsing, leaving only one space, on non-verbatim tags.
I'm pretty sure that HTML::Parser reports the white space just as it was found in the document parsed.
On Sun Apr 25 06:39:11 2010, GAAS wrote:
> I'm pretty sure that HTML::Parser reports the white space just as it
> was found in the document parsed.
You are correct, there is an option, no_space_compacting, which defaults off, that controls this.

$ perl -MHTML::TreeBuilder -E '$tree = HTML::TreeBuilder->new(no_space_compacting => 1); $tree->parse(q[<div>foo    bar</div>]); print $tree->as_text, "\n"'
foo    bar

There are 4 spaces in the input & output, as expected.

I'm not sure why this was defaulted off, but changing the default now would have an unknown impact on current users :(
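The same comparison can be written as a short script. This is only a minimal sketch: it sets the option through the no_space_compacting accessor documented for HTML::TreeBuilder instead of passing it to the constructor, and the variable names are made up for illustration. Going by the results reported in this ticket, the default parse should show a single space and the second parse should keep all four.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Four spaces between "foo" and "bar", as in the example in this ticket.
my $html = '<div>foo    bar</div>';

# Default settings: whitespace in non-verbatim elements is compacted
# while the document is being parsed.
my $compacting = HTML::TreeBuilder->new;
$compacting->parse($html);
$compacting->eof;
my $compact_text = $compacting->as_text;

# Same input, with compaction switched off before parsing starts.
my $preserving = HTML::TreeBuilder->new;
$preserving->no_space_compacting(1);
$preserving->parse($html);
$preserving->eof;
my $verbatim_text = $preserving->as_text;

# Brackets make the amount of whitespace visible in the output.
print "default:             [$compact_text]\n";
print "no_space_compacting: [$verbatim_text]\n";
```

Because the option acts at parse time, it has to be set before parse() is called; changing it afterwards does not bring back whitespace that was already compacted.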
On Sat Apr 24 20:12:20 2010, jfearn wrote:
> You are correct, there is an option, no_space_compacting, which defaults
> off, that controls this.
>
> $ perl -MHTML::TreeBuilder -E '$tree =
> HTML::TreeBuilder->new(no_space_compacting => 1);
> $tree->parse(q[<div>foo    bar</div>]); print $tree->as_text, "\n"'
> foo    bar
>
> There are 4 spaces in the input & output, as expected.
>
> I'm not sure why this was defaulted off, but changing the default now
> would have an unknown impact on current users :(
Hey guys, thanks a lot for all the help on this issue.

For the record, as the original requestor, I don't have a problem with ->as_text keeping its current behavior (i.e. not fixing it), as long as the documentation is updated on as_text to mention this and include reference to the 'no_space_compacting' attribute, while also adjusting the '...and any internal whitespace is collapsed.' part of the as_trimmed_text() right below.

If I may give a suggestion, I think it would be nice to have something like:

->as_text( no_space_compacting => 1 )

or even using another key, like (for example):

->as_text( internal_whitespace => 1 )

to return the text with internal whitespace. Please note, however, that this should be local to that particular returned value, not a global attribute such as the current 'no_space_compacting' - which is part of the reason I suggested a different name :)

Thanks again for all the help!
Hi Breno,

On Sun Apr 25 13:21:51 2010, GARU wrote:
> On Sat Apr 24 20:12:20 2010, jfearn wrote:
> > You are correct, there is an option, no_space_compacting, which defaults
> > off, that controls this.
> >
> > $ perl -MHTML::TreeBuilder -E '$tree =
> > HTML::TreeBuilder->new(no_space_compacting => 1);
> > $tree->parse(q[<div>foo    bar</div>]); print $tree->as_text, "\n"'
> > foo    bar
> >
> > There are 4 spaces in the input & output, as expected.
> >
> > I'm not sure why this was defaulted off, but changing the default now
> > would have an unknown impact on current users :(
>
> Hey guys, thanks a lot for all the help on this issue. For the record,
> as the original requestor, I don't have a problem with ->as_text keeping
> its current behavior (i.e. not fixing it), as long as the documentation
> is updated on as_text to mention this and include reference to the
> 'no_space_compacting' attribute, while also adjusting the '...and any
> internal whitespace is collapsed.' part of the as_trimmed_text() right
> below.
I agree that this isn't clear. One of the reasons for this is that as_text is part of HTML::Element, but no_space_compacting is part of HTML::TreeBuilder. HTML::Element, being a lower level module, has no idea no_space_compacting exists, so discussing it there seems a bit odd.

I'm looking into how to phrase 'some modules using HTML::Element may filter HTML when parsing it, check their options' ... OK, I'll work on it some more ;)
> If I may give a suggestion, I think it would be nice to have something
> like:
>
> ->as_text( no_space_compacting => 1 )
>
> or even using another key, like (for example):
>
> ->as_text( internal_whitespace => 1 )
>
> to return the text with internal whitespace. Please note, however, that
> this should be local to that particular returned value, not a global
> attribute such as the current 'no_space_compacting' - which is part of
> the reason I suggested a different name :)
This would require migrating some of the parse time behaviour in HTML::TreeBuilder into the output behaviour of HTML::Element. I'm not opposed to this, but it would impose a significant testing burden to avoid breaking existing uses.
> Thanks again for all the help!
Thanks for the positive feedback :)
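
For anyone who needs a per-call choice in the meantime, one possible workaround is to parse with no_space_compacting enabled and collapse whitespace only when a particular result should be collapsed. The sketch below assumes that approach; text_of and its collapse option are hypothetical names made up here and are not part of HTML::Element or HTML::TreeBuilder.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::Element): returns the element's
# text, collapsing runs of whitespace only when collapse => 1 is passed.
sub text_of {
    my ($element, %opt) = @_;
    my $text = $element->as_text;
    # Note: unlike the parse-time compaction, this also collapses
    # whitespace that came from verbatim elements such as <pre>.
    $text =~ s/[\n\r\f\t ]+/ /g if $opt{collapse};
    return $text;
}

# Keep the original whitespace in the tree so both forms stay available.
my $tree = HTML::TreeBuilder->new;
$tree->no_space_compacting(1);
$tree->parse('<div>foo    bar</div>');
$tree->eof;

my $raw       = text_of($tree);
my $collapsed = text_of($tree, collapse => 1);

print "raw:       [$raw]\n";
print "collapsed: [$collapsed]\n";
```

This leaves as_text itself unchanged, so existing callers are unaffected; the cost is that every tree has to be built with no_space_compacting turned on.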