This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 72975
Status: stalled
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: stas [...] sysd.org
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: 4.2



Subject: newline separators between block elements in as_text()
Consider the following HTML sample:

    <p> <span>AAA</span> BBB </p> <h2>CCC</h2> DDD

The HTML::Element::as_text() method stringifies it as "AAABBBCCCDDD". Although this is correct, it is far from how a "real" browser renders the markup. links(1), lynx(1) and w3m(1) break lines this way:

    AAA​BBB
    CCC
    DDD​​

The attached patch tries to implement the same behavior in the as_text() method. The value of $/ is inserted in place of line breaks, and "\x{200b}" (Unicode zero-width space) separates text from adjacent inline elements. (y/\x{200b}//d can be used to collapse the text completely, or even y/\x{200b}/\n/ when one is sure that CSS makes a <span> tag act as a block.)

I'm not sure whether as_text() returning strings with "\n" would break existing code; at the very least, 'building.t' had to be patched. I would be glad to hear your opinions.
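The two post-processing idioms mentioned above can be sketched as follows. This is only an illustration: the input string is a hypothetical example of what the patched as_text() might return for the sample (assuming $/ is "\n"), not output captured from the patch itself.

```perl
use strict;
use warnings;

# Hypothetical return value of the patched as_text(), assuming $/ eq "\n":
# a zero-width space (U+200B) after the inline <span>, newlines after blocks.
my $text = "AAA\x{200b}BBB\nCCC\nDDD";

# Drop every ZWSP: inline elements are glued back to the adjacent text.
( my $collapsed = $text ) =~ tr/\x{200b}//d;

# Turn every ZWSP into a newline: treat inline elements as if they were blocks.
( my $as_blocks = $text ) =~ tr/\x{200b}/\n/;

binmode STDOUT, ':encoding(UTF-8)';
print "collapsed:\n$collapsed\n\n";    # AAABBB / CCC / DDD
print "as blocks:\n$as_blocks\n";      # AAA / BBB / CCC / DDD
```

Both variants copy the string first, so the original return value stays available if the caller wants to keep the separators.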
Subject: as_text.patch
diff -adNru HTML-Tree-4.2.orig/lib/HTML/Element.pm HTML-Tree-4.2/lib/HTML/Element.pm
--- HTML-Tree-4.2.orig/lib/HTML/Element.pm  2011-04-06 05:37:54.000000000 -0300
+++ HTML-Tree-4.2/lib/HTML/Element.pm  2011-12-05 14:07:36.560782121 -0200
@@ -166,6 +166,26 @@
 
 my $nillio = [];
 
+# http://en.wikipedia.org/wiki/HTML_element#Block_elements
+my $block_tags = {
+    map { $_ => 1 } qw(
+        p
+        h1 h2 h3 h4 h5 h6
+        dl dt dd
+        ol ul li
+        dir
+        address
+        blockquote
+        center
+        del
+        div
+        hr
+        ins
+        noscript script
+        pre
+    )
+};
+
 *HTML::Element::emptyElement = \%HTML::Tagset::emptyElement; # legacy
 *HTML::Element::optionalEndTag = \%HTML::Tagset::optionalEndTag; # legacy
 *HTML::Element::linkElements = \%HTML::Tagset::linkElements; # legacy
@@ -1773,10 +1793,24 @@
             $text .= shift @pile;
         }
         else { # it's a ref -- traverse under it
-            unshift @pile, @{ $this->{'_content'} || $nillio }
-                unless ( $tag = ( $this = shift @pile )->{'_tag'} ) eq 'style'
-                    or $tag eq 'script'
-                    or ( $skip_dels and $tag eq 'del' );
+            $this = shift @pile;
+            $tag = $this->{'_tag'};
+            my @rest = @{ $this->{'_content'} || $nillio };
+
+            if ( exists $block_tags->{$tag} ) {
+                push @rest, $/;
+            }
+            elsif ( $tag eq 'br' ) {
+                push @rest, $/;
+            }
+            else {
+                push @rest, "\x{200b}"; # zero-width space (ZWSP)
+            }
+
+            unshift @pile, @rest
+                unless $tag eq 'style'
+                    or $tag eq 'script'
+                    or ( $skip_dels and $tag eq 'del' );
         }
     }
     return $text;
diff -adNru HTML-Tree-4.2.orig/t/building.t HTML-Tree-4.2/t/building.t
--- HTML-Tree-4.2.orig/t/building.t  2011-04-06 05:37:54.000000000 -0300
+++ HTML-Tree-4.2/t/building.t  2011-12-05 14:09:55.985039039 -0200
@@ -52,7 +52,10 @@
     isa_ok( $div, 'HTML::Element' );
 
     ### tests of various output formats
-    is( $div->as_text(), " 1 2 3 ", "Dump element in text format" );
+    {
+        local $/ = '';
+        is( $div->as_text(), " 1 2 3 ", "Dump element in text format" );
+    };
     is( $div->as_trimmed_text(),
         "1 2 3", "Dump element in trimmed text format" );
     is( $div->as_text_trimmed(),
@@ -72,7 +75,10 @@
     isa_ok( $div2, 'HTML::Element' );
 
     ### test for RT #26436 user controlled white space
-    is( $div2->as_text(), " 1 &nbsp; 2 \xA0 3 ", "Dump element in text format" );
+    {
+        local $/ = '';
+        is( $div2->as_text(), " 1 &nbsp; 2 \xA0 3 ", "Dump element in text format" );
+    };
     is( $div2->as_trimmed_text(),
         "1 &nbsp; 2 \xA0 3", "Dump element in trimmed text format" );
     is( $div2->as_trimmed_text( extra_chars => '&nbsp;\xA0' ),
I'm not going to merge this as-is. It would break too much code that expects the current behavior.

The $block_tags hash really ought to be added to HTML::Tagset instead. One problem, though, is that <ins> and <del> are not necessarily block-level tags. They're either block-level or inline, depending on context.

as_text is never going to be a proper formatter like lynx. We already have the format method and HTML::FormatText for that.

I would consider a patch that added an option (like skip_dels) to add newlines after specified tags, as long as it wasn't too complex.

Pull requests on https://github.com/madsen/HTML-Tree are the preferred way to send patches.
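For anyone picking this up, an opt-in variant along those lines might look roughly like the sketch below. It is not HTML::Element code: it runs on a toy tree of plain hashes that only borrows the '_tag' and '_content' keys seen in the patch, and the names node(), text_with_breaks(), newline_after and separator are all made up for illustration. skip_dels handling is left out to keep it short.

```perl
use strict;
use warnings;

# Toy stand-in for an element: a hash with '_tag' and '_content' keys.
# (Assumption: this only imitates the layout the patch reads; it is not HTML::Element.)
sub node {
    my ( $tag, @content ) = @_;
    return { _tag => $tag, _content => [@content] };
}

# Hypothetical helper: concatenate the text under $root and append a separator
# after each element whose tag is listed in newline_after. Without that option
# the text is simply concatenated, so existing callers would see no change.
sub text_with_breaks {
    my ( $root, %opt ) = @_;
    my %break_after = map { $_ => 1 } @{ $opt{newline_after} || [] };
    my $sep  = defined $opt{separator} ? $opt{separator} : "\n";
    my $text = '';
    my @pile = ($root);

    while (@pile) {
        my $this = shift @pile;
        next unless defined $this;
        if ( !ref $this ) {    # plain text (or a queued separator)
            $text .= $this;
            next;
        }
        my $tag = $this->{_tag};
        next if $tag eq 'style' or $tag eq 'script';    # skipped, as in the patch
        my @rest = @{ $this->{_content} || [] };
        push @rest, $sep if $break_after{$tag};         # separator follows the element
        unshift @pile, @rest;
    }
    return $text;
}

my $tree = node( 'div',
    node( 'p', node( 'span', 'AAA' ), 'BBB' ),
    node( 'h2', 'CCC' ),
    'DDD',
);

my $plain  = text_with_breaks($tree);                                   # "AAABBBCCCDDD"
my $broken = text_with_breaks( $tree, newline_after => [qw(p h2)] );    # "AAABBB\nCCC\nDDD"
print "$plain\n---\n$broken\n";
```

Because the caller supplies the tag list, the question of whether <ins> and <del> count as block-level is left to the caller rather than hard-coded.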