Subject: newline separators between block elements in as_text()
Consider the following HTML sample:
<p>
<span>AAA</span>
BBB
</p>
<h2>CCC</h2>
DDD
The HTML::Element::as_text() method stringifies it as "AAABBBCCCDDD".
Although technically correct, this is far from how a "real" browser
renders the markup. links(1), lynx(1), and w3m(1) break the lines this way:
AAABBB
CCC
DDD
The attached patch tries to implement the same behavior in the as_text()
method. The value of $/ is inserted where a line break belongs (after
each block element and each <br>), and "\x{200b}" (Unicode zero-width
space) is appended after the content of every other element, separating
inline elements from adjacent text. y/\x{200b}//d can be used to remove
these separators and collapse the text for good; or even y/\x{200b}/\n/,
when one is sure that CSS makes a <span> tag act as a block.
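As a minimal sketch of the two post-processing options just mentioned, the snippet below works on an idealized string rather than on real as_text() output: it assumes $/ is "\n" and ignores any extra separators the patched method might emit for enclosing elements. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# Idealized output of the patched as_text() for the sample above (assumption):
# "\n" stands in for $/, and "\x{200b}" (ZWSP) follows the inline <span>.
my $text = "AAA\x{200b}BBB\nCCC\nDDD";

# Option 1: delete every ZWSP so inline text runs together again.
( my $collapsed = $text ) =~ tr/\x{200b}//d;

# Option 2: turn every ZWSP into a newline, treating inline elements as blocks.
( my $as_blocks = $text ) =~ tr/\x{200b}/\n/;

binmode STDOUT, ':encoding(UTF-8)';
print "$collapsed\n---\n$as_blocks\n";
```

The first form matches what the text browsers show; the second is only appropriate when the inline elements are known to be rendered as blocks.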
I'm not sure whether as_text() returning strings with "\n" would break
existing code; at the very least, 'building.t' had to be patched.
I would be glad to hear your opinions.
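For callers worried about compatibility, here is a rough sketch of how the previous concatenation could be recovered by localizing $/, which is the approach the patched tests take. join_blocks() is a hypothetical stand-in for the patched as_text(), not part of HTML::Element; it only mimics the "append $/ after each block" behavior.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the patched as_text(): it appends the current
# value of $/ after each block of text, as the patch does for block tags.
sub join_blocks {
    my @blocks = @_;
    return join '', map { $_ . $/ } @blocks;
}

# With the default $/ ("\n"), every block ends with a line break.
my $with_breaks = join_blocks( 'AAABBB', 'CCC' );

my $legacy;
{
    # local restores the previous value of $/ when this block exits.
    local $/ = '';
    $legacy = join_blocks( 'AAABBB', 'CCC' );
}

print $with_breaks;
print "$legacy\n";
```

Note that an empty $/ also switches readline to paragraph mode, so the localized scope should stay as small as possible.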
Subject: as_text.patch
diff -adNru HTML-Tree-4.2.orig/lib/HTML/Element.pm HTML-Tree-4.2/lib/HTML/Element.pm
--- HTML-Tree-4.2.orig/lib/HTML/Element.pm 2011-04-06 05:37:54.000000000 -0300
+++ HTML-Tree-4.2/lib/HTML/Element.pm 2011-12-05 14:07:36.560782121 -0200
@@ -166,6 +166,26 @@
my $nillio = [];
+# http://en.wikipedia.org/wiki/HTML_element#Block_elements
+my $block_tags = {
+ map { $_ => 1 } qw(
+ p
+ h1 h2 h3 h4 h5 h6
+ dl dt dd
+ ol ul li
+ dir
+ address
+ blockquote
+ center
+ del
+ div
+ hr
+ ins
+ noscript script
+ pre
+ )
+};
+
*HTML::Element::emptyElement = \%HTML::Tagset::emptyElement; # legacy
*HTML::Element::optionalEndTag = \%HTML::Tagset::optionalEndTag; # legacy
*HTML::Element::linkElements = \%HTML::Tagset::linkElements; # legacy
@@ -1773,10 +1793,24 @@
$text .= shift @pile;
}
else { # it's a ref -- traverse under it
- unshift @pile, @{ $this->{'_content'} || $nillio }
- unless ( $tag = ( $this = shift @pile )->{'_tag'} ) eq 'style'
- or $tag eq 'script'
- or ( $skip_dels and $tag eq 'del' );
+ $this = shift @pile;
+ $tag = $this->{'_tag'};
+ my @rest = @{ $this->{'_content'} || $nillio };
+
+ if ( exists $block_tags->{$tag} ) {
+ push @rest, $/;
+ }
+ elsif ( $tag eq 'br' ) {
+ push @rest, $/;
+ }
+ else {
+ push @rest, "\x{200b}"; # zero-width space (ZWSP)
+ }
+
+ unshift @pile, @rest
+ unless $tag eq 'style'
+ or $tag eq 'script'
+ or ( $skip_dels and $tag eq 'del' );
}
}
return $text;
diff -adNru HTML-Tree-4.2.orig/t/building.t HTML-Tree-4.2/t/building.t
--- HTML-Tree-4.2.orig/t/building.t 2011-04-06 05:37:54.000000000 -0300
+++ HTML-Tree-4.2/t/building.t 2011-12-05 14:09:55.985039039 -0200
@@ -52,7 +52,10 @@
isa_ok( $div, 'HTML::Element' );
### tests of various output formats
- is( $div->as_text(), " 1 2 3 ", "Dump element in text format" );
+ {
+ local $/ = '';
+ is( $div->as_text(), " 1 2 3 ", "Dump element in text format" );
+ };
is( $div->as_trimmed_text(), "1 2 3",
"Dump element in trimmed text format" );
is( $div->as_text_trimmed(), "1 2 3",
@@ -72,7 +75,10 @@
isa_ok( $div2, 'HTML::Element' );
### test for RT #26436 user controlled white space
- is( $div2->as_text(), " 1 2 \xA0 3 ", "Dump element in text format" );
+ {
+ local $/ = '';
+ is( $div2->as_text(), " 1 2 \xA0 3 ", "Dump element in text format" );
+ };
is( $div2->as_trimmed_text(),
"1 2 \xA0 3", "Dump element in trimmed text format" );
is( $div2->as_trimmed_text( extra_chars => ' \xA0' ),