On 4/03/2011 4:24 PM, MIROD via RT wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=65824 >
>
> On Thu Feb 17 04:49:23 2011, JEB wrote:
>> Hello,
>> When I get a NodeSet (from HTML::TreeBuilder::XPath) for a path that
>> defines a block level element (for example, a "//div"), then text is
>> returned as a to_literal() of the child nodes. If I have several P
>> child nodes, all the text from these paragraphs is concatenated,
>> currently with no separation between the paragraphs.
>>
>> My goal is to have text such as:
>> <p>This is <strong>a test</strong></p><p>Another <em>one</em></p>
>>
>> Be returned as:
>> This is a test\nAnother one
>>
>> Whereas currently this is:
>> This is a testAnother one
>>
>> Is it feasible to pass an argument to to_literal() giving the string to
>> join() these together with? In the case above, my preferred separator
>> would be a newline ("\n") for block-level elements, and the empty string
>> ("") for inline elements.
>>
>> ... or am I going the wrong way?
>
> Oops
>
> Sorry for answering so late, I completely forgot about this report.
>
> There is a simple way to do this: get the nodes with findnodes, and then
> join their text:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> use HTML::TreeBuilder::XPath;
>
> my $root = HTML::TreeBuilder::XPath->new_from_content(
>     '<html><p>This is <strong>a test</strong></p><p>Another <em>one</em></p>');
>
> my $pars= join "\n", map { $_->as_text } $root->findnodes( '//p');
> print $pars, "\n";
>
>
> Is this what you were looking for?
Well, I am handling HTML as written out on the web, so it's not always in "p"
nodes. So I wrote the following:
sub plain_text_from_element {
    my ( $self, $element ) = @_;
    my $string;
    die "Not an HTML::Element: $element"
        if ref($element) ne "HTML::Element";
    foreach ( $element->content_list ) {
        if ( not ref($_) ) {
            # Plain text node: append as-is.
            $string .= $_;
        }
        elsif ( defined $inline{ $_->tag } ) {
            # Inline element: append its text with no separator.
            my $new = $self->plain_text_from_element($_);
            next unless defined $new;
            $string .= $new;
        }
        else {
            # Block-level element: tidy whitespace, then separate it
            # from any preceding text with a blank line.
            my $new = $self->plain_text_from_element($_);
            next unless defined $new;
            $new =~ s/\s{2,}/ /g;
            $new =~ s/^\s+//;
            $new =~ s/\s+$//;
            next unless length($new);
            # Word count, only used by the debugging print below.
            my $word_count = $new =~ s/((^|\s)\S)/$1/g;
            #print "Found $word_count words\n";
            $string .= ( defined $string && length $string ? "\n\n" : "" ) . $new;
        }
    }
    return $string;
}
This uses a hash that tells me whether a particular element is defined
(by the W3C) as block-level or inline. Block-level elements are thus
joined with "\n\n", and inline ones are joined with "". So if I have an
XPath matching, say, a DIV of a document, this function returns
human-readable text, whereas the default squished the paragraphs
together (the end of one paragraph has no trailing space).
I was looking for a CPAN module that defined which elements are block
and which are inline, but I couldn't find one. So the hash ended up in
my own library as:
my @inline = qw(
    a abbr acronym b basefont bdo big cite code dfn em font i img
    input kbd label q s samp select small span strike strong sub sup
    textarea tt u var
);
my %inline = map { $_ => 1 } @inline;
Case in point: some sites, in their infinite wisdom (!), decide that
<td> is a suitable separator (with no hint of a <p>); this function
returns that text quite nicely as separate paragraphs.
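
For anyone who wants to try this, here is a minimal standalone sketch run against the sample HTML from the original report. The name block_text is hypothetical, and it is written as a plain function rather than a method and without the class check, so treat it as an illustration of the approach rather than the exact code from my library.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Inline elements, same list as above; any other tag is treated as block-level.
my %inline = map { $_ => 1 } qw(
    a abbr acronym b basefont bdo big cite code dfn em font i img
    input kbd label q s samp select small span strike strong sub sup
    textarea tt u var
);

# Hypothetical standalone variant of plain_text_from_element.
sub block_text {
    my ($element) = @_;
    my $string = '';
    foreach my $child ( $element->content_list ) {
        if ( not ref $child ) {
            $string .= $child;                # plain text node
        }
        elsif ( $inline{ $child->tag } ) {
            $string .= block_text($child);    # inline: no separator
        }
        else {
            my $new = block_text($child);     # block: tidy, then separate
            $new =~ s/\s{2,}/ /g;
            $new =~ s/^\s+//;
            $new =~ s/\s+$//;
            next unless length $new;
            $string .= ( length $string ? "\n\n" : '' ) . $new;
        }
    }
    return $string;
}

my $root = HTML::TreeBuilder::XPath->new_from_content(
    '<html><body><div><p>This is <strong>a test</strong></p>'
  . '<p>Another <em>one</em></p></div></body></html>' );

my ($div) = $root->findnodes('//div');
my $text  = block_text($div);
print $text, "\n";

$root->delete;
```

The two paragraphs should come out separated by a blank line instead of being run together as "This is a testAnother one".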
JEB
--
Mobile: +61 422 166 708, Email: james_AT_rcpt.to