Bug #65824 for XML-XPathEngine: as_litteral does a join with '' (null); could this be user defined?

Thu Feb 17 04:49:23 2011 JEB [...] cpan.org - Ticket created

Subject:

as_litteral does a join with '' (null); could this be user defined?

Hello,

When I get a NodeSet (from HTML::TreeBuilder::XPath) for a path that defines a block level element (for example, a "//div"), then text is returned as a to_litteral() of the child nodes. If I have several P child nodes, all the text from these paragraphs is concatenated, currently with no separation between the paragraphs.

My goal is to have text such as:

    <p>This is a test</p><p>Another one</p>

Be returned as:

    This is a test\nAnother one

Whereas currently this is:

    This is a testAnother one

Is it feasible to pass an argument to to_litteral() for the character to join() these together with? In the case above, my preferred join character would be new line ("\n") on block level elements, and null ("") on inline elements.

... or am I going the wrong way?

Many thanks,
JEB

Thanks

Fri Mar 04 03:24:55 2011 MIROD [...] cpan.org - Correspondence added


Oops

Sorry for answering so late, I completely forgot about this report.

There is a simple way to do this: get the nodes with findnodes, and then join their text:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::TreeBuilder::XPath;

    my $root = HTML::TreeBuilder::XPath->new_from_content(
        '<html><p>This is a test</p><p>Another one</p></html>');

    # Join the text of each paragraph with a newline
    my $pars = join "\n", map { $_->as_text } $root->findnodes('//p');
    print $pars, "\n";

Is this what you were looking for?

Thanks

__
mirod
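The same idea can be wrapped so that the separator is chosen by the caller, which is what the ticket asks for. This is only a minimal sketch: `joined_text` is a hypothetical helper written for illustration, not a method of XML::XPathEngine or HTML::TreeBuilder::XPath, and the sample markup is assumed.

```perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Hypothetical helper (not a library method): return the text of every
# node matching $xpath, joined with a caller-supplied separator.
sub joined_text {
    my ( $root, $xpath, $sep ) = @_;
    $sep = '' unless defined $sep;    # default: join with the empty string
    return join $sep, map { $_->as_text } $root->findnodes($xpath);
}

# Assumed sample document
my $root = HTML::TreeBuilder::XPath->new_from_content(
    '<html><body><div><p>This is a test</p><p>Another one</p></div></body></html>');

print joined_text( $root, '//div/p', "\n" ), "\n";
```

Leaving the separator undefined falls back to the empty string, so existing callers would see the concatenated text as before.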

Fri Mar 04 03:24:55 2011 The RT System itself - Status changed from 'new' to 'open'

Fri Mar 04 03:36:36 2011 james [...] rcpt.to - Correspondence added

Subject:	Re: [rt.cpan.org #65824] as_litteral does a join with '' (null); could this be user defined?
Date:	Fri, 04 Mar 2011 16:35:48 +0800
To:	bug-XML-XPathEngine [...] rt.cpan.org
From:	James Bromberger <james [...] rcpt.to>


Well, I am handling HTML written out on the web, so it's not always in "p" nodes. So I wrote the following:

    sub plain_text_from_element {
        my ( $self, $element ) = @_;
        my $string;
        die "Not an HTML::Element: $element"
            if ref($element) ne "HTML::Element";
        foreach ( $element->content_list ) {
            if ( not ref($_) ) {
                # Plain text child
                $string .= $_;
            }
            elsif ( defined $inline{ $_->tag } ) {
                # Inline element: append with no separator
                my $new = $self->plain_text_from_element($_);
                next unless defined $new;
                $string .= $new;
            }
            else {
                # Block-level element: tidy whitespace, separate with a blank line
                my $new = $self->plain_text_from_element($_);
                next unless defined $new;
                $new =~ s/\s{2,}/ /g;
                $new =~ s/^\s+//g;
                $new =~ s/\s+$//g;
                next unless length($new);
                my $word_count = $new =~ s/((^|\s)\S)/$1/g;
                #print "Found $word_count words\n";
                $string .= ( $string ? "\n\n" : "" ) . $new;
            }
        }
        return $string;
    }

This uses a hash that lets me determine if a particular element is defined (as by W3C) as a block level or inline element. Block level elements are thus joined with "\n\n", and inline ones are just joined with "".

Thus if I have an XPath representing say a DIV of a document, using this function returns me human readable text, whereas the default was squishing together paragraphs (as the end of one paragraph had no trailing space).

I was looking for a CPAN module that defined what elements were block and what were inline, but I couldn't see this. So my hash ended up being in my own library of:

    my @inline = qw ( a abbr acronym b basefont bdo big cite code dfn em
        font i img input kbd label q s samp select small span strike
        strong sub sup textarea tt u var);
    my %inline = map { $_ => 1 } @inline;

Case in point, some sites in their infinite wisdom (!) decide that using <td> is a suitable separator (with no hint of a ); using this returns the text quite nicely as a new paragraph.

JEB

-- 
Mobile: +61 422 166 708, Email: james_AT_rcpt.to
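For anyone wanting to try the block/inline approach as a standalone script, here is a rough sketch along the same lines. It is not the code from the message above: `plain_text` and its `$block_sep` parameter are hypothetical names, the class check uses `isa` instead of comparing `ref` (an assumption, so that subclasses of HTML::Element are accepted), and the sample markup is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;
use Scalar::Util qw(blessed);

# Inline elements as listed in the message above; any other tag is
# treated as block level.
my %inline = map { $_ => 1 } qw(
    a abbr acronym b basefont bdo big cite code dfn em font i img input
    kbd label q s samp select small span strike strong sub sup textarea
    tt u var
);

# Hypothetical helper: $block_sep is the string placed between the text
# of block-level children (defaults to "\n\n").
sub plain_text {
    my ( $element, $block_sep ) = @_;
    $block_sep = "\n\n" unless defined $block_sep;

    die "Not an HTML::Element: $element"
        unless blessed($element) && $element->isa('HTML::Element');

    my $string = '';
    foreach my $child ( $element->content_list ) {
        if ( !ref $child ) {
            $string .= $child;    # plain text child
        }
        elsif ( $inline{ $child->tag } ) {
            # Inline element: append with no separator
            $string .= plain_text( $child, $block_sep );
        }
        else {
            # Block-level element: collapse and trim whitespace first
            my $text = plain_text( $child, $block_sep );
            $text =~ s/\s+/ /g;
            $text =~ s/^\s+|\s+$//g;
            next unless length $text;
            $string .= ( length $string ? $block_sep : '' ) . $text;
        }
    }
    return $string;
}

# Assumed sample document
my $root = HTML::TreeBuilder::XPath->new_from_content(
    '<html><body><div><p>This is a <b>test</b></p><p>Another one</p></div></body></html>'
);

for my $div ( $root->findnodes('//div') ) {
    print plain_text( $div, "\n" ), "\n";
}
$root->delete;
```

One consequence of collapsing whitespace at every block level is that separators produced by nested blocks are flattened to single spaces one level up, so only the direct block children of the element passed in end up separated by `$block_sep`.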