
This queue is for tickets about the XML-XPathEngine CPAN distribution.

Report information
The Basics
Id: 65824
Status: open
Priority: 0
Queue: XML-XPathEngine

People
Owner: Nobody in particular
Requestors: JEB [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 0.12
Fixed in: (no value)



Subject: as_litteral does a join with '' (null); could this be user defined?
Hello,

When I get a NodeSet (from HTML::TreeBuilder::XPath) for a path that defines a block level element (for example, a "//div"), then text is returned as a to_litteral() of the child nodes. If I have several P child nodes, all the text from these paragraphs is concatenated, currently with no separation between the paragraphs.

My goal is to have text such as:
<p>This is <strong>a test</strong></p><p>Another <em>one</em></p>

Be returned as:
This is a test\nAnother one

Whereas currently this is:
This is a testAnother one

Is it feasible to pass an argument to to_litteral() for the character to join() these together with? In the case above, my preferred join character would be new line ("\n") on block level elements, and null ("") on inline elements.

... or am I going the wrong way?

Many thanks,
JEB

Thanks
On Thu Feb 17 04:49:23 2011, JEB wrote:
> Hello,
> When I get a NodeSet (from HTML::TreeBuilder::XPath) for a path that
> defines a block level element (for example, a "//div"), then text is
> returned as a to_litteral() of the child nodes. If I have several P
> child nodes, all the text from these paragraphs is concatenated,
> currently with no separation between the paragraphs.
>
> My goal is to have text such as:
> <p>This is <strong>a test</strong></p><p>Another <em>one</em></p>
>
> Be returned as:
> This is a test\nAnother one
>
> Whereas currently this is:
> This is a testAnother one
>
> Is it feasible to pass an argument to to_litteral() for the character to
> join() these together with? In the case above, my preferred join
> character would be new line ("\n") on block level elements, and null
> ("") on inline elements.
>
> ... or am I going the wrong way?
Oops

Sorry for answering so late, I completely forgot about this report.

There is a simple way to do this: get the nodes with findnodes, and then join their text:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::TreeBuilder::XPath;

    my $root = HTML::TreeBuilder::XPath->new_from_content(
        '<html><p>This is <strong>a test</strong></p><p>Another <em>one</em></p>');

    my $pars= join "\n", map { $_->as_text } $root->findnodes( '//p');
    print $pars, "\n";

Is this what you were looking for?

Thanks

__
mirod
Subject: Re: [rt.cpan.org #65824] as_litteral does a join with '' (null); could this be user defined?
Date: Fri, 04 Mar 2011 16:35:48 +0800
To: bug-XML-XPathEngine [...] rt.cpan.org
From: James Bromberger <james [...] rcpt.to>
On 4/03/2011 4:24 PM, MIROD via RT wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=65824 >
>
> On Thu Feb 17 04:49:23 2011, JEB wrote:
>> Hello,
>> When I get a NodeSet (from HTML::TreeBuilder::XPath) for a path that
>> defines a block level element (for example, a "//div"), then text is
>> returned as a to_litteral() of the child nodes. If I have several P
>> child nodes, all the text from these paragraphs is concatenated,
>> currently with no separation between the paragraphs.
>>
>> My goal is to have text such as:
>> <p>This is <strong>a test</strong></p><p>Another <em>one</em></p>
>>
>> Be returned as:
>> This is a test\nAnother one
>>
>> Whereas currently this is:
>> This is a testAnother one
>>
>> Is it feasible to pass an argument to to_litteral() for the character to
>> join() these together with? In the case above, my preferred join
>> character would be new line ("\n") on block level elements, and null
>> ("") on inline elements.
>>
>> ... or am I going the wrong way?
>
> Oops
>
> Sorry for answering so late, I completely forgot about this report.
>
> There is a simple way to do this: get the nodes with findnodes, and then
> join their text:
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> use HTML::TreeBuilder::XPath;
>
> my $root = HTML::TreeBuilder::XPath->new_from_content( '<html><p>This is
> <strong>a test</strong></p><p>Another <em>one</em></p>');
>
> my $pars= join "\n", map { $_->as_text } $root->findnodes( '//p');
> print $pars, "\n";
>
>
> Is this what you were looking for?
Well, I am handling HTML written out on the web, so it's not always in "p" nodes. So I wrote the following:

    sub plain_text_from_element {
        my ( $self, $element ) = @_;
        my $string;
        die "Not an HTML::Element: $element"
            if ref($element) ne "HTML::Element";
        foreach ( $element->content_list ) {
            if ( not ref($_) ) {
                # Plain text node: append as-is.
                $string .= $_;
            }
            elsif ( defined $inline{ $_->tag } ) {
                # Inline element: append its text with no separator.
                my $new = $self->plain_text_from_element($_);
                next unless defined $new;
                $string .= $new if defined $new;
            }
            else {
                # Block-level element: tidy whitespace, then join with a blank line.
                my $new = $self->plain_text_from_element($_);
                next unless defined $new;
                $new =~ s/\s{2,}/ /g;
                $new =~ s/^\s+//g;
                $new =~ s/\s+$//g;
                next unless length($new);
                my $word_count = $new =~ s/((^|\s)\S)/$1/g;
                #print "Found $word_count words\n";
                $string .= ( $string ? "\n\n" : "" ) . $new;
            }
        }
        return $string;
    }

This uses a hash that lets me determine whether a particular element is defined (by the W3C) as a block level or inline element. Block level elements are thus joined with "\n\n", and inline ones are just joined with "". So if I have an XPath representing, say, a DIV of a document, this function returns human readable text, whereas the default was squishing paragraphs together (as the end of one paragraph had no trailing space).

I was looking for a CPAN module that defined which elements were block and which were inline, but I couldn't find one. So the hash ended up in my own library as:

    my @inline = qw ( a abbr acronym b basefont bdo big cite code dfn em
        font i img input kbd label q s samp select small span strike
        strong sub sup textarea tt u var);
    my %inline = map { $_ => 1 } @inline;

Case in point, some sites in their infinite wisdom (!) decide that using <td> is a suitable separator (with no hint of a <p>); using this returns the text quite nicely as a new paragraph.

JEB

--
Mobile: +61 422 166 708, Email: james_AT_rcpt.to
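
For readers who want to try the block/inline joining idea without installing any CPAN modules, here is a minimal sketch of the same approach using core Perl only. It is an illustration under stated assumptions, not the code from the message above: the nested array refs ([ tag, children... ], with plain strings as text nodes) are a hypothetical stand-in for an HTML::Element tree, and the function name plain_text is made up for this example. The %inline list is copied from the message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Inline tags, copied from the list in the message above.
my %inline = map { $_ => 1 } qw(
    a abbr acronym b basefont bdo big cite code dfn em font i img input
    kbd label q s samp select small span strike strong sub sup textarea
    tt u var
);

# Hypothetical stand-in for an HTML::Element tree: each node is
# [ $tag, @children ], and a plain string is a text node.
sub plain_text {
    my ($node) = @_;
    my $string = '';
    my ( $tag, @children ) = @$node;
    for my $child (@children) {
        if ( !ref $child ) {
            $string .= $child;    # text node: append unchanged
            next;
        }
        my $new = plain_text($child);
        if ( $inline{ $child->[0] } ) {
            $string .= $new;      # inline element: join with ''
            next;
        }
        # Block-level element: normalise whitespace, then join with "\n\n".
        $new =~ s/\s{2,}/ /g;
        $new =~ s/^\s+//;
        $new =~ s/\s+$//;
        next unless length $new;
        $string .= ( length $string ? "\n\n" : '' ) . $new;
    }
    return $string;
}

# The two-paragraph example from the original report, wrapped in a div.
my $tree = [ div =>
    [ p => 'This is ', [ strong => 'a test' ] ],
    [ p => 'Another ', [ em => 'one' ] ],
];
print plain_text($tree), "\n";
```

Run as a script, this prints "This is a test" and "Another one" separated by a blank line. Any tag missing from %inline is treated as block level, which is why a <td> would also start a new paragraph.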