Bug #33811 for XML-LibXML: need for ::Text filter

Tue Mar 04 16:54:03 2008 MARKOV [...] cpan.org - Ticket created

Subject:

need for ::Text filter

XML::Compile walks through many XML documents where white-space is unimportant. However, now it has to skip over ::Text nodes which are produced by the XML::LibXML::Parser.

It would be nice to have a
   $node->getNonTextChilds
   and $node->firstNonTextChild
   and $node->nextNonTextChild

This would reduce the number of expensive Perl <-> library interactions (speeding up the application), and simplify my code a lot.

Tue Mar 04 17:15:37 2008 pajas [...] matfyz.cz - Correspondence added

On Tue 04 Mar 2008 16:54:03, MARKOV wrote:

> XML::Compile walks through many XML documents where white-space is
> unimportant. However, now it has to skip over ::Text nodes which are
> produced by the XML::LibXML::Parser.
>
> It would be nice to have a
>   $node->getNonTextChilds
>   and $node->firstNonTextChild
>   and $node->nextNonTextChild
>
> This would reduce the number of expensive Perl <-> library interactions
> (speeding up the application), and simplify my code a lot.

What about using $parser->keep_blanks(0); ? Isn't that what you want instead?

-- Petr
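For readers who have not met this option, a minimal sketch of what keep_blanks(0) changes. The sample document and variable names are made up for illustration; the child counts depend on libxml2's blank-detection heuristics, so treat this as an illustration rather than a guarantee for every document.

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative input: two elements separated by whitespace only.
my $xml = "<a> <c/> <d/> </a>";

# Default parser: the whitespace between elements becomes Text nodes.
my $default_doc = XML::LibXML->new->parse_string($xml);
my @with_blanks = $default_doc->documentElement->childNodes;

# Same input, but blank text nodes are dropped while parsing.
my $parser = XML::LibXML->new;
$parser->keep_blanks(0);
my $stripped_doc   = $parser->parse_string($xml);
my @without_blanks = $stripped_doc->documentElement->childNodes;

printf "default: %d children, keep_blanks(0): %d children\n",
    scalar @with_blanks, scalar @without_blanks;
```

The setting applies to the whole parse, so it cannot be chosen per node afterwards.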

Tue Mar 04 17:15:38 2008 The RT System itself - Status changed from 'new' to 'open'

Tue Mar 04 17:25:24 2008 Mark [...] Overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #33811] need for ::Text filter
Date:	Tue, 4 Mar 2008 23:24:40 +0100
To:	Petr Pajas via RT <bug-XML-LibXML [...] rt.cpan.org>
From:	Mark Overmeer <mark [...] overmeer.net>

* Petr Pajas via RT (bug-XML-LibXML@rt.cpan.org) [080304 22:15]:

> <URL: http://rt.cpan.org/Ticket/Display.html?id=33811 >
>
> On Tue 04 Mar 2008 16:54:03, MARKOV wrote:
> > XML::Compile walks through many XML documents where white-space is
> > unimportant. However, now it has to skip over ::Text nodes which are
> > produced by the XML::LibXML::Parser.
> >
> > It would be nice to have a
> >   $node->getNonTextChilds
> >   and $node->firstNonTextChild
> >   and $node->nextNonTextChild
>
> What about using $parser->keep_blanks(0); ? Isn't that what you want
> instead?

I didn't find that one before. You're a genius ;-)
But how does that behave in a 'mixed' element? Can you switch it on and off at will while walking through the tree?

-- 
Regards,
MarkOv

------------------------------------------------------------------------
Mark Overmeer MSc                      MARKOV Solutions
Mark@Overmeer.net                      solutions@overmeer.net
http://Mark.Overmeer.net               http://solutions.overmeer.net

Wed Mar 05 02:57:39 2008 christian.glahn [...] lo-f.at - Correspondence added

Subject:	Re: [rt.cpan.org #33811] need for ::Text filter
Date:	Wed, 05 Mar 2008 08:58:16 +0100
To:	bug-XML-LibXML [...] rt.cpan.org
From:	Christian Glahn <christian.glahn [...] lo-f.at>

Hi Mark,

On Tue, 2008-03-04 at 17:25 -0500, Mark Overmeer via RT wrote:

> Queue: XML-LibXML
> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=33811 >
>
> * Petr Pajas via RT (bug-XML-LibXML@rt.cpan.org) [080304 22:15]:
> > What about using $parser->keep_blanks(0); ? Isn't that what you want
> > instead?
>
> I didn't find that one before. You're a genius ;-)
> But how does that behave in a 'mixed' element? Can you switch it
> on and off at will while walking through the tree?

Keep blanks is a parse-time option, i.e. the parser removes what the specs call "ignorable whitespace". The behavior of this feature may lead to unexpected results (which nevertheless conform to the specs). It also means that you cannot switch it on or off once the XML has been parsed.

However, to me the problem sounds a bit like a "tree walker" problem, which the iterator module tries to address. That would mean that you use an iterator on XPath results for traversing the DOM data structure instead of working directly on the DOM. This gives you greater flexibility than the DOM alone would.

Therefore, I think that you should have a look at the DOM Iterator specs at W3.org and the XML::LibXML::Iterator module.

Cheers
Christian

Wed Mar 05 14:24:35 2008 solutions [...] overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #33811] need for ::Text filter
Date:	Wed, 5 Mar 2008 20:00:11 +0100
To:	Christian Glahn via RT <bug-XML-LibXML [...] rt.cpan.org>
From:	Mark Overmeer <solutions [...] overmeer.net>

* Christian Glahn via RT (bug-XML-LibXML@rt.cpan.org) [080305 07:57]:

> > > > It would be nice to have a
> > > >   $node->getNonTextChilds
> > > >   and $node->firstNonTextChild
> > > >   and $node->nextNonTextChild

> > > What about using $parser->keep_blanks(0); ? Isn't that what you want
> > > instead?

> Keep blanks is a parse time option.

Then not usable for me.

> However, to me the problem sounds a bit like a "tree walker" problem,
> which the iterator module tries to address.
>
> Therefore, I think that you should have a look at the DOM Iterator specs
> at W3.org and the XML::LibXML::Iterator Module.

At the moment, I do have an iterator implementation, which is a bit more suitable for my needs: XML::Compile::Iterator. However, my point is that it retrieves a lot more nodes from the library than it needs to, which is costly. Every Perl iterator implementation is costly. Hence my request for a simple way to get only the non-text child nodes.

The only thing I really need is a simplification for

   my @childs = grep { ! $_->isa('XML::LibXML::Text') } $node->getChilds;

It would reduce the number of objects to be created (and skipped by me) by half. But I need this per node, so keep_blanks is apparently no option.

-- 
Regards,
MarkOv

Wed Mar 05 18:05:58 2008 christian.glahn [...] lo-f.at - Correspondence added

Subject:	Re: [rt.cpan.org #33811] need for ::Text filter
Date:	Thu, 06 Mar 2008 00:06:31 +0100
To:	bug-XML-LibXML [...] rt.cpan.org
From:	Christian Glahn <christian.glahn [...] lo-f.at>

On Wed, 2008-03-05 at 14:24 -0500, Mark Overmeer via RT wrote:

> Queue: XML-LibXML
> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=33811 >
>
> At the moment, I do have an iterator implementation, which is a bit more
> suitable for my needs: XML::Compile::Iterator. However, my point is that
> it retrieves a lot more nodes from the library than it needs to, which
> is costly. Every Perl iterator implementation is costly. Hence my
> request for a simple way to get only the non-text child nodes.
>
> The only thing I really need is a simplification for
>   my @childs = grep { ! $_->isa('XML::LibXML::Text') } $node->getChilds;
> It would reduce the number of objects to be created (and skipped by me)
> by half. But I need this per node, so keep_blanks is apparently no option.

You may also try to use the findnodes() function of XML::LibXML with the following expression:

   my @childElements = $node->findnodes( '*[not(self::text())]' );

This expression removes all text nodes from the child nodes of a node, and should be a tiny bit faster than the expression you use, because it runs entirely at the libxml2 level. However, I did not benchmark it.

Christian
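A small sketch of what that expression selects, using a made-up document that also contains a comment. In XPath, `*` on the child axis matches element children only, so the `not(self::text())` predicate looks redundant here; `node()[not(self::text())]` is the form that would keep comments and processing instructions while dropping text. Variable names are illustrative.

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative document: two elements, a comment, whitespace in between.
my $doc  = XML::LibXML->new->parse_string('<a> <c/> <!-- note --> <d/> </a>');
my $node = $doc->documentElement;

# "*" already matches element children only, so the predicate
# does not change the result.
my @elements       = $node->findnodes('*');
my @with_predicate = $node->findnodes('*[not(self::text())]');

# node() matches every child; the predicate drops text nodes but
# keeps the comment.
my @non_text = $node->findnodes('node()[not(self::text())]');

print join(',', map { $_->nodeName } @elements), "\n";
print scalar(@non_text), " non-text children\n";
```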

Wed Mar 05 18:43:11 2008 christian.glahn [...] lo-f.at - Correspondence added

Subject:	Re: [rt.cpan.org #33811] need for ::Text filter
Date:	Thu, 06 Mar 2008 00:43:57 +0100
To:	bug-XML-LibXML [...] rt.cpan.org
From:	Christian Glahn <christian.glahn [...] lo-f.at>

I got curious whether I had produced complete nonsense in my previous reply, and so benchmarked all three options for filtering child elements. To my own surprise I got the following results:

-----8<-----8<------8<------8<------
$ time perl -MXML::LibXML \
    -e '$doc = XML::LibXML->new->parse_string("<a> <c/> <d/> </a>"); @n = grep { !$_->isa("XML::LibXML::TextNode")} $doc->documentElement()->childNodes() for (1..100000);'

real 0m3.762s
user 0m3.752s
sys 0m0.008s

$ time perl -MXML::LibXML \
    -e '$doc = XML::LibXML->new->parse_string("<a> <c/> <d/> </a>"); @n = grep { $_->nodeType != XML_TEXT_NODE } $doc->documentElement()->childNodes() for (1..100000);'

real 0m4.024s
user 0m3.992s
sys 0m0.032s

$ time perl -MXML::LibXML \
    -e '$doc = XML::LibXML->new->parse_string("<a> <c/> <d/> </a>"); @n = $doc->documentElement()->findnodes( "*[not(self::text())]") for (1..100000);'

real 0m5.705s
user 0m5.668s
sys 0m0.036s
-----8<-----8<------8<------8<------

In order to be sure I increased the number of child nodes. The result is the following:

-----8<-----8<------8<------8<------
$ time perl -MXML::LibXML -e '$doc = XML::LibXML->new->parse_string("<a> <c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/> </a>"); @n = $doc->documentElement()->findnodes( "*[not(self::text())]") for (1..100000);'

real 0m9.785s
user 0m9.721s
sys 0m0.060s

$ time perl -MXML::LibXML -e '$doc = XML::LibXML->new->parse_string("<a> <c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/> </a>"); @n = grep { ! $_->isa("XML::LibXML::TextNode")} $doc->documentElement()->childNodes() for (1..100000);'

real 0m13.276s
user 0m13.197s
sys 0m0.080s

$ time perl -MXML::LibXML -e '$doc = XML::LibXML->new->parse_string("<a> <c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/> </a>"); @n = grep { $_->nodeType != XML_TEXT_NODE } $doc->documentElement()->childNodes() for (1..100000);'

real 0m14.687s
user 0m14.589s
sys 0m0.100s
-----8<-----8<------8<------8<------

The results mean that with small numbers of child nodes the XPath overhead masks its selection performance, while with increasing numbers of child nodes the library's XPath engine outperforms Perl's grep.

Christian
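A hedged sketch of how the same three-way comparison could be run with Perl's core Benchmark module instead of shell `time`. One caveat about the one-liners above: the text class in XML::LibXML is `XML::LibXML::Text`, while they test against `"XML::LibXML::TextNode"`, so that grep probably filtered nothing; the sketch uses `XML::LibXML::Text`. The document, the `%candidates` hash and its labels are illustrative, and no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use XML::LibXML;    # exports XML_TEXT_NODE by default

# Twelve child elements separated by whitespace, as in the larger test.
my $xml  = '<a> ' . ('<c/> <d/> ' x 6) . '</a>';
my $doc  = XML::LibXML->new->parse_string($xml);
my $root = $doc->documentElement;

my %candidates = (
    isa      => sub { my @n = grep { !$_->isa('XML::LibXML::Text') } $root->childNodes },
    nodeType => sub { my @n = grep { $_->nodeType != XML_TEXT_NODE } $root->childNodes },
    xpath    => sub { my @n = $root->findnodes('*') },
);

# Run each candidate for about one CPU second and print a comparison table.
cmpthese(-1, \%candidates);
```

Running all candidates in one process against the same parsed tree keeps parser start-up out of the measurement, which the `time perl -e` form cannot separate.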

Thu Mar 06 04:25:23 2008 pajas [...] matfyz.cz - Correspondence added

Hi Christian,

what baffles me completely about this is that $node->isa("XML::LibXML::Text") seems to be faster than $node->nodeType == XML_LIBXML_TEXT.

-- Petr

Thu Mar 06 14:00:45 2008 Mark [...] Overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #33811] need for ::Text filter
Date:	Thu, 6 Mar 2008 19:59:47 +0100
To:	Christian Glahn via RT <bug-XML-LibXML [...] rt.cpan.org>
From:	Mark Overmeer <mark [...] overmeer.net>

* Christian Glahn via RT (bug-XML-LibXML@rt.cpan.org) [080305 23:43]:

> I got curious whether I had produced complete nonsense in my previous
> reply, and so benchmarked all three options for filtering child elements.

> grep { !$_->isa("XML::LibXML::TextNode")}
>   user 0m3.752s
>
> grep { $_->nodeType != XML_TEXT_NODE }
>   user 0m3.992s

As Petr already mentioned, the call to nodeType() is more expensive than the isa(). This is probably caused by the delay in the XS layer, which creates an SV, whereas isa() is a simple lookup.

> @n = $doc->documentElement()->findnodes( "*[not(self::text())]")
>   user 0m5.668s

> In order to be sure I increased the number of child nodes.

> $doc->documentElement()->findnodes( "*[not(self::text())]")
>   user 0m9.721s
>
> grep { ! $_->isa("XML::LibXML::TextNode")}
>   user 0m13.197s
>
> grep { $_->nodeType != XML_TEXT_NODE }
>   user 0m14.589s
>
> The results mean that with small numbers of child nodes the XPath
> overhead masks its selection performance, while with increasing numbers
> of child nodes the library's XPath engine outperforms Perl's grep.

This is exactly my point: even an expensive XPath expression in the C library is faster than a simple filter in Perl. A simple C expression will probably gain even more. But the most important gain is in simpler Perl code.

-- 
Regards,
MarkOv

Thu Mar 06 14:02:16 2008 christian.glahn [...] lo-f.at - Correspondence added

Subject:	Re: [rt.cpan.org #33811] need for ::Text filter
Date:	Thu, 06 Mar 2008 20:02:33 +0100
To:	bug-XML-LibXML [...] rt.cpan.org
From:	Christian Glahn <christian.glahn [...] lo-f.at>

Hi Petr,

On Thu, 2008-03-06 at 04:25 -0500, Petr Pajas via RT wrote:

> Queue: XML-LibXML
> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=33811 >
>
> Hi Christian,
>
> what baffles me completely about this is that
> $node->isa("XML::LibXML::Text") seems to be faster than
> $node->nodeType == XML_LIBXML_TEXT.

I was surprised, too. I expected that an integer comparison would be faster. Maybe someone could have a look into the isa() implementation ;)

By the way, another remark for Mark: you may speed up the XPath-based text node filtering if you use Petr's XML::LibXML::XPathContext class and reuse the same pre-compiled XPath statement for each request. This will reduce some of the overhead that is part of the findnodes() function.

Christian
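A minimal sketch of the pre-compilation idea, assuming a release of XML::LibXML that ships `XML::LibXML::XPathExpression` (whether this is exactly the mechanism Christian had in mind is an assumption). The expression is compiled once and the compiled object is handed to `findnodes()` for each node; the document and variable names are illustrative.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string('<a> <c/> <d/> </a>');

# Compile the expression once ...
my $child_elements = XML::LibXML::XPathExpression->new('*');

# ... and reuse the compiled object for every node that is visited.
my @kids = $doc->documentElement->findnodes($child_elements);

print scalar(@kids), " child elements\n";
```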

Fri Mar 07 10:32:21 2008 Mark [...] Overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #33811] need for ::Text filter
Date:	Fri, 7 Mar 2008 15:58:20 +0100
To:	Christian Glahn via RT <bug-XML-LibXML [...] rt.cpan.org>
From:	Mark Overmeer <mark [...] overmeer.net>

* Christian Glahn via RT (bug-XML-LibXML@rt.cpan.org) [080306 19:02]:

> > $node->isa("XML::LibXML::Text") seems to be faster than
> > $node->nodeType == XML_LIBXML_TEXT.
>
> I was surprised, too. I expected that an integer comparison would be
> faster. Maybe someone could have a look into the isa() implementation ;)
>
> You may speed up the XPath-based text node filtering if you use Petr's
> XML::LibXML::XPathContext class and reuse the same pre-compiled XPath
> statement for each request. This will reduce some of the overhead that
> is part of the findnodes() function.

99.9% of coders will only use performance improvements if they are easy to use and easy to maintain. This XPath work-around is quite tricky. Selectively skipping text nodes is needed in all document-style XML readers which are schema based. So: a simple extension to the interface, as requested, is useful for many.

-- 
Regards,
MarkOv

Sun Oct 04 12:28:45 2009 pajas [...] matfyz.cz - Correspondence added

Hi,

returning to this bug, I wonder whether what you actually want is $node->getChildrenByTagNameNS. This method is implemented in C and is very fast, esp. when used with wildcards for both local-name and namespace; see this comparison:

$ time perl -MXML::LibXML -e '$doc = XML::LibXML->new->parse_string("<a><c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/><c/> <d/> <c/> <d/> </a>"); @n =$doc->documentElement()->findnodes( "*[not(self::text())]") for (1..100000);'

real 0m4.669s
user 0m4.632s
sys 0m0.028s

pajas@stain: ~/projects/XML-LibXML-devel/XML-LibXML
$ time perl -MXML::LibXML -e '$doc = XML::LibXML->new->parse_string("<a><c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/><c/> <d/> <c/> <d/> </a>"); @n =$doc->documentElement()->getChildrenByTagNameNS(q(* *)) for (1..100000);'

real 0m0.867s
user 0m0.836s
sys 0m0.016s

On the other hand, what you ask for in this ticket is just to filter out text nodes, whereas this method filters out everything but elements (text nodes, CDATA, comments, PIs).

I'm not actually willing to add exactly the methods you ask for, since they skip all text nodes (which may or may not contain data). Next time somebody will ask for methods that filter out comments, CDATA, PIs, etc., or some specific subset of these.

On the other hand, I believe the actual common use case is iterating over non-blank child nodes, i.e. child nodes that are either of a type other than text, or are text nodes containing more than just white-space. So what about adding these (non-DOM) methods:

firstNonBlankChild, nextNonBlankSibling and previousNonBlankSibling

Please comment,

-- Petr
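A short sketch of the wildcard call, with each wildcard passed as its own argument: the namespace URI first, then the local name. The sample document (with a comment and some character data) and variable names are illustrative; the point is that only element children come back.

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative document mixing elements, text and a comment.
my $doc  = XML::LibXML->new->parse_string('<a> <c/> text <!-- note --> <d/> </a>');
my $root = $doc->documentElement;

# First argument: namespace URI, second argument: local name.
# '*' in both positions asks for every child element.
my @elements = $root->getChildrenByTagNameNS('*', '*');

print join(',', map { $_->nodeName } @elements), "\n";
```

Because text, comments and processing instructions are all skipped, this is a stricter filter than "everything except text nodes".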

Sun Oct 04 17:22:03 2009 pajas [...] matfyz.cz - Correspondence added

Please try the SVN version if you can, possibly providing feedback by reopening this bug. I added four new convenience methods:

nonBlankChildNodes, firstNonBlankChild, nextNonBlankSibling, previousNonBlankSibling.

I'm using xmlIsNonBlank to test for "blankness". It considers empty or white-space-only Text or CDATA nodes as blanks. My first idea was that CDATA should never be considered blank, but that actually depends on the model, and the parser has a flag to turn CDATA into text nodes, so it makes no sense to make the distinction.

-- p
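A minimal sketch of how these methods can be used, assuming an XML::LibXML release that includes them. The document is made up: it has whitespace-only text around the elements and one text node with real data, which should be kept. Variable names are illustrative.

```perl
use strict;
use warnings;
use XML::LibXML;

# Blank text at both ends, real character data in the middle.
my $doc  = XML::LibXML->new->parse_string('<a> <c/>data<d/> </a>');
my $root = $doc->documentElement;

# One call returning every child that is not a blank text node.
my @kids = $root->nonBlankChildNodes;

# The same walk, one sibling at a time.
my @names;
my $n = $root->firstNonBlankChild;
while (defined $n) {
    push @names, $n->nodeName;
    $n = $n->nextNonBlankSibling;
}

print scalar(@kids), " non-blank children: @names\n";
```

The list form needs a single call per parent, while the sibling form suits walkers that may stop early.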

Sun Oct 04 17:22:05 2009 pajas [...] matfyz.cz - Status changed from 'open' to 'resolved'

Sun Oct 04 18:04:13 2009 Mark [...] Overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #33811] need for ::Text filter
Date:	Mon, 5 Oct 2009 00:03:52 +0200
To:	Petr Pajas via RT <bug-XML-LibXML [...] rt.cpan.org>
From:	Mark Overmeer <mark [...] overmeer.net>

* Petr Pajas via RT (bug-XML-LibXML@rt.cpan.org) [091004 16:28]:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=33811 >
>
> time perl -MXML::LibXML -e '$doc =
> XML::LibXML->new->parse_string("<a><c/> <d/> <c/> <d/>
> <c/> <d/> <c/> <d/><c/> <d/> <c/> <d/>
> </a>"); @n =$doc->documentElement()->findnodes( "*[not(self::text())]")
> for (1..100000);'
>
> real 0m4.669s
> user 0m4.632s
> sys 0m0.028s

This is over 20% faster than my current

$ time perl -MXML::LibXML -e '$doc = XML::LibXML->new->parse_string("<a><c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/><c/> <d/> <c/> <d/> </a>"); @n = grep {$_->isa("XML::LibXML::Element")} $doc->documentElement()->childNodes for (1..100000);'

> pajas@stain: ~/projects/XML-LibXML-devel/XML-LibXML
> $ time perl -MXML::LibXML -e '$doc =
> XML::LibXML->new->parse_string("<a><c/> <d/> <c/> <d/>
> <c/> <d/> <c/> <d/><c/> <d/> <c/> <d/>
> </a>"); @n =$doc->documentElement()->getChildrenByTagNameNS(q(* *)) for
> (1..100000);'
>
> real 0m0.867s
> user 0m0.836s
> sys 0m0.016s

This is faster, but does not collect any elements in @n. Mistake?

> I'm not actually willing to add exactly the methods you ask for, since
> they skip all text nodes (which may or may not contain data). Next time
> somebody will ask for methods that filter out comments, CDATA, PIs, etc.,
> or some specific subset of these.

Yes, this is a sane expectation. White-space layout, however, does have a special role in XML. It is the difference between toString(1) and toString(0). Besides, we do have a switch for it at the $parser level: $parser->keep_blanks(0);

> On the other hand, I believe the actual common use case is iterating over
> non-blank child nodes, i.e. child nodes that are either of a type other
> than text, or are text nodes containing more than just white-space. So
> what about adding these (non-DOM) methods:
>
> firstNonBlankChild, nextNonBlankSibling and previousNonBlankSibling

For me, I would favor nonBlankChilds(): as few XS calls as possible. It would probably save a considerable amount of time. Based on some $read_mixed_data parameter, I need to switch between the "take blanks" and "not take blanks" calls.

But this may be the start of a "method flood", as you expected.

Wouldn't it be a smart move to add filter options like this:

   $node->childNodes(XML_NON_BLANK_TEXT|XML_PI|XML_ELEMENT);

or a more general

   $parser->nodeFilter(XML_NON_BLANK_TEXT|XML_ELEMENT);
   $node->childNodes();

Of course, this would impact a lot of methods, which all need to call the filter before returning elements. Not as a 'parse-time' option, but as a run-time filter. But I expect that quite a lot of programs have to implement this filtering by themselves now.

-- 
Regards,
MarkOv

Sun Oct 04 18:04:13 2009 The RT System itself - Status changed from 'resolved' to 'open'

Sun Oct 04 19:05:36 2009 pajas [...] matfyz.cz - Correspondence added

On Sun 04 Oct 2009 18:04:13, Mark@Overmeer.net wrote:

> > pajas@stain: ~/projects/XML-LibXML-devel/XML-LibXML
> > $ time perl -MXML::LibXML -e '$doc =
> > XML::LibXML->new->parse_string("<a><c/> <d/> <c/> <d/>
> > <c/> <d/> <c/> <d/><c/> <d/> <c/> <d/>
> > </a>"); @n =$doc->documentElement()->getChildrenByTagNameNS(q(* *)) for
> > (1..100000);'
> >
> > real 0m0.867s
> > user 0m0.836s
> > sys 0m0.016s
>
> This is faster, but does not collect any elements in @n. Mistake?

Oops, I wrote q(* *) where I meant qw(* *), i.e. ('*','*'). The benchmark should read:

$ time perl -MXML::LibXML -e '$doc = XML::LibXML->new->parse_string("<a><c/> <d/> <c/> <d/> <c/> <d/> <c/> <d/><c/> <d/> <c/> <d/> </a>"); @n =$doc->documentElement()->getChildrenByTagNameNS(qw(* *)) for (1..100000);'

real 0m2.427s
user 0m2.328s
sys 0m0.044s
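The difference between the two quoting operators can be seen without XML::LibXML at all. A small core-Perl sketch (array names are illustrative): `q()` produces one string, so the method received a single argument `"* *"` and no local name, while `qw()` splits on whitespace into two arguments.

```perl
use strict;
use warnings;

my @single = (q(* *));     # one string that contains a space
my @pair   = (qw(* *));    # two separate one-character strings

printf "q():  %d argument(s), first is '%s'\n", scalar @single, $single[0];
printf "qw(): %d argument(s)\n", scalar @pair;
```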

> > On the other hand, I believe the actual common use case is iterating over
> > non-blank child nodes, i.e. child nodes that are either of a type other
> > than text, or are text nodes containing more than just white-space. So
> > what about adding these (non-DOM) methods:
> >
> > firstNonBlankChild, nextNonBlankSibling and previousNonBlankSibling
>
> For me, I would favor nonBlankChilds(): as few XS calls as possible.

Yes, I added that one too (as nonBlankChildNodes).

> It would probably save a considerable amount of time.

It's in the SVN, so you can benchmark for real. On the above test case I get similar results to getChildrenByTagNameNS(qw(* *)).

> Based on some $read_mixed_data parameter, I need to switch between the
> "take blanks" and "not take blanks" calls.
>
> But this may be the start of a "method flood", as you expected.
>
> Wouldn't it be a smart move to add filter options like this:
>   $node->childNodes(XML_NON_BLANK_TEXT|XML_PI|XML_ELEMENT);

Maybe, but this starts to get really ugly. First, it modifies/extends a method defined by the DOM spec, which some future DOM spec could extend in a different manner, so it had better be a different method; second, it adds another set of constants (the node type constants can't be or-ed), etc.

> or a more general
>   $parser->nodeFilter(XML_NON_BLANK_TEXT|XML_ELEMENT);
>   $node->childNodes();

This won't work :-) The parser object does not serve as a pool of global variables for other parts of the interface. Once the DOM tree is built, the parser has no relation to it. It is just a parser. And in fact, modifying global behavior in this way is generally a bad idea, since you don't know what other components of the program may be using the module and expecting it to behave differently.

> Of course, this would impact a lot of methods, which all need to
> call the filter before returning elements. Not as a 'parse-time' option,
> but as a run-time filter. But I expect that quite a lot of programs have
> to implement this filtering by themselves now.

Generally, a really bad idea from where I'm standing!

To sum up, something like $node->childNodes(XML_NON_BLANK_TEXT|XML_PI|XML_ELEMENT) is possible, but ugly. As previously discussed, XPath gives a cleaner and considerably more general solution for this kind of problem, esp. when used with pre-compiled XPath expressions to partly reduce the overhead. Also, modules like XML::CompactTree::XS can be used to slurp XML data into Perl data structures for further processing with just one XS call. Of course, where really optimal performance is needed, there is probably nothing that beats writing the whole code in C (and there are solutions for doing that right from Perl as well).

-- Petr

Sun Oct 04 19:06:12 2009 pajas [...] matfyz.cz - Status changed from 'open' to 'resolved'