Bug #69096 for XML-LibXML: findvalue from XML::LibXML 1.74 is very slow (regression)

Sun Jun 26 12:18:51 2011 vincent [...] vinc17.org - Ticket created

Subject:

findvalue from XML::LibXML 1.74 is very slow (regression)

With XML::LibXML 1.74, findvalue is very slow, at least on large files. See the attached testcase. Here are the timings on my Mac OS X 10.4.11 PowerPC machine, with XML::LibXML from MacPorts: * 0.51s user time with XML::LibXML 1.70 * 6.04s user time with XML::LibXML 1.74 Perl version: This is perl 5, version 12, subversion 3 (v5.12.3) built for darwin-multi-2level

Subject:

libxml-findvalue.pl

#!/usr/bin/env perl use strict; use XML::LibXML; my $file = 'test-huge.xml'; -e $file and die; open FILE, '>', $file or die; print FILE "<?xml version=\"1.0\"?>\n<root>\n" or die; foreach (1..6000) { print FILE "<a attr=\"test\"/>\n" or die; } print FILE "</root>\n" or die; close FILE or die; my $parser = XML::LibXML->new(); my $doc = $parser->parse_file($file); my $n = 0; foreach my $node ($doc->findnodes('//a')) { my $v = $node->findvalue('@attr'); $v eq 'test' or die; $n++; } print "$n\n"; unlink $file;

Sun Jun 26 12:38:36 2011 vincent [...] vinc17.org - Correspondence added

From:

vincent [...] vinc17.org

I didn't see XML::LibXML 1.75 was available, but this version is also affected.

Sun Jun 26 12:38:37 2011 The RT System itself - Status changed from 'new' to 'open'

Wed Jun 29 16:10:07 2011 stolen-from-bitcard [...] l2g.to - Correspondence added

From:

stolen-from-bitcard [...] l2g.to

I checked this out from Bitbucket and ran the sample script against the code, using Mercurial's bisect to narrow it down and re-running the script each time. It looks like this change is the culprit: https://bitbucket.org/shlomif/perl-xml-libxml/changeset/3c6332236d00

Wed Jun 29 17:51:06 2011 daniel.frett [...] ccci.org - Correspondence added

From:

daniel.frett [...] ccci.org

On Wed Jun 29 16:10:07 2011, l2g wrote: Show quoted text

> I checked this out from Bitbucket and ran the sample script against the > code, using Mercurial's bisect to narrow it down and re-running the > script each time. > > It looks like this change is the culprit: > > https://bitbucket.org/shlomif/perl-xml-libxml/changeset/3c6332236d00

I would guess that findvalue is triggering node normalization, before the patch node normalization would quit processing after encountering the first <a> node and ignore normalization of the rest of the document which would explain the speed difference. I'm not completely sure if findvalue needs to normalize the document or not, but xpath can be an expensive operation to run on each loop iteration, you might want to look into utilizing getAttribute if you are just looking for an attribute value on the nodes being processed.

Wed Jun 29 19:35:22 2011 vincent [...] vinc17.org - Correspondence added

From:

vincent [...] vinc17.net

Le Mer 29 Juin 2011 17:51:06, dfrett a écrit : Show quoted text

> I'm not completely sure if findvalue needs to normalize the document > or not,

I don't know, but it seems to do a lot of unnecessary work (or it is very inefficient). Show quoted text

> but xpath can be an expensive operation to run on each loop iteration,

It can be expensive with some expressions, but here it shouldn't (there can be just some overhead, that shouldn't be noticeable). The following XSLT stylesheet does something more or less equivalent as far as XML processing is concerned: <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="text" indent="no"/> <xsl:template match="/"> <xsl:for-each select="//a"> <xsl:variable name="v"> <xsl:value-of select="@attr"/> </xsl:variable> <xsl:if test="$v != 'test'"> <xsl:message terminate="yes">Error!</xsl:message> </xsl:if> </xsl:for-each> </xsl:template> </xsl:stylesheet> and with xsltproc, it is very fast. This shows that the XPath expression should not be a problem. Show quoted text

> you might want to look into utilizing getAttribute if you are just > looking for an attribute value on the nodes being processed.

Sometimes it can be used without much drawback, but sometimes it can render the code less maintainable. For instance, if I change an attribute "foo" by a child "foo" in the XML document, I can just change the XPath expression "@foo" by "foo". Saying not to use XPath is like saying not to use high-level programming languages.

Wed Jun 29 20:46:54 2011 daniel.frett [...] ccci.org - Correspondence added

From:

daniel.frett [...] ccci.org

I did a bit more digging after my last reply, it appears that the node normalization is something specific to the perl implementation only. It was implemented when the underlying _find function was first added to the perl implementation with https://bitbucket.org/shlomif/perl-xml-libxml/changeset/25f7f6f782cc What it currently does is examine the entire DOM and look for adjacent text nodes it can merge together (which is slow on large documents for obvious reasons). it's definitely something that doesn't need to be run for every XPath operation. The only possible reason I could see for needing to run this is if the XPath expression interacts with text nodes and a targeted text node has adjacent nodes that changes the evaluation of the XPath expression. I attached a script that doesn't work in XML::LibXML 1.70, but works in XML::LibXML >1.73 that illustrates this potential issue. an alternative solution would be to run the node normalization code when nodes are added to the DOM, I'm not completely sure what overhead this would add.

Subject:

libxmlNormalizeXpath.pl

#!/usr/bin/perl use strict; use XML::LibXML; my $dom = XML::LibXML::Document->createDocument('1.0', 'utf-8'); my $root = $dom->createElement('root'); $dom->setDocumentElement($root); my $e = $dom->createElement("child"); my $e2 = $dom->createElement("child"); my $t1 = $dom->createTextNode( "te" ); my $t2 = $dom->createTextNode( "st" ); $root->appendChild($e); $root->appendChild($e2); $e2->appendChild($t1); $e2->appendChild($t2); #print $t1->toString(), "\n"; #print $t2->toString(), "\n"; foreach ($dom->findnodes('//child[text()="test"]')) { print $_->toString(), "\n"; } #print $t1->toString(), "\n"; #print $t2->toString(), "\n";

Wed Jun 29 21:18:07 2011 vincent [...] vinc17.org - Correspondence added

From:

vincent [...] vinc17.org

Le Mer 29 Juin 2011 20:46:54, dfrett a écrit : Show quoted text

> I did a bit more digging after my last reply, it appears that the node > normalization is something specific to the perl implementation only.

[...] By node normalization, you mean adjacent text node merging? Or does the implementation do something else? Show quoted text

> What it currently does is examine the entire DOM and look for adjacent > text nodes it can merge together (which is slow on large documents > for obvious reasons). it's definitely something that doesn't need to > be run for every XPath operation.

Yes, that's a waste of time. [...] Show quoted text

> an alternative solution would be to run the node normalization code > when nodes are added to > the DOM, I'm not completely sure what overhead this would add.

I don't know the internals, but I would say that there would be two solutions: 1. Make sure that the document is always normalized by running the node normalization code each time it is needed (e.g. when a text node is added). 2. Do lazy node normalization by having a flag attached to the owner document that tells whether the document needs normalization, and run the node normalization code when the operation needs a normalized document. Perhaps see what solution has been chosen by other implementations. But I suppose that this is now already handled by the libxml2 library itself[*]. So, why isn't it used? [*] There was a thread about that: http://veillard.com/XML/messages/1020.html http://veillard.com/XML/messages/1023.html

Wed Jun 29 21:39:56 2011 daniel.frett [...] ccci.org - Correspondence added

From:

daniel.frett [...] ccci.org

On Wed Jun 29 21:18:07 2011, vincent@vinc17.org wrote: Show quoted text

> By node normalization, you mean adjacent text node merging? Or does the > implementation do something else?

it only does adjacent text node merging Show quoted text

> I don't know the internals, but I would say that there would be two > solutions: > > 1. Make sure that the document is always normalized by running the node > normalization code each time it is needed (e.g. when a text node is added). > > 2. Do lazy node normalization by having a flag attached to the owner > document that tells whether the document needs normalization, and run > the node normalization code when the operation needs a normalized document. >

I'm thinking #1 is the way to go, I would assume less parts of the code need to be touched in that scenario. It looks like all the operations that attach new nodes pass through a few common functions. I haven't checked yet, but I'm hoping that the xml parsing code normalizes text nodes before passing the DOM back to the perl glue layer. Show quoted text

> Perhaps see what solution has been chosen by other implementations. > But I suppose that this is now already handled by the libxml2 library > itself[*]. So, why isn't it used? >

according to the documentation it isn't as strict and doesn't free the original text nodes because they could still be referenced in perl (http://search.cpan.org/~shlomif/XML-LibXML- 1.75/lib/XML/LibXML/Node.pod#normalize) I'm going to try and get a patch written up to address this issue by normalizing text nodes when they are added.

Thu Jun 30 00:03:05 2011 daniel.frett [...] ccci.org - Correspondence added

From:

daniel.frett [...] ccci.org

While glancing over the normalize tests I noticed that it explicitly expects text nodes to not be merged until normalize is called, so i decided to do a bit of reading of the DOM spec. normalize is actually a DOM method for the purpose of merging adjacent text nodes (which implies that there can be adjacent text nodes that aren't merged) From the DOM level 3 spec (http://www.w3.org/TR/DOM-Level-3-Core/core.html#ID- C74D1578) Section 1.2.1 "Note that, with the exceptions of Document.normalizeDocument() and Node.normalize(), manipulating characters using DOM methods does not guarantee to preserve a fully- normalized text." with that in mind and the fact that performing text node normalization changes the actual DOM which may have other side effects for applications, I think the responsibility to normalize a document before performing XPath queries should fall on the application. Plus moving the responsibility to the application layer guarantees that normalization won't be accidentally omitted if the application is switched to another DOM compliant parser that doesn't automatically normalize the text nodes. This change could potentially break existing code, but that code was probably already broken due to the normalize method only working in very specific scenarios before 1.73 -Daniel Frett

Thu Jun 30 03:50:05 2011 vincent [...] vinc17.org - Correspondence added

From:

vincent [...] vinc17.org

Le Jeu 30 Juin 2011 00:03:05, dfrett a écrit : Show quoted text

> normalize is actually a DOM method for the purpose of merging adjacent > text nodes (which > implies that there can be adjacent text nodes that aren't merged)

OK, and in fact this DOM method is already supported by XML::LibXML, as documented in the XML::LibXML::Node(3pm) man page: normalize $node->normalize; This function normalizes adjacent text nodes. This function is not as strict as libxml2's xmlTextMerge() function, since it will not free a node that is still referenced by the perl layer. So, a DOM-compliant application should already use it when needed. $doc->normalizeDocument is missing, though. Show quoted text

> This change could potentially break existing code, but that code was > probably already broken > due to the normalize method only working in very specific scenarios > before 1.73

Yes. And existing code based on implicit normalization is not DOM compliant.

Sat Jul 09 12:18:07 2011 SHLOMIF [...] cpan.org - Correspondence added

Hi Daniel, On Thu Jun 30 00:03:05 2011, dfrett wrote: Show quoted text

> While glancing over the normalize tests I noticed that it explicitly > expects text nodes to not > be merged until normalize is called, so i decided to do a bit of > reading of the DOM spec. > > normalize is actually a DOM method for the purpose of merging adjacent > text nodes (which > implies that there can be adjacent text nodes that aren't merged) > > From the DOM level 3 spec (http://www.w3.org/TR/DOM-Level-3- > Core/core.html#ID- > C74D1578) Section 1.2.1 > > "Note that, with the exceptions of Document.normalizeDocument() and > Node.normalize(), > manipulating characters using DOM methods does not guarantee to > preserve a fully- > normalized text." > > with that in mind and the fact that performing text node normalization > changes the actual > DOM which may have other side effects for applications, I think the > responsibility to > normalize a document before performing XPath queries should fall on > the application. Plus > moving the responsibility to the application layer guarantees that > normalization won't be accidentally omitted if the application is > switched to another DOM compliant parser that > doesn't automatically normalize the text nodes. > > This change could potentially break existing code, but that code was > probably already broken > due to the normalize method only working in very specific scenarios > before 1.73 > > -Daniel Frett

Can you prepare that modification for release, and send me a BitBucket pull request? I'd like to close this bug. Regards, -- Shlomi Fish

Tue Jul 12 09:18:24 2011 daniel.frett [...] ccci.org - Correspondence added

From:

daniel.frett [...] ccci.org

On Sat Jul 09 12:18:07 2011, SHLOMIF wrote: Show quoted text

> Can you prepare that modification for release, and send me a BitBucket > pull request? I'd like to close this bug. > > Regards, > > -- Shlomi Fish

Just sent a pull request, I had gotten sidetracked with my day-to-day job ;) -Daniel

Tue Jul 12 16:01:32 2011 SHLOMIF [...] cpan.org - Correspondence added

On Tue Jul 12 09:18:24 2011, dfrett wrote: Show quoted text

> On Sat Jul 09 12:18:07 2011, SHLOMIF wrote:

> > Can you prepare that modification for release, and send me a BitBucket > > pull request? I'd like to close this bug. > > > > Regards, > > > > -- Shlomi Fish

> > Just sent a pull request, I had gotten sidetracked with my day-to-day

job ;) Show quoted text

>

Thanks! I applied it and added a Changes entry. Will make a new release soon. Regards, -- Shlomi Fish Show quoted text

> -Daniel

Tue Jul 12 16:01:33 2011 SHLOMIF [...] cpan.org - Status changed from 'open' to 'resolved'