On Sun Oct 04 18:04:13 2009, Mark@Overmeer.net wrote:
> > pajas@stain: ~/projects/XML-LibXML-devel/XML-LibXML
> > $ time perl -MXML::LibXML -e '$doc =
> > XML::LibXML->new->parse_string("<a><b><c/></b> <d/> <b><c/></b> <d/>
> > <b><c/></b> <d/> <b><c/></b> <d/><b><c/></b> <d/> <b><c/></b> <d/>
> > </a>"); @n =$doc->documentElement()->getChildrenByTagNameNS(q(* *)) for
> > (1..100000);'
> >
> > real 0m0.867s
> > user 0m0.836s
> > sys 0m0.016s
>
> This is faster, but does not collect any elements in @n. Mistake?
Oops, I wrote q(* *) where I meant qw(* *), i.e. ('*','*'). The
benchmark should read:
time perl -MXML::LibXML -e '$doc =
XML::LibXML->new->parse_string("<a><b><c/></b> <d/> <b><c/></b> <d/>
<b><c/></b> <d/> <b><c/></b> <d/><b><c/></b> <d/> <b><c/></b> <d/>
</a>"); @n =$doc->documentElement()->getChildrenByTagNameNS(qw(* *)) for
(1..100000);'
real 0m2.427s
user 0m2.328s
sys 0m0.044s
> > On the other hand I believe the actual common use case is iterating
> > over non-blank child nodes, i.e. child nodes that are either of a type
> > other than text or are text nodes containing more than just white-space.
> > So what about adding these (non-DOM) methods:
> >
> > firstNonBlankChild, nextNonBlankSibling and previousNonBlankSibling
>
> For me, I would favor nonBlankChilds(): as few XS calls as possible.
Yes, I added that one too (as nonBlankChildNodes).
> It would probably save a considerable amount of time.
It's in SVN now, so you can benchmark it for real. On the above test
case I get results similar to getChildrenByTagNameNS(qw(* *)).
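For illustration, here is a minimal sketch of how the new methods could be used. It assumes an XML::LibXML build that already includes nonBlankChildNodes, firstNonBlankChild and nextNonBlankSibling (at the time of writing, the SVN version); the sample document and variable names are made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample: two blank text nodes and one non-blank text node (" text ").
my $doc  = XML::LibXML->new->parse_string('<a><b><c/></b> <d/> text <b/></a>');
my $root = $doc->documentElement;

# One XS call: all children except whitespace-only text nodes.
my @kids = $root->nonBlankChildNodes;
print scalar(@kids), " non-blank children\n";

# The same traversal, one node at a time.
my $count = 0;
for (my $n = $root->firstNonBlankChild; $n; $n = $n->nextNonBlankSibling) {
    $count++;
}
print "$count nodes visited by sibling iteration\n";
```

The list-returning form costs a single XS call per parent, which is why it is the one to prefer when all children are needed anyway.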
> Based on some
> $read_mixed_data parameter, I need to switch between the "take blanks"
> and "not take blanks" calls.
>
> But this may be the start of a "method flood" as you expected.
>
> Wouldn't it be a smart move to add filter options like this:
> $node->childNodes(XML_NON_BLANK_TEXT|XML_PI|XML_ELEMENT);
Maybe, but this starts to get really ugly. First, it modifies/extends a
method defined by the DOM spec, which some future DOM spec might extend
in a different manner, so it had better be a different method. Second,
it adds another set of constants (the node type constants can't be
or-ed), etc. etc.
> or a more general
> $parser->nodeFilter(XML_NON_BLANK_TEXT|XML_ELEMENT);
> $node->childNodes();
This won't work :-) The parser object does not serve as a pool of global
variables for other parts of the interface. Once the DOM tree is built,
the parser has no further relation to it. It is just a parser.
And in fact, modifying global behavior in this way is generally a bad
idea, since you don't know what other components of the program may be
using the module and expecting it to behave differently.
> Of course, this would impact a lot of methods, which all need to
> call the filter before returning elements.
> Not as 'parse-time' option,
> but run-time filter. But I expect that quite a lot of programs have
> to implement this filtering now by themselves.
Generally, a really bad idea from where I'm standing!
To sum up, something like
$node->childNodes(XML_NON_BLANK_TEXT|XML_PI|XML_ELEMENT) is possible,
but ugly. As previously discussed, XPath gives a cleaner and considerably
more general solution for this kind of problem, esp. when used with
pre-compiled XPath expressions to partly reduce the overhead.
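A rough sketch of the XPath approach, assuming XML::LibXML::XPathExpression is available in your version. The expression below is just one possible way to select elements, processing instructions and non-blank text; the sample document and loop count are made up.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample document, as in a typical mixed-content case.
my $doc  = XML::LibXML->new->parse_string('<a><b><c/></b> <d/> text <b/></a>');
my $root = $doc->documentElement;

# Compile the expression once, outside any loop.
my $non_blank = XML::LibXML::XPathExpression->new(
    '*|processing-instruction()|text()[normalize-space()]'
);

# Reuse the compiled expression; findnodes accepts it in place of a string.
my @n;
@n = $root->findnodes($non_blank) for 1 .. 1000;
print scalar(@n), " nodes selected\n";
```

Changing what counts as "interesting" then only means changing the expression, not adding another method or another set of constants.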
Also, modules like XML::CompactTree::XS can be used to slurp XML data
into Perl data structures for further processing with just one XS call.
Of course, where truly optimal performance is needed, there is probably
nothing that beats writing the whole code in C (and there are solutions
for doing that right from Perl as well).
-- Petr