Bug #71139 for Pod-Parser: 3 possible optimisations

Thu Sep 22 05:54:45 2011 nwc10+please+use+perlbug+for+perl+queries [...] colon.colondot.net - Ticket created

Subject:

3 possible optimisations

In core, t/porting/podcheck.t is slow. Slow as in roughly a minute runtime, which is frustrating. But to be fair to it, it's doing a *lot* of work. I ran it under Devel::NYTProf. [Runtime there about 150s] There's nothing obvious. However, these 3 changes seem to help. [diff is from the path in the core - I hope that patch -p3 will apply it cleanly. All tests pass in the core] diff --git a/cpan/Pod-Parser/lib/Pod/Checker.pm b/cpan/Pod-Parser/lib/Pod/Checker.pm index a230542..f901802 100644 --- a/cpan/Pod-Parser/lib/Pod/Checker.pm +++ b/cpan/Pod-Parser/lib/Pod/Checker.pm @@ -1099,7 +1099,7 @@ sub _check_ptree { $text .= $self->_check_ptree($contents, $line, $file, "$nestlist$cmd"); next; } - if($nestlist =~ /$cmd/) { + if(index($nestlist, $cmd) != -1) { $self->poderror({ -line => $line, -file => $file, -severity => 'WARNING', -msg => "nested commands $cmd<...$cmd<...>...>"}); Superficially I find that more ugly, but it actually is clearer what the code is doing, as [if I've understood it correctly] $cmd is always 1 letter. NYProf reports this line goes from 0.725s to 0.056s diff --git a/cpan/Pod-Parser/lib/Pod/Parser.pm b/cpan/Pod-Parser/lib/Pod/Parser.pm index c807f3f..7d54864 100644 --- a/cpan/Pod-Parser/lib/Pod/Parser.pm +++ b/cpan/Pod-Parser/lib/Pod/Parser.pm @@ -840,7 +840,7 @@ sub parse_text { $seq->append($expand_text ? &$xtext_sub($self,$_,$seq) : $_); } ## Keep track of line count - $line += s/\r*\n//; + $line += /\n/; ## Remember the "current" sequence $seq = $seq_stack[-1]; } ˘ $_ isn't used subsequently, so the substitution isn't necessary. That makes the need to match \r* redundant. 1.99s to 1.61s Does this code date from a time when the parser was different, chewing through the text, and the substitution was needed? @@ -1096,7 +1096,7 @@ sub parse_from_filehandle { ## See if this line is blank and ends the current paragraph. ## If it isnt, then keep iterating until it is. - next unless (($textline =~ /^([^\S\r\n]*)[\r\n]*$/) + next unless (($textline =~ /^[^\S\r\n]*[\r\n]*$/) && (length $paragraph)); ## Now process the paragraph The capture isn't needed. 4.73s to 4.43s I'm wondering if there are other unneeded captures, but didn't look, as there weren't any others in obviously hot places. The structure of parse_text is bugging me, as this code repeats work: time calls time/call code 0.153206 87640 0.000002 $_ = $text; 0.765981 87640 0.000009 my @tokens = split /([A-Z]<(?:<+\s)?)/; 0.111030 87640 0.000001 while ( @tokens ) { 0.407500 255096 0.000002 $_ = shift @tokens; 0.000000 0 0.000000 ## Look for the beginning of a sequence 1.999841 255096 0.000008 if ( /^([A-Z])(<(?:<+\s)?)$/ ) { and it's not cheap work. That last line alone is 2 seconds of repeated work, and over 1% of the *total* runtime. :-( I can't see a way to avoid that, without a substantial refactoring. Nicholas Clark

Subject:

Pod-Parser.patch

diff --git a/cpan/Pod-Parser/lib/Pod/Checker.pm b/cpan/Pod-Parser/lib/Pod/Checker.pm index a230542..f901802 100644 --- a/cpan/Pod-Parser/lib/Pod/Checker.pm +++ b/cpan/Pod-Parser/lib/Pod/Checker.pm @@ -1099,7 +1099,7 @@ sub _check_ptree { $text .= $self->_check_ptree($contents, $line, $file, "$nestlist$cmd"); next; } - if($nestlist =~ /$cmd/) { + if(index($nestlist, $cmd) != -1) { $self->poderror({ -line => $line, -file => $file, -severity => 'WARNING', -msg => "nested commands $cmd<...$cmd<...>...>"}); diff --git a/cpan/Pod-Parser/lib/Pod/Parser.pm b/cpan/Pod-Parser/lib/Pod/Parser.pm index c807f3f..7d54864 100644 --- a/cpan/Pod-Parser/lib/Pod/Parser.pm +++ b/cpan/Pod-Parser/lib/Pod/Parser.pm @@ -840,7 +840,7 @@ sub parse_text { $seq->append($expand_text ? &$xtext_sub($self,$_,$seq) : $_); } ## Keep track of line count - $line += s/\r*\n//; + $line += /\n/; ## Remember the "current" sequence $seq = $seq_stack[-1]; } @@ -1096,7 +1096,7 @@ sub parse_from_filehandle { ## See if this line is blank and ends the current paragraph. ## If it isnt, then keep iterating until it is. - next unless (($textline =~ /^([^\S\r\n]*)[\r\n]*$/) + next unless (($textline =~ /^[^\S\r\n]*[\r\n]*$/) && (length $paragraph)); ## Now process the paragraph

Sat Jan 21 14:44:47 2012 Marek.Rouchal [...] gmx.net - Taken

Sat Jan 21 14:48:11 2012 Marek.Rouchal [...] gmx.net - Correspondence added

Applied...thanks for the sleuthing. Will be released with Pod-Parser-1.50 soon. Regarding the last comment: I had to change the regexp like this: ## Look for the beginning of a sequence if ( /^([A-Z])(<(?:<+(?:\r?\n|[ \t]))?)$/ ) { Would it be faster if written like this: ## Look for the beginning of a sequence if ( /^([A-Z])(<(?:<+(?:\r?\n|[ \t])|))$/ ) { And no, I do not have more optimization ideas... and the future is Pod::Simple, anyway :-) -Marek

Sat Jan 21 14:48:12 2012 The RT System itself - Status changed from 'new' to 'open'

Thu Feb 02 10:17:50 2012 nick [...] ccl4.org - Correspondence added

Subject:	Re: [rt.cpan.org #71139] 3 possible optimisationss
Date:	Thu, 2 Feb 2012 15:17:41 +0000
To:	Marek Rouchal via RT <bug-Pod-Parser [...] rt.cpan.org>
From:	Nicholas Clark <nick [...] ccl4.org>

On Sat, Jan 21, 2012 at 02:48:12PM -0500, Marek Rouchal via RT wrote: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=71139 > > > Applied...thanks for the sleuthing. Will be released with > Pod-Parser-1.50 soon.

Thanks. Show quoted text

> Regarding the last comment: I had to change the regexp like this: > > ## Look for the beginning of a sequence > if ( /^([A-Z])(<(?:<+(?:\r?\n|[ \t]))?)$/ ) { > > Would it be faster if written like this: > > ## Look for the beginning of a sequence > if ( /^([A-Z])(<(?:<+(?:\r?\n|[ \t])|))$/ ) {

I tried benchmarking this change, and it doesn't seem to help enough to be properly measurable. Superficially the change is 0.03s faster: 247154 1.08s 247154 264ms if ( /^([A-Z])(<(?:<+(?:\r?\n|[ \t])|))$/ ) { # spent 264ms making 247154 calls to Pod::Parser::CORE:match, avg 1µs/call 247154 1.05s 247154 269ms if ( /^([A-Z])(<(?:<+(?:\r?\n|[ \t]))?)$/ ) { # spent 269ms making 247154 calls to Pod::Parser::CORE:match, avg 1µs/call but that's 1 run vs 1 run, and the noise is more than that. The other changes I found were at least 0.3s, ie 10 times better. I don't think that this change is worth making - I'm not even confident it is actually a speedup, and my opinion is that the original code is a bit easier to read and understand. ('?' for "zero or one of these" being clearer to my mind that '|' for "alternatively match nothing") Show quoted text

> And no, I do not have more optimization ideas... and the future is > Pod::Simple, anyway :-)

Yes, and I'm not sure what I can do to help the future arrive sooner. Nicholas Clark

Sun Jan 27 03:21:04 2013 Marek.Rouchal [...] gmx.net - Correspondence added

Implemented as suggested in Pod-Parser-1.51.

Sun Jan 27 03:21:05 2013 Marek.Rouchal [...] gmx.net - Status changed from 'open' to 'resolved'