From: Tels <nospam-abuse [...] bloodgate.com>
To: bug-PPI [...] rt.cpan.org
Subject: [PATCH] Speed up tokenizer char-by-char
Date: Sat, 7 Jan 2006 13:24:36 +0100
-----BEGIN PGP SIGNED MESSAGE-----
Moin,
profiling PPI showed that for lines that are not recognized completely,
the line is processed char-by-char.
Unfortunately, this happened in an empty while loop by calling a
subroutine for each character. :)
The attached patch moves the loop inside the subroutine, allowing us to
bypass the calls, the empty while body as well as the repeated checks for
the valid cursor pos. I also eliminated duplicate code inside the loop.
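
A minimal sketch of that restructuring, for illustration only. This is not the attached patch: the hash fields (line, cursor, seen) and both subroutine names are hypothetical stand-ins and do not come from PPI.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the tokenizer state; not PPI's real structure.
my %calls = (before => 0, after => 0);

# Before: one subroutine call per character, with the cursor position
# re-checked on every call.
sub next_char_before {
    my ($t) = @_;
    $calls{before}++;
    return 0 if $t->{cursor} >= length($t->{line});
    $t->{seen} .= substr($t->{line}, $t->{cursor}++, 1);
    return 1;
}

# After: the loop lives inside the subroutine, so there is one call per
# line and the line length is computed only once.
sub all_chars_after {
    my ($t) = @_;
    $calls{after}++;
    my $len = length($t->{line});
    while ($t->{cursor} < $len) {
        $t->{seen} .= substr($t->{line}, $t->{cursor}++, 1);
    }
    return 1;
}

my $line = 'my $x = 42;';

my $old = { line => $line, cursor => 0, seen => '' };
while (next_char_before($old)) { }    # the empty while body

my $new = { line => $line, cursor => 0, seen => '' };
all_chars_after($new);

print "before: $calls{before} calls, after: $calls{after} call\n";
```

Both variants consume the same characters; the second only saves the per-character call overhead, which is why the gain depends on how much of the input falls through to char-by-char processing.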
The patch also fixes a bug as a side effect: the process_next_char()
routine did not localize $_. I have not attempted to add a test for that,
though.
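
To illustrate why a missing localization of $_ matters, here is a small sketch. The two subroutines are hypothetical and only mimic the pattern; they are not PPI's process_next_char().

```perl
use strict;
use warnings;

# Assigns to the global $_ and so clobbers whatever the caller had in it.
sub char_unlocalized {
    $_ = substr($_[0], 0, 1);
    return length $_;
}

# local saves the caller's $_ and restores it when the sub returns.
sub char_localized {
    local $_ = substr($_[0], 0, 1);
    return length $_;
}

$_ = 'caller data';
char_unlocalized('xyz');
my $after_unlocalized = $_;    # the caller's value has been overwritten

$_ = 'caller data';
char_localized('xyz');
my $after_localized = $_;      # the caller's value is intact

print "unlocalized: $after_unlocalized\n";
print "localized:   $after_localized\n";
```

The first call leaves 'x' in the caller's $_, the second leaves 'caller data' untouched.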
The speedup is a few percent and depends heavily on how many lines need
to be processed char-by-char and how long they are. Example: parsing
Graph::Easy.pm 5 times (to avoid start-up overhead skewing the results;
the results are still skewed by the DESTROY, i.e. the parsing is sped up
more than shown here):
Lowest from three runs:
te@linux:~/perl/PPI> time perl d.pl
real 0m5.376s
user 0m5.288s
sys 0m0.066s
te@linux:~/perl/PPI> time perl -IPPI-1.109.e/lib/ d.pl
real 0m5.181s
user 0m5.110s
sys 0m0.054s
On this particular data, PPI is now about 3..4% faster. All tests still
pass. Also attached are two profile runs.
The .pm file has 2489 lines; the test parses 12490 lines in 5.18 seconds,
so PPI parses about 2400 lines/s on my 2.0 GHz AMD Athlon. Not
bad :)
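
For reference, the percentages and throughput can be recomputed from the timings quoted in this message. The snippet uses only those figures; the variable names are illustrative.

```perl
use strict;
use warnings;

# "real" times from the two runs quoted in this message, in seconds.
my ($before, $after) = (5.376, 5.181);
my $speedup_pct = 100 * ($before - $after) / $before;

# Lines parsed and elapsed time as stated in this message.
my $lines_per_s = 12490 / 5.18;

printf "speedup:    %.1f%%\n", $speedup_pct;         # speedup:    3.6%
printf "throughput: %.0f lines/s\n", $lines_per_s;   # throughput: 2411 lines/s
```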
Further ideas are to:
* recognize more things entirely, so char-by-char overhead is reduced
* use fewer subroutines (to concentrate code hot spots)
* find out what calls __ANON__ (which smells like something is triggering
an overload, needlessly)
Hope you like this work,
Tels
- --
Signed on Sat Jan 7 12:51:00 2006 with key 0x93B84C15.
Visit my photo gallery at http://bloodgate.com/photos/
PGP key on http://bloodgate.com/tels.asc or per email.
If you are bald, and comb some of your hair over the bald spot, you are
violating US Patent #4,022,227: <http://tinyurl.com/6qxl7>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
iQEVAwUBQ7+zBHcLPEOTuEwVAQHFrQf+Oxf3qwPF+OPi8ZMgPSa+h2oTCbS11B2Y
IivSO3SuOp3okljWi8eEmLEdJa1tVYIw+kcXp+7/TUhS8XOKhy1LPHUAV6fKUbHE
MtXu+EJ5/zYk3Xh2GwWRuK7IG7KiggnoteuonjGwVW2Ry5mMn+9wxAoN9bjRo/cf
Jo6JKVdKss/Asq2yFL4p66YiK6FxPcohq8EhEkBUFYNoGaBzMxemNDDcha5Zhg/s
pbc1u2tJjtxJU/tyR0T112i4Ay+H7gyuag3ah4j97Ltjasd7qPOMEiZ3TvTmws8A
U1dDiH9tQzPRIbIAJAi9py2yEf4dBALn1bp13OppbYlm+vErUYwxgw==
=LRYP
-----END PGP SIGNATURE-----
Message body is not shown because sender requested not to inline it.