An attempt to have RecDescent process a file in chunks apparently
interferes with the expected behavior of a repeated rule. I'd be
grateful for more-expert opinions on what is going on here.
Motivation
----------
A particular parser built with RecDescent runs slowly when processing a
2MB input file, typically requiring 1.5 hours. Processing the same
amount of data with the same parser is 18 times faster, requiring only 5
minutes, when the input file is broken into ~10 pieces. The suggestions
in the FAQ do not apply
(
http://search.cpan.org/~tbone/Parse-RecDescent-FAQ/FAQ.pm#Parse::RecDes
cent_is_slow_on_Really_Big_Files._How_can_I_speed_it_up?).
So I modified a repeated rule in the the parser to read successive
chunks of the input file each time the rule is tried, updating $text
each time. But this apparently causes the parser to try the rule only
once, rather than repeatedly until it fails to match.
Demonstrating the unexpected behavior
-------------------------------------
The real-world program has been boiled down to a demonstration program
that attempts to process a "report" consisting of an arbitrary number of
"account" records. The demonstration program reads from a DATA section,
rather than an external file. Four slightly different versions this
demonstration program are attached to this email, along with the output
they produce on stdout and traces from stderr, as produced commands like
perl stopsTooSoon >stopsTooSoon_stdout.txt
Show quoted text2>stopsTooSoon_stderr.txt
The file stopsTooSoon shows the program as I hoped to write it, but
exhibits the unexpected behavior. The "report" rule asks that the
"account" rule be greedily matched an arbitrary number of times:
report :
account(s)
...
The "account" production starts with an action that reads the next line
of input into $text, returning undef after the last one. I expected
this "account" rule to be matched twice, once for each line of data.
But the output in stopsTooSoon_stdout.txt shows only a single line
output by the print statement. Consistent with the single output line,
the trace from stopsTooSoon_stderr.txt shows that, after a single
"Matched rule" trace for the "account" rule, the parser returns to
processing the "report" rule. Instead, I expected the trace to show a
second (and successful) attempt at "Trying rule: [account]" before
returning to the "report" rule.
In an attempt to find a small change that would produced the expected
behavior, I produced three slightly-modified programs, each of which
works as expected, although with accompanying disadvantages for the
programmer. The programs and their outputs are attached. Descriptions
of these modified programs follow.
Fixed Repetition factor gives the expected behavior
---------------------------------------------------
A slightly different program works as expected, even though it includes
the $text-reading action. The only significant difference in
fixedReps_prints2, is the change of the variable repetition to an exact
number:
report :
account(2)
...
The corresponding stdout shows the expected 2 lines from the print
statement, and the stderr trace shows the expected 2 attempts at
Trying rule: [account]
For the real-world program, using an exact number of repetitions
requires counting the number of accounts and using that account in the
grammar specification. This is workaround I am currently employing.
Reading all the input at once gives the expected behavior
---------------------------------------------------------
For most of our use of RecDescent, we read all the input into a single
variable. This is done in singleRead_prints2, whose stdout output shows
the expected 2 lines, and whose stderr trace shows multiple attempts at
Trying rule: [account]
In the real-world, this is acceptable for small input files, but is slow
for larger inputs, as described above under "Motivation".
Using the default value of <skip> gives the expected behavior
-------------------------------------------------------------
Just in case it helps, I include another change that gives the expected
behavior, which I stumbled across in boiling down the original program:
Although the original grammar used the <skip> directive so that the
grammar could match ends-of-lines, reverting to default <skip> behavior
produces the expected output on stdout and stderr.
I haven't tried this in the real-world program, and I don't have a
conjecture about whether this modification would robustly work around
the issue. In addition, eschewing matching end-of-lines would change
the language that the parser accepts.
Environment
-----------
All of the attachments come from Windows XP and the following software:
Show quoted text> perl -v
This is perl, v5.8.4 built for MSWin32-x86-multi-thread
(with 3 registered patches, see perl -V for more detail)
Copyright 1987-2004, Larry Wall
Binary build 810 provided by ActiveState Corp.
http://www.ActiveState.com ActiveState is a division of Sophos. Built
Jun 1 2004 11:52:21 ...
Show quoted text> ppm query RecDescent
Querying target 1 (ActivePerl 5.8.4.810)
1. Parse-RecDescent [1.94] Generate Recursive-Descent Parsers
The behavior is the same under on two unix systems, where "uname -a" and
"perl-v" show
Linux amarld16.acml.com 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14
EDT 2005 i686 athlon i386 GNU/Linux
This is perl, v5.8.5 built for i386-linux-thread-multi
OSF1 dec050.beehive.com V4.0 1530 alpha
This is perl, v5.8.0 built for alpha-dec_osf
Desideratum
-----------
By comparing the RD_TRACE files, I started to guess that the different
behavior arises in _parserepeat. At that point I abandoned digging and
started this plea for help.
What I'd really like is advice on how to make a version of stopsTooSoon
that tries the "account" rule an arbitrary number of times, reads a
chunk of the input on each try, and allows the grammar to recognize
end-of-lines.
Thanks for your time and any help.
David Holmes
David.Holmes@AllianceBernstein.com 212-823-264
AllianceBernstein L.P. 1345 Avenue of the Americas, New York NY 10105
-----------------------------------------
The information contained in this transmission may be privileged and
confidential and is intended only for the use of the person(s) named
above. If you are not the intended recipient, or an employee or agent responsible
for delivering this message to the intended recipient, any review, dissemination,
distribution or duplication of this communication is strictly prohibited. If you are
not the intended recipient, please contact the sender immediately by reply e-mail
and destroy all copies of the original message. Please note that we do not accept
account orders and/or instructions by e-mail, and therefore will not be responsible
for carrying out such orders and/or instructions. If you, as the intended recipient
of this message, the purpose of which is to inform and update our clients, prospects
and consultants of developments relating to our services and products, would not
like to receive further e-mail correspondence from the sender, please "reply" to the
sender indicating your wishes. In the U.S.: 1345 Avenue of the Americas, New York,
NY 10105.