Skip Menu |

This queue is for tickets about the Parse-RecDescent CPAN distribution.

Report information
The Basics
Id: 21765
Status: open
Priority: 0/
Queue: Parse-RecDescent

People
Owner: Nobody in particular
Requestors: david.holmes [...] alliancebernstein.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: chunks vs. repeated rules in RecDescent
Date: Thu, 28 Sep 2006 13:24:23 -0400
To: <bug-Parse-RecDescent [...] rt.cpan.org>, <damian [...] conway.org>
From: "Holmes, David M" <david.holmes [...] alliancebernstein.com>
An attempt to have RecDescent process a file in chunks apparently interferes with the expected behavior of a repeated rule. I'd be grateful for more-expert opinions on what is going on here. Motivation ---------- A particular parser built with RecDescent runs slowly when processing a 2MB input file, typically requiring 1.5 hours. Processing the same amount of data with the same parser is 18 times faster, requiring only 5 minutes, when the input file is broken into ~10 pieces. The suggestions in the FAQ do not apply (http://search.cpan.org/~tbone/Parse-RecDescent-FAQ/FAQ.pm#Parse::RecDes cent_is_slow_on_Really_Big_Files._How_can_I_speed_it_up?). So I modified a repeated rule in the the parser to read successive chunks of the input file each time the rule is tried, updating $text each time. But this apparently causes the parser to try the rule only once, rather than repeatedly until it fails to match. Demonstrating the unexpected behavior ------------------------------------- The real-world program has been boiled down to a demonstration program that attempts to process a "report" consisting of an arbitrary number of "account" records. The demonstration program reads from a DATA section, rather than an external file. Four slightly different versions this demonstration program are attached to this email, along with the output they produce on stdout and traces from stderr, as produced commands like perl stopsTooSoon >stopsTooSoon_stdout.txt Show quoted text
2>stopsTooSoon_stderr.txt
The file stopsTooSoon shows the program as I hoped to write it, but exhibits the unexpected behavior. The "report" rule asks that the "account" rule be greedily matched an arbitrary number of times: report : account(s) ... The "account" production starts with an action that reads the next line of input into $text, returning undef after the last one. I expected this "account" rule to be matched twice, once for each line of data. But the output in stopsTooSoon_stdout.txt shows only a single line output by the print statement. Consistent with the single output line, the trace from stopsTooSoon_stderr.txt shows that, after a single "Matched rule" trace for the "account" rule, the parser returns to processing the "report" rule. Instead, I expected the trace to show a second (and successful) attempt at "Trying rule: [account]" before returning to the "report" rule. In an attempt to find a small change that would produced the expected behavior, I produced three slightly-modified programs, each of which works as expected, although with accompanying disadvantages for the programmer. The programs and their outputs are attached. Descriptions of these modified programs follow. Fixed Repetition factor gives the expected behavior --------------------------------------------------- A slightly different program works as expected, even though it includes the $text-reading action. The only significant difference in fixedReps_prints2, is the change of the variable repetition to an exact number: report : account(2) ... The corresponding stdout shows the expected 2 lines from the print statement, and the stderr trace shows the expected 2 attempts at Trying rule: [account] For the real-world program, using an exact number of repetitions requires counting the number of accounts and using that account in the grammar specification. This is workaround I am currently employing. Reading all the input at once gives the expected behavior --------------------------------------------------------- For most of our use of RecDescent, we read all the input into a single variable. This is done in singleRead_prints2, whose stdout output shows the expected 2 lines, and whose stderr trace shows multiple attempts at Trying rule: [account] In the real-world, this is acceptable for small input files, but is slow for larger inputs, as described above under "Motivation". Using the default value of <skip> gives the expected behavior ------------------------------------------------------------- Just in case it helps, I include another change that gives the expected behavior, which I stumbled across in boiling down the original program: Although the original grammar used the <skip> directive so that the grammar could match ends-of-lines, reverting to default <skip> behavior produces the expected output on stdout and stderr. I haven't tried this in the real-world program, and I don't have a conjecture about whether this modification would robustly work around the issue. In addition, eschewing matching end-of-lines would change the language that the parser accepts. Environment ----------- All of the attachments come from Windows XP and the following software: Show quoted text
> perl -v
This is perl, v5.8.4 built for MSWin32-x86-multi-thread (with 3 registered patches, see perl -V for more detail) Copyright 1987-2004, Larry Wall Binary build 810 provided by ActiveState Corp. http://www.ActiveState.com ActiveState is a division of Sophos. Built Jun 1 2004 11:52:21 ... Show quoted text
> ppm query RecDescent
Querying target 1 (ActivePerl 5.8.4.810) 1. Parse-RecDescent [1.94] Generate Recursive-Descent Parsers The behavior is the same under on two unix systems, where "uname -a" and "perl-v" show Linux amarld16.acml.com 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14 EDT 2005 i686 athlon i386 GNU/Linux This is perl, v5.8.5 built for i386-linux-thread-multi OSF1 dec050.beehive.com V4.0 1530 alpha This is perl, v5.8.0 built for alpha-dec_osf Desideratum ----------- By comparing the RD_TRACE files, I started to guess that the different behavior arises in _parserepeat. At that point I abandoned digging and started this plea for help. What I'd really like is advice on how to make a version of stopsTooSoon that tries the "account" rule an arbitrary number of times, reads a chunk of the input on each try, and allows the grammar to recognize end-of-lines. Thanks for your time and any help. David Holmes David.Holmes@AllianceBernstein.com 212-823-264 AllianceBernstein L.P. 1345 Avenue of the Americas, New York NY 10105 ----------------------------------------- The information contained in this transmission may be privileged and confidential and is intended only for the use of the person(s) named above. If you are not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, any review, dissemination, distribution or duplication of this communication is strictly prohibited. If you are not the intended recipient, please contact the sender immediately by reply e-mail and destroy all copies of the original message. Please note that we do not accept account orders and/or instructions by e-mail, and therefore will not be responsible for carrying out such orders and/or instructions. If you, as the intended recipient of this message, the purpose of which is to inform and update our clients, prospects and consultants of developments relating to our services and products, would not like to receive further e-mail correspondence from the sender, please "reply" to the sender indicating your wishes. In the U.S.: 1345 Avenue of the Americas, New York, NY 10105.
Download stopsTooSoon
application/octet-stream 1.3k

Message body not shown because it is not plain text.

Message body is not shown because sender requested not to inline it.

Message body is not shown because sender requested not to inline it.

Download fixedReps_prints2
application/octet-stream 1.3k

Message body not shown because it is not plain text.

Message body is not shown because sender requested not to inline it.

Message body is not shown because sender requested not to inline it.

Download singleRead_prints2
application/octet-stream 1k

Message body not shown because it is not plain text.

Message body is not shown because sender requested not to inline it.

Message body is not shown because sender requested not to inline it.

Download defaultSkip_prints2
application/octet-stream 1.2k

Message body not shown because it is not plain text.

Message body is not shown because sender requested not to inline it.

Message body is not shown because sender requested not to inline it.

From: mhhollomon [...] gmail.com
On Thu Sep 28 13:24:58 2006, david.holmes@alliancebernstein.com wrote: Show quoted text
> An attempt to have RecDescent process a file in chunks apparently > interferes with the expected behavior of a repeated rule. I'd be > grateful for more-expert opinions on what is going on here. > > [ awesome report truncated]
More ancient history ... but this one was interesting ... The problem is the following line in _parserepeat: last if ++$reps >= $min and $prevtextlen == length; The purpose of this line is to keep a repeated empty production from spinning a long time. _parserepeat tries to detect this by checking if any text was consumed. But for the example grammar, this check isn't good enough. Consider the first time through the grammar. $text starts out as "" per the script. That is the $text that _parserepeat takes the length of for $prevtextlen. Now _parserepeat calls the parser for the account rule. the account rule reads in the first line, successfully parsing it and leaving $text as -- you guessed it -- "". One workaround would be to add a changing amount of spaces to the end of $text e.g. make your action something like... { # get data for next account if( $text = <main::DATA>) { $text .= ' ' x ((++$mycounter % 10) + 1); # <=== magic for _parserepeat Parse::RecDescent::LineCounter::resync($thisline); } else{ undef;} # fail this production } Since skip is processed just before token recognition, these spaces won't be stripped until just before the recognition of the literal string. I have attached a patch that co-opts resync in an attempt to give _parserepeat a little more information about the possiblity of changes to $text. I'm conflicted about this since it is only a drop in the bucket for what needs to be done to truly handle "chunked" input. So, I'm not sure if this provides a useful feature or simply gives people another way to shoot themselves in the foot.
Subject: limited_chunk_processing_support.patch
*** RecDescent.pm.new 2010-11-12 09:54:42.613262400 -0500 --- RecDescent.pm 2010-11-12 10:03:03.286627900 -0500 *************** sub resync # ($linecounter) *** 138,143 **** --- 138,144 ---- - Parse::RecDescent::_linecount(${$self->{text}}) + 1; + $parser->{'resynccount'} += 1; $parser->{offsetlinenum} += $parser->{lastlinenum} - $apparently; return 1; } *************** sub AUTOLOAD # ($parser, $text; $line *** 2861,2866 **** --- 2862,2868 ---- $_[0]->{offsetlinenum} = $_[0]->{lastlinenum}; $_[0]->{fulltext} = $text; $_[0]->{fulltextlen} = length $text; + $_[0]->{resynccount} = 0; $_[0]->{deferred} = []; $_[0]->{errors} = []; my @args = @_[3..$#_]; *************** sub _parserepeat($$$$$$$$$$) # RETURN *** 2903,2908 **** --- 2905,2911 ---- $_[6]->at($text); # $_[6] IS $expectation FROM CALLER my $_savetext = $text; my $prevtextlen = length $text; + my $resynccount = $parser->{'resynccount'}; my $_tok; if (! defined ($_tok = &$prod($parser,$text,1,$_noactions,$argcode))) { *************** sub _parserepeat($$$$$$$$$$) # RETURN *** 2910,2916 **** last; } push @tokens, $_tok if defined $_tok; ! last if ++$reps >= $min and $prevtextlen == length $text; } do { $_[6]->failed(); return undef} if $reps<$min; --- 2913,2919 ---- last; } push @tokens, $_tok if defined $_tok; ! last if ++$reps >= $min and $prevtextlen == length $text and $resynccount == $parser->{'resynccount'}; } do { $_[6]->failed(); return undef} if $reps<$min;