Subject: | Repetition operator doesn't conform to defined whitespace behaviour |
Hi Damian,
It seems that the repetition operator '**' doesn't obey the same
whitespace rules as the rest of Regexp::Grammars. Instead, the
whitespace surrounding it is always interpreted as though it were
inside a rule.
e.g.
qr/
<val>**<sep> # no whitespace - so none is matched
<val> ** <sep> # whitespace is treated as <ws> rather than ignored
/x
I believe that, in Perl 5 generally, the whitespace behaviour of rules
is the exception, not the norm. That being the case, the whitespace
around the repetition operator should only be treated as <ws> when it
is used inside a rule. Outside of a rule, the surrounding whitespace
should follow the convention of the regular expression in which it is
used, i.e. treated literally, or ignored if /x is used.
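As a minimal illustration of that convention in plain Perl 5 (no
Regexp::Grammars involved; the sample text and variable names are made
up for this sketch), whitespace in a pattern is literal without /x and
ignored with it:

```perl
use strict;
use warnings;

my $sample = 'a,b';

# Without /x, the spaces in the pattern are literal, so the text would
# have to contain "a , b" for this to match.
my $literal = ($sample =~ /\A\w , \w\z/) ? 1 : 0;

# With /x, unescaped whitespace in the pattern is ignored.
my $extended = ($sample =~ /\A \w , \w \z/x) ? 1 : 0;

print "literal=$literal extended=$extended\n";   # literal=0 extended=1
```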
This issue presented itself when attempting to use a fixed-length run
of whitespace as the separator for a repetition. Even with the
repetition operator wrapped in a token (a desperate measure), debugging
showed that the regular expression engine was backtracking
unnecessarily and matching more whitespace than the specified length.
Further investigation led to the aforementioned issue - a distilled
test case is attached.
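For reference, the outcomes I would expect can be sketched with a
hand-expanded separated list in plain Perl 5, without Regexp::Grammars.
The variable names here are illustrative only, and this shows the
expected results rather than how the module implements '**':

```perl
use strict;
use warnings;

# Same target as the attached test case: 'a', five spaces, 'z'.
my $text = 'a' . (' ' x 5) . 'z';

# Hand-expanded separated lists; under /x the pattern whitespace is ignored,
# so only the explicit \s{N} separator can consume whitespace in the text.
my $sep5 = qr/\A \w+ (?: \s{5} \w+ )* \z/x;
my $sep3 = qr/\A \w+ (?: \s{3} \w+ )* \z/x;

my $matched5 = ($text =~ $sep5) ? 1 : 0;   # separator of exactly 5 whitespace chars
my $matched3 = ($text =~ $sep3) ? 1 : 0;   # separator of exactly 3 whitespace chars

print "sep5=$matched5 sep3=$matched3\n";   # sep5=1 sep3=0
```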
Many thanks,
Andrew Whatson
Subject: | test_whitespace_seperator.pl |
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use Data::Dumper;
use Regexp::Grammars;
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Terse = 1;
# The text to match against
my $text = 'a' . (' ' x 5) . 'z';
# This should match without backtracking
my $broken1 = qr/
<logfile: - >
<debug: on>
\A<TOP>\Z
<token: TOP> <[val]> ** <sep>
<token: sep> \s{5}
<token: val> \w+
/x;
# This should NOT match
#
# NB: There are 5 spaces in the target, but we're matching for a list
# separated by 3 whitespace characters
#
my $broken2 = qr/
<logfile: - >
<debug: on>
\A<TOP>\Z
<token: TOP> <[val]> ** <sep>
<token: sep> \s{3}
<token: val> \w+
/x;
# This demonstrates the expected behaviour of $broken1
my $correct1 = qr/
<logfile: - >
<debug: on>
\A<TOP>\Z
<token: TOP> <[val]> (?: <sep> <[val]> )*
<token: sep> \s{5}
<token: val> \w+
/x;
# This demonstrates the expected behaviour of $broken2
my $correct2 = qr/
<logfile: - >
<debug: on>
\A<TOP>\Z
<token: TOP> <[val]> (?: <sep> <[val]> )*
<token: sep> \s{3}
<token: val> \w+
/x;
# Run each case in turn and report whether it matched
for my $case (
    [ '$broken1'  => $broken1  ],
    [ '$broken2'  => $broken2  ],
    [ '$correct1' => $correct1 ],
    [ '$correct2' => $correct2 ],
) {
    my ($name, $regex) = @{$case};
    if ($text =~ $regex) { print "$name matched: " . Dumper(\%/) }
    else                 { say   "$name did not match" }
}