Bug #128741 for Regexp-Grammars: Extra space added to array elements in a rule

Thu Mar 07 11:41:14 2019 HAKONH [...] cpan.org - Ticket created

Subject:

Extra space added to array elements in a rule

Thanks for this very useful module! I just want to make you aware of a minor issue I had (it looks like a bug to me). Consider the following code:

    use strict;
    use warnings;
    use Regexp::Grammars;

    my $parser = qr{
        <[item]>+
        <rule: item>    \w+
    }x;

    my $text = 'itemA itemB itemC';
    if ($text =~ $parser) {
        print "'$_'\n" for (@{ $/{item} });
    }

The output is:

    'itemA'
    ' itemB'
    ' itemC'

Notice the space in front of the second and third item. Expected output (since \w does not match a space):

    'itemA'
    'itemB'
    'itemC'

or the expected output should simply be none/empty (i.e.: parse failed), since I did not explicitly specify the delimiter space. For example, changing the parser to:

    my $parser = qr{
        <[item]>+ % <.ws>
        <rule: item>    \w+
    }x;

gives the expected output above.

Have a nice day.

Best regards,
Håkon Hægland

Thu Mar 07 18:08:43 2019 damian [...] conway.org - Correspondence added

Subject:	Re: [rt.cpan.org #128741] Extra space added to array elements in a rule
Date:	Thu, 7 Mar 2019 23:08:03 +0000
To:	bug-Regexp-Grammars [...] rt.cpan.org
From:	Damian Conway <damian [...] conway.org>

Hi Håkon,

Thanks for the report.

However, this is not a bug; it is the expected and documented behaviour, as described in "Tokens vs rules (whitespace handling)" in the module's documentation.

Defining <item> as a <rule:....> means that any leading whitespace (such as the whitespace before the \w+) matches leading whitespace in the input. Which is exactly what you're seeing.

In other words (as the documentation explains):

    <rule: item> \w+

is equivalent to:

    <token: item> <.ws> \w+

You observed:

> Notice the space in front of the second and third item. Expected
> output (since \w does not match a space):

Correct, but the whitespace before the \w DOES (implicitly) match a space because <item> is defined as a rule, not a token.

> or the expected output should simply be none/empty (i.e.: parse
> failed), since I did not explicitly specify the delimiter space.

However, you IMPLICITLY specified the delimiter space, by making <item> a rule. Which is why it succeeds.

> example, changing the parser to:
>
>     my $parser = qr{
>         <[item]>+ % <.ws>
>         <rule: item>    \w+
>     }x;
>
> gives the expected output above.

And that's why I would recommend writing the grammar as follows:

    my $parser = qr{
        <[item]>+ % <.ws>
        <token: item>    \w+
    }x;

As a general principle: make each named component of a grammar a <token:...>, unless you explicitly need to match intervening whitespace before or inside that component, in which case make it a <rule:...>

And, if you do make a component a rule, then you must expect it to match (and return) that intervening whitespace as well.

Hope this helps,

Damian
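To see why the space ends up in the captured text, the rule/token equivalence described above can be emulated with plain Perl regexes. This is only a rough sketch and not how Regexp::Grammars is implemented: it assumes the default <.ws> behaves like \s*, and the variable names @rule_like and @token_like are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'itemA itemB itemC';

# Rule-like item: the optional whitespace is matched *inside* the capture,
# roughly mimicking  <token: item> <.ws> \w+  (\s* stands in for <.ws>).
my @rule_like = $text =~ / \G ( \s* \w+ ) /gx;

# Token-like item: only \w+ is captured; the separating whitespace is
# consumed outside the capture, roughly mimicking  <[item]>+ % <.ws>
# combined with  <token: item> \w+.
my @token_like = $text =~ / \G \s* ( \w+ ) /gx;

print "rule-like:  '$_'\n" for @rule_like;
print "token-like: '$_'\n" for @token_like;
```

In the rule-like variant the second and third captures begin with a space, matching the output reported in the ticket; in the token-like variant every capture is a bare word.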

Thu Mar 07 18:08:44 2019 The RT System itself - Status changed from 'new' to 'open'

Thu Mar 07 18:08:45 2019 DCONWAY [...] cpan.org - Status changed from 'open' to 'rejected'

Fri Mar 08 11:00:11 2019 hakon.hagland [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #128741] Extra space added to array elements in a rule
Date:	Fri, 8 Mar 2019 16:59:47 +0100
To:	bug-Regexp-Grammars [...] rt.cpan.org
From:	Håkon Hægland <hakon.hagland [...] gmail.com>

Hi Damian,

Thanks for the clear explanation! I see from your answer that I was quite confused about the fine distinction between token and rule. Anyway, this module is awesome!

Have a great weekend,

Best regards
Håkon