Bug #124007 for Regexp-Grammars: Whitespace handling for \n with /m

Thu Jan 04 10:44:49 2018 se_misc [...] hotmail.com - Ticket created

Subject:	Whitespace handling for \n with /m
Date:	Thu, 4 Jan 2018 15:44:37 +0000
To:	"bug-Regexp-Grammars [...] rt.cpan.org" <bug-Regexp-Grammars [...] rt.cpan.org>
From:	Stefan Eichenberger <se_misc [...] hotmail.com>

Hi Damian,

Running the code below shows what is, IMHO, inconsistent handling of \n whitespace under the /m modifier. I initially raised the issue over at StackOverflow (https://stackoverflow.com/questions/48042738/regexpgrammars-handling-n/48084744?noredirect=1#comment83153394_48084744), but I believe the problem lies in the engine, not the user. Admittedly, I'm new to Regexp::Grammars, so I hesitate to rule out the user ...

Thanks for your help,
Stefan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # this code version reported to bug-Regexp-Grammars, 2018-01-04
    use Regexp::Grammars;

    my ($text, $parser, $i);
    $text = "line_1_1,line_1_2\nline_2_1,line_2_2";
    $i = 1;

    print "Example $i: 2nd line match contains \\n despite '.' not matching \\n with modifier /m\n";
    $parser = qr {
        <data>
        <rule: data>    <[line]>+
        <rule: line>    .+
    }xm;
    if ($text =~ $parser) { print "Matched $i"; } else { print "Not matched $i"; }
    print "\npause $i...\n\n"; $i++;

    print "Example $i: 2nd line match contains \\n despite explicit exclusion\n";
    $parser = qr {
        <data>
        <rule: data>    <[line]>+
        <rule: line>    [^\n]+
    }xm;
    if ($text =~ $parser) { print "Matched $i"; } else { print "Not matched $i"; }
    print "\npause $i...\n\n"; $i++;

    print "Example $i: separator \$ seems to consume \\n (using separator \\n also works)\n";
    $parser = qr {
        <data>
        <rule: data>    <[line]>+ % $    # Note: \n also works here
        <rule: line>    .+
    }xm;
    if ($text =~ $parser) { print "Matched $i"; } else { print "Not matched $i"; }
    print "\npause $i...\n\n"; $i++;

    print "Example $i: contexts of 'line' matches still contain \\n, but fields no longer; so here explicit exclusion of \\n in rule seems to work\n";
    $parser = qr {
        <data>
        <rule: data>    <[line]>+
        <rule: line>    <[field]>+ % ,
        <rule: field>   [^,\n]+
    }xm;
    if ($text =~ $parser) { print "Matched $i"; } else { print "Not matched $i"; }
    print "\npause $i...\n\n"; $i++;

    print "Example $i: returns 3 fields, where 2nd field contains \\n - probably due to greedy match of 'field'\n";
    $parser = qr {
        <data>
        <rule: data>    <[line]>+ % $
        <rule: line>    <[field]>+ % ,
        <rule: field>   [^,]+
    }xm;
    if ($text =~ $parser) { print "Matched $i"; } else { print "Not matched $i"; }
    print "\npause $i...\n\n"; $i++;

Thu Jan 04 10:51:57 2018 se_misc [...] hotmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #124007]
Date:	Thu, 4 Jan 2018 15:51:24 +0000
To:	"bug-Regexp-Grammars [...] rt.cpan.org" <bug-Regexp-Grammars [...] rt.cpan.org>
From:	Stefan Eichenberger <se_misc [...] hotmail.com>

Sorry, I forgot to mention the environment: Regexp::Grammars 1.048 on Strawberry Perl 5.26.1 on Windows 7.

Thu Jan 04 14:13:01 2018 DCONWAY [...] cpan.org - Status changed from 'new' to 'rejected'

Thu Jan 04 14:13:15 2018 damian [...] conway.org - Correspondence added

Subject:	Re: [rt.cpan.org #124007] Whitespace handling for \n with /m
Date:	Fri, 5 Jan 2018 06:12:20 +1100
To:	bug-Regexp-Grammars [...] rt.cpan.org
From:	Damian Conway <damian [...] conway.org>

Hi Stefan,

I'm afraid the bug is in the user in this case. ;-)

A rule with whitespace within it matches any whitespace (including newlines) in the input at that point. So a rule like:

    <rule: line> .+

is really equivalent to:

    <rule: line><.ws>.+

meaning: match-but-don't-capture any leading whitespace, then match any-characters-except-newline.

And it's the implicit call to <.ws> that "eats" the newlines preceding each line, which is why the first two examples match.

If you want whitespace inside the rule to be ignored (as you seem to want here), then you need to declare the rule as a token instead. Tokens don't have the magical "whitespace-matches-whitespace" behaviour of rules. Hence you would write:

    <token: line> .+

in which case you will also need to explicitly consume the newlines separating each line, with something like:

    <rule: data> <[line]>+ % \n

or perhaps:

    <rule: data> <[line]>+ % \n+

if you want to allow multiple newlines between lines.

Hope this helps. If not, feel free to ask for further clarification.

Damian
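[Editorial illustration, not part of the original correspondence.] The rule-versus-token distinction described in the reply above can be imitated with core Perl regexes alone. This is only a rough sketch, not Regexp::Grammars itself: the `\s*` prefix is an assumed stand-in for the implicit `<.ws>` call, and the names `@rule_like` and `@token_like` are hypothetical.

```perl
use strict;
use warnings;

my $text = "line_1_1,line_1_2\nline_2_1,line_2_2";

# Rule-like: an optional whitespace skip (\s*) precedes the line body,
# standing in for the implicit <.ws>. Without /s, '.' never matches "\n",
# but \s* does, so the newline lands at the front of the second match.
my @rule_like = $text =~ /(\s*.+)/g;

# Token-like: no whitespace skip; the newline separator is consumed
# explicitly, outside the capture.
my @token_like = $text =~ /(.+)(?:\n|\z)/g;

for my $set ([rule_like => \@rule_like], [token_like => \@token_like]) {
    my ($name, $lines) = @$set;
    # Render newlines visibly as the two characters \n
    print "$name: ", join(' | ', map { s/\n/\\n/gr } @$lines), "\n";
}
# prints:
#   rule_like: line_1_1,line_1_2 | \nline_2_1,line_2_2
#   token_like: line_1_1,line_1_2 | line_2_1,line_2_2
```

In the rule-like pattern the second match begins with the newline, much as the reporter observed for the second `line`. The token-like pattern keeps the newline out of the match because the separator is handled explicitly.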

Fri Jan 05 13:21:19 2018 se_misc [...] hotmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #124007] Whitespace handling for \n with /m
Date:	Fri, 5 Jan 2018 18:21:08 +0000
To:	"bug-Regexp-Grammars [...] rt.cpan.org" <bug-Regexp-Grammars [...] rt.cpan.org>
From:	Stefan Eichenberger <se_misc [...] hotmail.com>

Hi Damian,

Happy to be at fault here ;-) - your explanation is perfectly clear and makes sense. I'll update StackOverflow accordingly, to avoid confusion.

You may consider updating the perldoc in chapter 'Tokens vs. rules': the earlier example of a LaTeX matcher makes liberal use of rules with <.ws> inference; since LaTeX is not line oriented, that probably works, but such domain-specific knowledge should not be implied. I've read 'Tokens vs. rules' multiple times, but didn't pick up on the critical notion that

    <rule: line> .+   ==>   <rule: line><.ws>.+

Thanks again for your kind help - which makes my basic learning exercise over the Christmas vacation a success then :-)

Stefan