Bug #122824 for PPR: Recognition failures: UTF-8 BOM, /n regex modifier, ${!}

Wed Aug 16 16:34:40 2017 unobe [...] cpan.org - Ticket created

Subject:

Recognition failures: UTF-8 BOM, /n regex modifier, ${!}

Thank you so much for creating this module! I hope to make good use of it. I've noticed a few things while testing it against some code.

If a UTF-8 BOM is present, PPR reports a false failure. Wikipedia, citing Unicode, states that a BOM should only appear at the beginning of a text stream, and that if it appears elsewhere it should be regarded as a zero-width non-breaking space. I've attached a patch for the common and proper case (beginning of text stream) because that scratches my itch. Long-term, would it be better to modify many of the instances of \s in the regexes to \p{Whitespace}?

If a /n regex modifier is present, that also causes a recognition failure. I've included a fix for that in the patch.

Also, some built-in Perl variables weren't being recognized when written in a non-normative, but valid, format (e.g., ${!}), so I've included a fix for that as well.

Subject:

bom_slash_n_dollar_bang.patch

--- a/lib/perl5/site_perl/5.22.1/PPR.pm
+++ b/lib/perl5/site_perl/5.22.1/PPR.pm
@@ -62,7 +62,7 @@ use utf8;
 our $GRAMMAR = qr{
 (?(DEFINE)
 (?<PerlDocument>
- (?>(?&PerlOWS))
+ (\x{feff})?+ (?>(?&PerlOWS))
 (?: (?>(?&PerlStatement)) (?&PerlOWS) )*+
 ) # End of rule
@@ -820,6 +820,8 @@ our $GRAMMAR = qr{
 |
 [][!"#\$%&'()*+,.\\/:;<=>?\@\^`|~-]
 |
+ \{ [!"#\$%&'()*+,.\\/:;<=>?\@\^`|~-] \}
+ |
 \{ \w++ \}
 |
 (?&PerlBlock)
@@ -1098,7 +1100,7 @@ our $GRAMMAR = qr{
 (?>(?&PPR_quotelike_body_interpolated_unclosed))
 (?&PPR_quotelike_body_interpolated)
 )
- [msixpodualgcer]*+
+ [msixpodualgcern]*+
 ) # End of rule
 ) # End of rule
@@ -1143,7 +1145,7 @@ our $GRAMMAR = qr{
 )
 (?&PPR_quotelike_body_interpolated)
 )
- [msixpodualgc]*+
+ [msixpodualgcn]*+
 ) # End of rule
 ) # End of rule
 (?=
@@ -1160,7 +1162,7 @@ our $GRAMMAR = qr{
 qr \b
 (?> (?= [#] ) | (?! (?>(?&PerlOWS)) => ) )
 (?>(?&PPR_quotelike_body_interpolated))
- [msixpodual]*+
+ [msixpodualn]*+
 ) # End of rule
 (?<PerlRegex>
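The effect of the three changes in the patch can be illustrated without PPR itself. The following is only a minimal sketch using core Perl: the patterns ($doc_old, $mods_new, and so on) are reduced, hypothetical stand-ins for the spots the patch touches, not the real PPR rules, and the punctuation class is deliberately shortened to a few characters.

```perl
use strict;
use warnings;

# Reduced, hypothetical stand-ins for three spots the patch touches.
# These are NOT the real PPR rules, only the smallest patterns that
# show the before/after behaviour.

# 1. Document start: optionally consume one leading BOM (U+FEFF).
my $doc_old = qr/\A \s*+ \w/x;
my $doc_new = qr/\A (?:\x{FEFF})?+ \s*+ \w/x;

# 2. Trailing modifiers of qr//: also allow "n".
my $mods_old = qr/\A [msixpodual]*+ \z/x;
my $mods_new = qr/\A [msixpodualn]*+ \z/x;

# 3. Punctuation variables: accept the braced form as well as the bare one.
#    The class [!\@&;] is a shortened sample of the full punctuation list.
my $var_old = qr/\A \$ (?: [!\@&;] | \{ \w++ \} ) \z/x;
my $var_new = qr/\A \$ (?: [!\@&;] | \{ [!\@&;] \} | \{ \w++ \} ) \z/x;

my @cases = (
    [ 'leading BOM', "\x{FEFF}print 1;", $doc_old,  $doc_new  ],
    [ '/n modifier', 'xn',               $mods_old, $mods_new ],
    [ 'braced $!',   '${!}',             $var_old,  $var_new  ],
);

# Record, for each case, whether the old and the patched pattern match.
my %result;
for my $case (@cases) {
    my ($name, $text, $old, $new) = @$case;
    $result{$name} = {
        old => ($text =~ $old ? 1 : 0),
        new => ($text =~ $new ? 1 : 0),
    };
    printf "%-12s old=%d new=%d\n",
        $name, $result{$name}{old}, $result{$name}{new};
}
```

In this sketch each input is rejected by the "old" pattern and accepted by the "new" one, which mirrors the recognition failures described in the report.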

Thu Aug 17 03:06:33 2017 damian [...] conway.org - Correspondence added

Subject:	Re: [rt.cpan.org #122824] Recognition failures: UTF-8 BOM, /n regex modifier, ${!}
Date:	Thu, 17 Aug 2017 07:05:41 +0000
To:	bug-PPR [...] rt.cpan.org
From:	Damian Conway <damian [...] conway.org>

Hi David,

Thanks for the bug reports...and even more so for the patches! :-)

I've now applied them all for the next release...which may be a few days yet as I'm still on the road, and still working on a particularly nasty corner case with parsing quotelikes.

> Long-term, would it be better to modify many of the instances
> of \s in the regexes to \p{Whitespace}?

I'm not sure that would solve the problem, as I believe the BOM isn't actually included in the Unicode \p{Whitespace} property. Even if it is, even in the latest Perl release:

    "\x{FEFF}" =~ /\p{Whitespace}/

doesn't match.

I might need to look at using [\p{Whitespace}\p{Cf}] instead. I'll need to look at whether that introduces a detectable performance hit though.

And one might argue that non-leading BOMs *ought* to be invalid. :-)

Much appreciated,

Damian
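The point about the BOM and whitespace properties can be checked with a few lines of core Perl. This is only a sketch: it spells the property as \p{White_Space}, on the assumption that this is the property \p{Whitespace} refers to, and the %class and %matched names are purely illustrative.

```perl
use strict;
use warnings;

my $bom = "\x{FEFF}";    # U+FEFF, general category Cf (Format)

# Candidate character classes discussed in the ticket.
my %class = (
    '\s'                      => qr/\s/,
    '\p{White_Space}'         => qr/\p{White_Space}/,
    '\p{Cf}'                  => qr/\p{Cf}/,
    '[\p{White_Space}\p{Cf}]' => qr/[\p{White_Space}\p{Cf}]/,
);

# Record which classes match a lone BOM character.
my %matched;
for my $name (sort keys %class) {
    $matched{$name} = $bom =~ $class{$name} ? 1 : 0;
    printf "%-26s %s\n", $name, $matched{$name} ? 'matches' : 'no match';
}
```

Note that \p{Cf} covers other format characters as well (for example U+00AD SOFT HYPHEN and U+200B ZERO WIDTH SPACE), so a combined class would accept more than just a stray BOM.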

Thu Aug 17 03:06:34 2017 The RT System itself - Status changed from 'new' to 'open'

Mon Aug 21 13:44:38 2017 damian [...] conway.org - Correspondence added

Subject:	Re: [rt.cpan.org #122824] Recognition failures: UTF-8 BOM, /n regex modifier, ${!}
Date:	Mon, 21 Aug 2017 17:43:51 +0000
To:	bug-PPR [...] rt.cpan.org
From:	Damian Conway <damian [...] conway.org>

Resolved in the latest release (0.000012). Thanks again, Damian

Mon Aug 21 13:44:59 2017 DCONWAY [...] cpan.org - Status changed from 'open' to 'resolved'