Subject: patch: document a utf-8 flag gotcha
Date: Fri, 5 Aug 2016 16:21:01 -0700
To: <bug-re-engine-RE2 [...] rt.cpan.org>
From: Philip Guenther <pguenther [...] proofpoint.com>
Having torn my own hair out tracking this down, I would suggest
documenting this bug in how some versions of perl handle regexp caching vs
the utf-8 flag. Unfortunately, I don't know exactly which version this
was fixed in; I don't see anything directly related to it in the perldelta
files.
Philip Guenther
<pguenther@proofpoint.com>
--- lib/re/engine/RE2.pm Sun Jan 18 14:58:24 2015
+++ lib/re/engine/RE2.pm Fri Aug 5 16:16:21 2016
@@ -216,6 +216,48 @@
The UTF-8 flag of the regexp currently determines how the string is matched.
This is obviously broken, so will be fixed at some point.
+=item * Unicode handling vs perl regexp caching
+
+New enough versions of Perl automatically cache the compiled form of a regexp,
+using the cache when the text of the regexp is the same as before.
+In Perl v5.16.3 (and probably other versions),
+that logic does B<not> take into account whether or not the UTF-8 flag is
+set on the text of the regexp.
+This was fixed at some point between that version and Perl v5.20.3.
+
+As a result, with the affected versions of Perl,
+code that conditionally sets the UTF-8 flag on the regexp may
+misbehave by reusing the regexp compiled under the other setting of the flag.
+For example, consider the following script:
+
+ sub match {
+     my $text = shift;
+     my $re = q{pe(\pL)};
+     if (utf8::is_utf8($text)) {
+         utf8::upgrade($re);
+         #$re = "(?:$re|\\b\\B)"; # uncomment to fix
+     }
+     $text =~ /($re)/ or die "no match";
+     print '$1 is ', (utf8::valid($1) ? "" : "not "), "valid\n";
+ }
+ $f1 = $f2 = "pe\N{cyrillic small letter a}rl";
+ utf8::encode($f1);
+ match($f1);
+ match($f2);
+
+With Perl 5.16.3 that outputs:
+
+ $1 is valid
+ $1 is not valid
+
+Uncommenting the indicated line changes the text of the regexp whenever
+the UTF-8 flag is set, which prevents reuse of the wrong compiled
+version, so the output becomes:
+
+ $1 is valid
+ $1 is valid
+
+
=item * Final newline matching differs to Perl
"\n" =~ /$/