Bug #116747 for re-engine-RE2: patch: document a utf-8 flag gotcha

Subject:	patch: document a utf-8 flag gotcha
Date:	Fri, 5 Aug 2016 16:21:01 -0700
To:	<bug-re-engine-RE2 [...] rt.cpan.org>
From:	Philip Guenther <pguenther [...] proofpoint.com>

Having torn my own hair out tracking this down, I would suggest
documenting this bug in how some versions of perl handle regexp
caching vs the utf-8 flag.  Unfortunately, I don't know exactly
which version this was fixed in; I don't see anything directly
related to it in the perldelta files.

Philip Guenther <pguenther@proofpoint.com>


--- lib/re/engine/RE2.pm Sun Jan 18 14:58:24 2015
+++ lib/re/engine/RE2.pm Fri Aug 5 16:16:21 2016
@@ -216,6 +216,48 @@
 The UTF-8 flag of the regexp currently determines how the string is matched.
 This is obviously broken, so will be fixed at some point.
 
+=item * Unicode handling vs perl regexp caching
+
+New enough versions of Perl automatically cache the compiled form of a regexp,
+using the cache when the text of the regexp is the same as before.
+In Perl v5.16.3 (and probably other versions),
+that logic does B<not> take into account whether or not the UTF-8 flag is
+set on the text of the regexp.
+This was fixed between there and Perl v5.20.3.
+
+As a result, with the affected versions of perl,
+code that conditionally sets the UTF-8 flag on the regexp may
+misbehave and use the compiled version with the other setting of the flag.
+For example, consider the following script:
+
+    sub match {
+        my $text = shift;
+        my $re = q{pe(\pL)};
+        if (utf8::is_utf8($text)) {
+            utf8::upgrade($re);
+            #$re = "(?:$re|\\b\\B)"; # uncomment to fix
+        }
+        $text =~ /($re)/ or die "no match";
+        print '$1 is ', (utf8::valid($1) ? "" : "not "), "valid\n";
+    }
+    $f1 = $f2 = "pe\N{cyrillic small letter a}rl";
+    utf8::encode($f1);
+    match($f1);
+    match($f2);
+
+With Perl 5.16.3 that outputs:
+
+    $1 is valid
+    $1 is not valid
+
+Uncommenting the indicated line changes the regexp when setting the
+UTF-8 flag and thereby prevents the reuse of the wrong version of
+the compiled regex, so the output becomes:
+
+    $1 is valid
+    $1 is valid
+
+
 =item * Final newline matching differs to Perl
 
     "\n" =~ /$/