Subject: UTF8 bug fix for issue id 116747
Date: Thu, 30 Jan 2020 20:38:01 +0000
To: "bug-re-engine-RE2 [...] rt.cpan.org" <bug-re-engine-RE2 [...] rt.cpan.org>
From: Todd Richmond <trichmond [...] proofpoint.com>
I finally tracked down the reason why issue 116747 exists, and the fix is trivial: you must set the RXf_MATCH_UTF8 flag on the Perl regex object to ensure that all captures are correctly computed as UTF8. re::engine::PCRE has the same bug, so I’ll file it for that project too. There are 2 critical issues that are fixed by this:
1. All captures as well as ${^PREMATCH} and ${^POSTMATCH} will correctly have their utf8 bits set
2. $+[0] and $-[0] (offsets of captures) will be computed correctly as utf8 character offsets rather than byte offsets. When these are wrong, it is impossible to find the match in the original text, and being able to do so is critical because of a horrific perf problem when using ${PREMATCH} instead
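To illustrate why item 2 matters, here is a minimal C++ sketch of the difference between a byte offset and a utf8 character offset. It is not part of the patch and is not how perl computes offsets internally; the helper name byte_to_char_offset is hypothetical and exists only for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only, not from re2_xs.cc).
// Returns the number of UTF-8 characters that begin before byte position
// `byte_off` in `s`. Continuation bytes have the bit pattern 10xxxxxx, so
// every byte that does NOT match that pattern starts a new character.
// Assumes `s` holds well-formed UTF-8.
std::size_t byte_to_char_offset(const std::string& s, std::size_t byte_off) {
    std::size_t chars = 0;
    for (std::size_t i = 0; i < byte_off && i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++chars;
    }
    return chars;
}
```

For the 9-byte string "caf\xC3\xA9 bar" (the word "café", a space, then "bar"), a match on "bar" starts at byte offset 6 but at character offset 5. A caller that applies the byte offset to the character string lands one position too far.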
Note that there is a chance that the #ifdef RXf_UTF8 case won’t compile if that branch was meant for older perl versions. The SvUTF8(pattern) case is the one that is used in the current releases we have. If the RXf_UTF8 case does fail to compile, you can move the call to options.set_encoding into the 2 different #ifdef blocks
Todd
*** re2_xs.cc.orig 2020-01-28 08:24:25.175176788 -0800
--- re2_xs.cc 2020-01-28 08:40:01.395615912 -0800
***************
*** 115,124 ****
// XXX: Need to compile two versions?
/* The pattern is not UTF-8. Tell RE2 to treat it as Latin1. */
#ifdef RXf_UTF8
! if (!(flags & RXf_UTF8))
#else
! if (!SvUTF8(pattern))
#endif
options.set_encoding(RE2::Options::EncodingLatin1);
options.set_log_errors(false);
--- 115,127 ----
// XXX: Need to compile two versions?
/* The pattern is not UTF-8. Tell RE2 to treat it as Latin1. */
#ifdef RXf_UTF8
! if (flags & RXf_UTF8)
! extflags |= RXf_MATCH_UTF8;
#else
! if (SvUTF8(pattern))
! extflags |= RXf_MATCH_UTF8;
#endif
+ else
options.set_encoding(RE2::Options::EncodingLatin1);
options.set_log_errors(false);