
This queue is for tickets about the re-engine-RE2 CPAN distribution.

Report information
The Basics
Id: 131618
Status: new
Priority: 0/
Queue: re-engine-RE2

People
Owner: Nobody in particular
Requestors: trichmond [...] proofpoint.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: UTF8 bug fix for issue id 116747
Date: Thu, 30 Jan 2020 20:38:01 +0000
To: "bug-re-engine-RE2 [...] rt.cpan.org" <bug-re-engine-RE2 [...] rt.cpan.org>
From: Todd Richmond <trichmond [...] proofpoint.com>
I finally tracked down the reason why issue 116747 exists, and the fix is trivial. You must set the RXf_MATCH_UTF8 flag on the perl regex object to ensure that all captures are correctly computed as UTF8. This fixes 2 critical issues. re::engine::PCRE has the same bug, so I’ll file it for that project too.

1. All captures, as well as ${^PREMATCH} and ${^POSTMATCH}, will correctly have their utf8 bits set.
2. $+[0] and $-[0] (offsets of captures) will be computed correctly as utf8 character offsets rather than byte offsets. When these are wrong, it is impossible to find a match in the original text, which is critical due to a horrific perf problem when using ${^PREMATCH}.

Note that there is a chance that the #ifdef RXf_UTF8 case won’t compile if that was for older perl versions. The SvUTF8(pattern) case is the one used in the current releases we have. If the RXf_UTF8 case does not compile, you can move the call to options.set_encoding into the 2 different #ifdef blocks.

Todd

*** re2_xs.cc.orig 2020-01-28 08:24:25.175176788 -0800
--- re2_xs.cc 2020-01-28 08:40:01.395615912 -0800
***************
*** 115,124 ****
      // XXX: Need to compile two versions?
      /* The pattern is not UTF-8. Tell RE2 to treat it as Latin1. */
  #ifdef RXf_UTF8
!     if (!(flags & RXf_UTF8))
  #else
!     if (!SvUTF8(pattern))
  #endif
          options.set_encoding(RE2::Options::EncodingLatin1);
      options.set_log_errors(false);
--- 115,127 ----
      // XXX: Need to compile two versions?
      /* The pattern is not UTF-8. Tell RE2 to treat it as Latin1. */
  #ifdef RXf_UTF8
!     if (flags & RXf_UTF8)
!         extflags |= RXf_MATCH_UTF8;
  #else
!     if (SvUTF8(pattern))
!         extflags |= RXf_MATCH_UTF8;
  #endif
+     else
          options.set_encoding(RE2::Options::EncodingLatin1);
      options.set_log_errors(false);
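
Item 2 above depends on the difference between a byte offset and a character offset in UTF-8 text. The following is a minimal standalone sketch of that difference, not code from re2_xs.cc or from perl itself. The helper names utf8_char_offset and check_offsets are hypothetical, and the sketch assumes valid UTF-8 input and an offset that falls on a character boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count the UTF-8 characters (code points) that start
// before byte_offset. Continuation bytes have the bit pattern 10xxxxxx and
// are not counted. Assumes `s` is valid UTF-8 and byte_offset lies on a
// character boundary.
std::size_t utf8_char_offset(const std::string& s, std::size_t byte_offset) {
    std::size_t chars = 0;
    for (std::size_t i = 0; i < byte_offset && i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++chars;
        }
    }
    return chars;
}

// Illustration: the subject is "é=1", where "é" occupies two bytes.
// The digit '1' starts at byte offset 3 but at character offset 2.
void check_offsets() {
    const std::string subject = "\xC3\xA9" "=1";
    assert(utf8_char_offset(subject, 3) == 2);
}
```

If a caller treats the byte offset 3 as a character offset into "é=1", it points past the end of the three-character string, which is the kind of mismatch described for $-[0] and $+[0] when the subject contains multi-byte characters.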