It is a bug in re-engine-pcre that also exists in the CPAN PCRE2 and RE2 projects; only the native Perl implementation seems correct. I think this is probably a new flag, so the requirement broke all the third-party RE implementations without people realizing it.
All you need to do is set the flag when you are building a UTF8 regexp object, nothing more. Perl takes care of the rest by checking that flag, adjusting the capture offsets, and setting the SV's utf8 bit.
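To illustrate the kind of offset adjustment involved, here is a minimal standalone C sketch. It is not Perl's internal code, and `utf8_char_offset` is a hypothetical helper name used only for this example. It assumes well-formed UTF-8 input and a byte offset that falls on a character boundary.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the perl API): convert a byte offset
 * into a UTF-8 string to a character offset by counting the characters
 * that start within the first byte_off bytes. */
static size_t utf8_char_offset(const char *s, size_t byte_off)
{
    size_t chars = 0;
    for (size_t i = 0; i < byte_off; i++) {
        /* Continuation bytes have the form 10xxxxxx; any other byte
         * begins a new character. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the text "h\xc3\xa9llo" (the letter e with an acute accent takes two bytes), a match on "llo" starts at byte 3 but at character 2, so an engine that reports raw byte offsets for a UTF8 string gives a wrong $-[0].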
Todd
On 1/30/20, 6:05 PM, "demerphq via RT" <bug-re-engine-PCRE@rt.cpan.org> wrote:
<URL: https://rt.cpan.org/Ticket/Display.html?id=131619 >
On Fri, 31 Jan 2020, 06:22 Todd Richmond via RT, <bug-re-engine-PCRE@rt.cpan.org> wrote:
> Thu Jan 30 17:22:16 2020: Request 131619 was acted upon.
> Transaction: Ticket created by trichmond@proofpoint.com
> Queue: re-engine-PCRE
> Subject: UTF8 capture problem
> Broken in: (no value)
> Severity: (no value)
> Owner: Nobody
> Requestors: trichmond@proofpoint.com
> Status: new
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=131619 >
>
>
> There is a common re::engine::* bug where the RXf_MATCH_UTF8 flag is not
> being set on the perl regex object to ensure that all captures are correctly
> computed as UTF8 when the input is UTF8. Setting it fixes 2 critical
> issues:
>
>
> 1. All captures, as well as ${^PREMATCH} and ${^POSTMATCH}, will
> correctly have their utf8 bits set.
> 2. $+[0] and $-[0] (offsets of captures) will be computed correctly as
> character offsets rather than byte offsets. When these are wrong, it is
> impossible to compute a substring for the match in the original text as an
> alternative to ${^POSTMATCH}, and that alternative is required because of a
> horrific perf problem.
>
> XS code will need to do something like this:
>
> #ifdef RXf_UTF8
>     if (flags & RXf_UTF8)
>         extflags |= RXf_MATCH_UTF8;
> #else
>     if (SvUTF8(pattern))
>         extflags |= RXf_MATCH_UTF8;
> #endif
>
I am not clear if this is a bug in perl or a bug in the alternate engine.