
This queue is for tickets about the re-engine-PCRE CPAN distribution.

Report information
The Basics
Id: 131619
Status: open
Priority: 0/
Queue: re-engine-PCRE

People
Owner: Nobody in particular
Requestors: trichmond [...] proofpoint.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: UTF8 capture problem
Date: Thu, 30 Jan 2020 21:26:46 +0000
To: "bug-re-engine-PCRE [...] rt.cpan.org" <bug-re-engine-PCRE [...] rt.cpan.org>
From: Todd Richmond <trichmond [...] proofpoint.com>
There is a common re::engine::* bug where the RXf_MATCH_UTF8 flag is not being set on the perl regex object to ensure that all captures are correctly computed as UTF8 when the input is UTF8. Setting the flag fixes 2 critical issues:

1. All captures, as well as ${^PREMATCH} and ${^POSTMATCH}, will correctly have their utf8 bits set.
2. $+[0] and $-[0] (offsets of captures) will be computed correctly as utf8 character offsets rather than byte offsets. When these are wrong, it is impossible to compute a substring for the match in the original text instead of using ${^POSTMATCH}, which is required due to a horrific perf problem.

XS code will need to do something like this:

    #ifdef RXf_UTF8
        if (flags & RXf_UTF8)
            extflags |= RXf_MATCH_UTF8;
    #else
        if (SvUTF8(pattern))
            extflags |= RXf_MATCH_UTF8;
    #endif
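To illustrate why byte offsets and character offsets diverge on UTF8 input (issue 2 above), here is a minimal standalone C sketch. It is not Perl's implementation and uses no Perl API; `byte_to_char_offset` is a hypothetical helper written only for this illustration, and it assumes the input is well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the Perl API): return how many UTF-8
 * characters precede byte offset `byte_off` in the string `s`.
 * Assumes `s` is well-formed UTF-8 and at least `byte_off` bytes long. */
static size_t byte_to_char_offset(const char *s, size_t byte_off)
{
    size_t chars = 0;
    size_t i;

    assert(s != NULL);
    for (i = 0; i < byte_off; i++) {
        /* Continuation bytes have the form 10xxxxxx; any other byte
         * starts a new character. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the input "caf\xc3\xa9 bar" (the word "café", a space, then "bar"), a match on "bar" starts at byte offset 6 but at character offset 5, because the "é" occupies two bytes. An engine that reports the byte offset as if it were a character offset would make substr() on the original string return the wrong text.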
Subject: Re: [rt.cpan.org #131619] UTF8 capture problem
Date: Fri, 31 Jan 2020 10:04:42 +0800
To: bug-re-engine-PCRE [...] rt.cpan.org
From: demerphq <demerphq [...] gmail.com>
I am not clear if this is a bug in perl or a bug in the alternate engine.
Date: Fri, 31 Jan 2020 04:00:23 +0000
Subject: Re: [rt.cpan.org #131619] UTF8 capture problem
To: "bug-re-engine-PCRE [...] rt.cpan.org" <bug-re-engine-PCRE [...] rt.cpan.org>
From: Todd Richmond <trichmond [...] proofpoint.com>
It is a bug in re-engine-pcre that also exists in the CPAN pcre2 and re2 projects; only the native perl implementation seems correct. I think this is probably a new flag, so the requirement broke all the 3rd-party RE implementations without people realizing it.

All you need to do is set the flag when you are building a UTF8 regexp object, nothing more. Perl takes care of the rest by checking that flag, adjusting capture offsets, and setting the SV's utf8 bit.

Todd
Subject: Re: [rt.cpan.org #131619] UTF8 capture problem
Date: Fri, 31 Jan 2020 12:31:34 +0100
To: bug-re-engine-PCRE [...] rt.cpan.org
From: demerphq <demerphq [...] gmail.com>
Ok, I see. FWIW, this flag is very old, dating back to when I first made the regex engine pluggable. So this breakage must be old too.

Yves