
This queue is for tickets about the Regexp-Assemble CPAN distribution.

Report information
The Basics
Id: 106480
Status: open
Priority: 0/
Queue: Regexp-Assemble

People
Owner: Nobody in particular
Requestors: daxim [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: disable processing on Perl >= 5.10
Quoting https://metacpan.org/pod/release/RGARCIA/perl-5.10.0/pod/perl5100delta.pod#Trie-optimisation-of-literal-string-alternations
> Note: Much code exists that works around perl's historic poor performance on alternations. Often the tricks used to do so will disable the new optimisations. Hopefully the utility modules used for this purpose will be educated about these new optimisations.
Therefore on Perl 5.10 and later, R::A should just return the result of Data::Munge::list2re and let perl optimise the regex instead.
There is a discussion of this in the module's Description: https://metacpan.org/pod/Regexp::Assemble. If it's failing to activate that logic on recent Perls, that's a problem. My immediate reaction is that /I/ am reluctant to disable something just depending on the version of Perl, since some users may not have any, or much, alternation in their code. But since I know so little about the module's internals, please continue to spell out any concerns you have.
In fact, it might be best if end users simply checked their version of Perl (i.e. at runtime) and switched to Data::Munge automatically. Of course, that does suggest I update the Description or Limitations, to cover this case in a bit more detail.
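A minimal sketch of the runtime check suggested above, for readers who want to try it in their own code. `pattern_for` is a hypothetical helper name, not part of Regexp::Assemble or Data::Munge. To stay core-only on modern Perls, the 5.10+ branch builds a plain quoted alternation by hand as a stand-in for Data::Munge::list2re; that the two behave alike is an assumption. Later comments in this ticket argue that a version check alone is not enough for very large lists.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Regexp::Assemble or Data::Munge.
sub pattern_for {
    my @words = @_;
    return qr/(?!)/ unless @words;    # empty list: a pattern that never matches

    if ($] >= 5.010) {
        # Perl 5.10+: hand over a plain literal alternation and let the
        # regex engine apply its own trie optimisation.
        # Core-only stand-in for Data::Munge::list2re (assumption: similar effect).
        my $alt = join '|', map { quotemeta($_) }
                  sort { length($b) <=> length($a) } @words;
        return qr/$alt/;
    }

    # Older Perls: let Regexp::Assemble build the optimised pattern.
    require Regexp::Assemble;
    my $ra = Regexp::Assemble->new;
    $ra->add(map { quotemeta($_) } @words);
    return $ra->re;
}

my $re = pattern_for(qw(foo bar baz));
print "matched\n" if 'xbarx' =~ $re;
```

The check happens in the caller's code, so Regexp::Assemble itself stays unchanged and is only loaded on Perls older than 5.10.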
> If it's failing to activate that logic on recent Perls, that's a problem.
I believe it does not, otherwise I wouldn't have opened this bug.
> /I/ am reluctant to disable something just depending on the version of Perl
There is already much Perl-version dependent code in R::A. That's perfectly normal.
> some users may not have any, or much, alternation in their code
That's wrong. As soon as one add()s two regexes, there is an alternation, so the vast majority of users have them.
> I know so little about the module's internals
It's *your* responsibility as a maintainer to understand the internals. Don't shift that burden onto someone else, e.g. bug reporters. If you need help with the trie regex optimisations from 5.10, you know where to find p5p.
Subject: Re: [rt.cpan.org #106480] disable processing on Perl >= 5.10
Date: Thu, 20 Aug 2015 08:37:42 +1000
To: bug-Regexp-Assemble [...] rt.cpan.org
From: Ron Savage <ron [...] savage.net.au>
Hi Lars

On 19/08/15 20:29, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 via RT wrote:
> Queue: Regexp-Assemble
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=106480 >
>
>> If it's failing to activate that logic on recent Perls, that's a problem.
>
> I believe it does not, otherwise I wouldn't have opened this bug.

OK.

>> /I/ am reluctant to disable something just depending on the version of Perl
>
> There is already much Perl-version dependent code in R::A. That's perfectly normal.

Probably true. I did not study the source.

>> some users may not have any, or much, alternation in their code
>
> That's wrong. As soon as one add()s two regexes, there is an alternation, so the vast majority of users have them.

OK.

>> I know so little about the module's internals
>
> It's *your* responsibility as a maintainer to understand the internals. Don't shift that burden onto someone else, e.g. bug reporters. If you need help with the trie regex optimisations from 5.10, you know where to find p5p.
Ahh. But I'm limited by my skill set, ability and time. And being a volunteer I don't actually have to fix anything :-). But, being a volunteer means I have, err, volunteered, to fix /some/ things, and have done so, but that's not a promise to fix things beyond my understanding.

The other reason I'm reluctant to dive in quickly is that over the years I and other users of Marpa [1] have pondered the writing of a (hopefully) definitive BNF [2] for Perl's regexps, and hence the release of a Marpa-based parser for regexps. That just might give us another, deep, way of understanding regexps which leads to re-writing this module from scratch.

Yes, short-term fixes for current problems are still a reasonable expectation for its users, but as I've said above, either people give me patches or I grind thru the issues as and when I can.

[1] http://savage.net.au/Marpa.html As noted there, there are already quite a few Marpa-based packages in Perl. And just because I host Marpa's home page does not mean I wrote Marpa. I didn't.

[2] Yes, it would have to be Perl-version dependent.

--
Ron Savage - savage.net.au
Subject: Re: [rt.cpan.org #106480] disable processing on Perl >= 5.10
Date: Thu, 20 Aug 2015 16:48:24 +1000
To: bug-Regexp-Assemble [...] rt.cpan.org
From: Ron Savage <ron [...] savage.net.au>
Hi

For alternatives to this module, consider one of:

o Regex::PreSuf
o OnSearch::Regex

I'll add these to the POD.

--
Ron Savage - savage.net.au
On 2015-08-16 13:57:56, DAXIM wrote:
> Quoting https://metacpan.org/pod/release/RGARCIA/perl-5.10.0/pod/perl5100delta.pod#Trie-optimisation-of-literal-string-alternations
>
> > Note: Much code exists that works around perl's historic poor performance on alternations. Often the tricks used to do so will disable the new optimisations. Hopefully the utility modules used for this purpose will be educated about these new optimisations.
>
> Therefore on Perl 5.10 and later, R::A should just return the result of Data::Munge::list2re and let perl optimise the regex instead.
Unfortunately this is not that easy. The trie stuff works well as long as the internally built regexp program fits into some 2^16 tokens. If it's larger, then the regexp engine switches to the old inefficient O(n) algorithm, which is much slower than what Regexp::Assemble would generate. The 2^16 limit may be reached easily with 10000-20000 alternations (it seems to depend on the length of each alternation, maybe also other factors).

By using "use re qw(debug)" it is possible to output the generated regexp programs. Something with "TRIE" is good, but "BRANCH" and "LONGJMP" is bad. See below for examples.

And this is not just theory. At $WORK we had a *massive* performance boost by using Regexp::Assemble for more than 10000 alternations, and letting the trie implementation deal with fewer than this limit.

What can Regexp::Assemble do about all this? I am not aware of an API function in perl's regexp engine to tell whether trie or non-trie regexp programs were generated (using the "use re qw(debug)" output here feels hackish). If there were one, then Regexp::Assemble could theoretically check whether perl's regexp engine can generate a trie and leave it at that.

Maybe more documentation on this topic would be good. But the initial request ("disable processing on Perl >= 5.10") should be rejected.

Regards,
Slaven

# 20000 alternations, non-trie regexp program

    $ perl5.20.2 -MString::Random=random_string -e '$words = shift || 10000; $x = "(" . join("|", map { quotemeta random_string("cccccccccc") } (1..$words)) . ")"; warn "-"x60, "\n"; use re "debug"; $x=qr{$x}' 20000 |& head -10
    ------------------------------------------------------------
    Compiling REx "(habquzbhwx|ibcogqguhs|dpstybzinz|ujwvozcokf|vtbwelkwpp|xpea"...
    Final program:
       1: OPEN1 (3)
       3:   BRANCHJ (11)
       5:     EXACT <habquzbhwx> (9)
       9:     LONGJMP (160001)
      11:   BRANCHJ (19)
      13:     EXACT <ibcogqguhs> (17)
      17:     LONGJMP (160001)

# 10000 alternations, trie regexp program

    $ perl5.20.2 -MString::Random=random_string -e '$words = shift || 10000; $x = "(" . join("|", map { quotemeta random_string("cccccccccc") } (1..$words)) . ")"; warn "-"x60, "\n"; use re "debug"; $x=qr{$x}' 10000 |& head -10
    ------------------------------------------------------------
    Compiling REx "(caaianxnke|ylicpierfd|mahofrbbwv|xvuefbqicm|crzkdzibvo|kdcm"...
    Final program:
       1: OPEN1 (3)
       3:   TRIEC-EXACT[a-z] (50003)
            <caaianxnke>
            <ylicpierfd>
            <mahofrbbwv>
            <xvuefbqicm>
            <crzkdzibvo>
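
The approach described in the comment above (plain alternation for smaller lists, Regexp::Assemble for larger ones) could be sketched in caller code roughly as follows. This is only an illustration: `pattern_for_list` and its `$limit` parameter are hypothetical names, the default of 10_000 is an assumption taken from the figure reported above and would need tuning per workload, and the code falls back to a plain alternation when Regexp::Assemble is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: pick a pattern builder by list size.
# $limit defaults to 10_000, an assumption based on the figure reported above.
sub pattern_for_list {
    my ($words, $limit) = @_;
    $limit = 10_000 unless defined $limit;
    return qr/(?!)/ unless @$words;    # empty list: a pattern that never matches

    # Large lists: the engine may give up on its trie, so pre-assemble
    # the pattern with Regexp::Assemble when it is available.
    if (@$words > $limit && eval { require Regexp::Assemble; 1 }) {
        my $ra = Regexp::Assemble->new;
        $ra->add(map { quotemeta($_) } @$words);
        return $ra->re;
    }

    # Small lists (or no Regexp::Assemble): plain literal alternation,
    # left for perl's own trie optimisation.
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } @$words;
    return qr/$alt/;
}

my $re = pattern_for_list([qw(foo bar baz)]);
print "matched\n" if 'xbarx' =~ $re;
```

Because the real limit seems to depend on the length of each alternative as well as their count, a fixed word count is only a rough proxy; checking the "use re qw(debug)" output for the actual data remains the more reliable test.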
Perhaps https://metacpan.org/release/Regexp-Parsertron will help in some way.