
This queue is for tickets about the re-engine-RE2 CPAN distribution.

Report information
The Basics
Id: 67154
Status: open
Priority: 0/
Queue: re-engine-RE2

People
Owner: Nobody in particular
Requestors: POWERMAN [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: (no value)



Subject: limited support for look-ahead and \G
One typical use of look-ahead is to split a string into parts for independent processing, like this:

    while ($html =~ /<h2>(.*?)(?=<h2|$)/imsg) {
        my $part = $1;
        while ($part =~ /.../g) {}
    }

With re2 this may be rewritten this way:

    while ($html =~ /<h2>(.*?)(<h2|$)/imsg) {
        pos($html) = pos($html) - length($2);
        my $part = $1;
        while ($part =~ /.../g) {}
    }

I think this "special" case can be handled by the re2 module internally, i.e. if re2 detects _one_ look-ahead at the _end_ of a regex, it may replace it with the usual capturing parentheses, and after executing the regexp update pos() and remove the extra $n var (or leave the extra var in place if removing it would be too complex, and just mention this behavior in the doc).

As for \G, I'm not 100% sure, but I remember there were some re2-specific features which may be used to tie a match to some position in the string. If this is true, then, again, as a special case the re2 module can replace \G at the _beginning_ of a regex with a call to a re2-specific function to tie the match to the current pos() value.
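The equivalence claimed above can be checked with Perl's default engine alone. The following is a minimal sketch, not part of re::engine::RE2: the helper names `split_lookahead` and `split_rewind` and the sample input are made up for illustration. The sample deliberately contains no newlines, because under /m the `$` alternative also matches before each newline.

```perl
use strict;
use warnings;

# Original form: the zero-width look-ahead leaves pos() just before the
# next "<h2", so the following iteration starts there.
sub split_lookahead {
    my ($text) = @_;
    my @parts;
    while ($text =~ /<h2>(.*?)(?=<h2|$)/imsg) {
        push @parts, $1;
    }
    return @parts;
}

# Rewritten form: the terminator is captured (and therefore consumed),
# so pos() is moved back by the length of what was captured.
sub split_rewind {
    my ($text) = @_;
    my @parts;
    while ($text =~ /<h2>(.*?)(<h2|$)/imsg) {
        my ($part, $tail) = ($1, $2);
        pos($text) = pos($text) - length($tail);
        push @parts, $part;
    }
    return @parts;
}

my $html = "<h2>One</h2><p>a</p><h2>Two</h2><p>b</p>";
print "$_\n" for split_rewind($html);
```

Both helpers should yield the same list of parts for this input; the second one only relies on capturing groups and pos(), which is the point of the proposal.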
Subject: Re: [rt.cpan.org #67154] limited support for look-ahead and \G
Date: Sat, 2 Apr 2011 20:33:52 +0100
To: bug-re-engine-RE2 [...] rt.cpan.org
From: David Leadbeater <dgl [...] dgl.cx>
On 1 Apr 2011, at 18:08, Alex Efros via RT wrote: [...]
> I think this "special" case can be handled by re2 module internally -
> i.e. if re2 detect _one_ look-ahead at _end_ of regex, it may replace
> it with usual capturing parentheses, and after executing regexp update
> pos() and remove extra $n var (or leave extra var in place if removing
> it will be too complex, just mention this behavior in doc).
I really don't want to get into trying to parse regexps if I can avoid it.
> As for \G, I'm not 100% sure, but I remember there was some re2-
> specific features which may be used to tie match to some position in
> string. If this is true, then, again, as special case re2 module can
> replace \G at _beginning_ of regex with call to re2-specific function
> to tie match to current pos() value.
Again, I really don't want to get into parsing (obviously I could just do something very basic like \G has to start at the first byte, but I'm not sure I like that; a patch might convince me otherwise though ;) ). RE2 does have a FindAndConsume API call; I wonder if something that allows Perl to call this would be nicer (but obviously the interoperability with Perl RE goes away if this is non-standard).

Sorry to be so reluctant, but I don't really feel it's a direction that I'd find that useful, and it's a lot of work which seems like it could be rather fragile and which I don't really want to be maintaining. However, I'd be willing to maybe consider something like a subclass that provides this functionality.

David
Subject: Re: [rt.cpan.org #67154] limited support for look-ahead and \G
Date: Sat, 2 Apr 2011 23:47:23 +0300
To: David Leadbeater via RT <bug-re-engine-RE2 [...] rt.cpan.org>
From: Alex Efros <powerman [...] powerman.name>
Hi!
> Sorry to be so reluctant…
No, you're right, and I was expecting this sort of answer. I generally don't like adding this sort of functionality (and supporting it) either.

But the thing is, we either have vanilla RE2 (without features like these) or a drop-in replacement for perl's default regex engine which should be better than the default engine in _some_ cases.

In the first case users will expect RE2 benefits (like a guaranteed running time linear in the size of the input), but a "fall back" to the default engine for some cases will break this guarantee and defeat the user's expectations. I think there should be a sort of 'strict' mode for this case, in which your module just refuses to compile/execute regexps unsupported by RE2, and thus the user will know: if his app compiles (or runs without runtime errors, depending on your implementation) then all regexps used are compatible with RE2 and perl's default engine will not be used.

In the second case users will expect some 'magic' performance improvement for some of the regexps used. To make this work as well as possible, features like the ones I proposed with \G, look-ahead and probably some other similar special cases should be implemented, just to increase the number of regexps "supported" by RE2 and so increase the effect of the 'magic' speedup.

Moreover, both cases can be implemented at once:

    use re::engine::RE2 qw(strict);

will croak() on unsupported regexps, while

    use re::engine::RE2;

will just try to use RE2 for as many regexps as possible and fall back to the default engine otherwise.

Everything above is just abstract thought about ways to make this module more useful/usable. As for me, I think 'strict' mode is much easier to implement and much more important to have than 'magic' mode with a lot of special cases and optimizations.

--
WBR, Alex.
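The difference between the two proposed modes can be sketched in plain Perl. This is only an illustration and not how re::engine::RE2 decides anything: `unsupported_reason`, `choose_engine` and the `strict` option are hypothetical names, and the textual checks are deliberately naive (they would misread escaped backslashes or text inside character classes). A real implementation would more plausibly rely on RE2 itself rejecting the pattern at compile time.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Deliberately naive, illustrative list of constructs RE2 does not accept.
my @UNSUPPORTED = (
    [ qr/\(\?<?[=!]/, 'look-around assertion' ],
    [ qr/\\[1-9]/,    'backreference' ],
    [ qr/\\G/,        '\G assertion' ],
);

# Hypothetical helper: returns a reason string, or undef if nothing is flagged.
sub unsupported_reason {
    my ($pattern) = @_;
    for my $rule (@UNSUPPORTED) {
        my ($re, $name) = @$rule;
        return $name if $pattern =~ $re;
    }
    return undef;
}

# Hypothetical chooser: returns 're2' or 'perl' in the default mode,
# and croaks instead of falling back when strict => 1 is given.
sub choose_engine {
    my ($pattern, %opt) = @_;
    my $reason = unsupported_reason($pattern);
    return 're2' unless defined $reason;
    croak "pattern /$pattern/ uses a $reason, unsupported by RE2"
        if $opt{strict};
    return 'perl';
}

print choose_engine('<h2>(.*?)(<h2|$)'), "\n";
print choose_engine('<h2>(.*?)(?=<h2|$)'), "\n";
```

In this sketch the default mode silently picks the fallback engine, while `choose_engine($pattern, strict => 1)` dies on the first unsupported construct, so a program that runs cleanly in strict mode never left RE2.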
Subject: Re: [rt.cpan.org #67154] limited support for look-ahead and \G
Date: Sun, 3 Apr 2011 13:21:09 +0300
To: David Leadbeater via RT <bug-re-engine-RE2 [...] rt.cpan.org>
From: Alex Efros <powerman [...] powerman.name>
Hi! On Sat, Apr 02, 2011 at 03:34:04PM -0400, David Leadbeater via RT wrote:
> > As for \G, I'm not 100% sure, but I remember there was some re2-
> > specific features which may be used to tie match to some position in
> RE2 does have a FindAndConsume API call…
BTW, I'm right now updating my RE2 wrapper for OS Inferno (http://code.google.com/p/inferno-re2/) to the latest RE2, and I noticed that the Match() function API was changed: an endpos parameter was added. That happened in the last update, 1 month ago:
http://code.google.com/p/re2/source/detail?r=d9f8806c004d00cbfed45081afd591f78ab22818
The RE2 version included in re-engine-RE2-0.06 is older.

So, it looks like to implement \G it's enough to set endpos=startpos.

--
WBR, Alex.
Subject: Re: [rt.cpan.org #67154] limited support for look-ahead and \G
Date: Sun, 3 Apr 2011 16:56:41 +0100
To: bug-re-engine-RE2 [...] rt.cpan.org
From: David Leadbeater <dgl [...] dgl.cx>
On 2 Apr 2011, at 21:47, Alex Efros via RT wrote:
> Everything above is just abstract thought about ways to make this module
> more useful/usable. As for me, I think 'strict' mode is much easier to
> implement and much more important to have, than 'magic' mode with a lot of
> special cases and optimizations.
Yes, definitely, I've started maintaining a TODO list in the git repo and strict mode is now on that :)

On 3 Apr 2011, at 11:21, Alex Efros via RT wrote: [...]
> So, looks like to implement \G it's enough to set endpos=startpos.
That gets the resuming matches part (although you could previously do that by constructing a StringPiece appropriately albeit with some caveats).

The part of \G that needs support from the regexp engine is telling you where the zero-width assertion was true. I suspect it would be possible to replace \G with an empty capturing group and filter it out from the returned match, but this goes back to how fiddly this all is (and I really don't fancy implementing a regexp engine on top of re2, this should be in if anywhere).
Subject: Re: [rt.cpan.org #67154] limited support for look-ahead and \G
Date: Sun, 3 Apr 2011 19:35:19 +0300
To: David Leadbeater via RT <bug-re-engine-RE2 [...] rt.cpan.org>
From: Alex Efros <powerman [...] powerman.name>
Hi! On Sun, Apr 03, 2011 at 11:56:56AM -0400, David Leadbeater via RT wrote:
> The part of \G that needs support from the regexp engine is telling you
> where the zero-width assertion was true. I suspect it would be possible
Sorry, I didn't get it. If the Match() 'endpos' parameter is what I think it is (the maximum position for the _beginning_ of the match, not for the _end_ of the match) then (if we imagine \G is natively supported by RE2):

    /\Gabc/->Match(s, 0, s.size(), RE2::UNANCHORED, …);

should be the same as:

    /abc/->Match(s, 0, 0, RE2::UNANCHORED, …);

Of course, instead of all 0's we should use the value of perl's pos($s).

--
WBR, Alex.
Subject: Re: [rt.cpan.org #67154] limited support for look-ahead and \G
Date: Sun, 3 Apr 2011 18:13:54 +0100
To: bug-re-engine-RE2 [...] rt.cpan.org
From: David Leadbeater <dgl [...] dgl.cx>
On 3 Apr 2011, at 17:35, Alex Efros via RT wrote: [...]
> Sorry, I didn't get it. If Match() 'endpos' parameter is what I think it is
> (maximum position for _beginning_ of the match, not for _end_ of the match)
> then (if we imagine \G is natively supported by RE2):
Unfortunately that's not what it means. See
http://code.google.com/p/re2/source/diff?spec=svnd9f8806c004d00cbfed45081afd591f78ab22818&r=d9f8806c004d00cbfed45081afd591f78ab22818&format=side&path=/re2/re2.cc#sc_svnd9f8806c004d00cbfed45081afd591f78ab22818_520
It literally means the size of the full string to match.
Subject: Re: [rt.cpan.org #67154] limited support for look-ahead and \G
Date: Sun, 3 Apr 2011 22:17:10 +0300
To: David Leadbeater via RT <bug-re-engine-RE2 [...] rt.cpan.org>
From: Alex Efros <powerman [...] powerman.name>
Hi! On Sun, Apr 03, 2011 at 01:14:05PM -0400, David Leadbeater via RT wrote:
> Unfortunately that's not what it means.
Ahh, I see, thanks. Anyway, this will do the work (I've just tested it):

    pattern.Match(s, currentpos, s.size(), RE2::ANCHOR_START, …)

--
WBR, Alex.
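The idea of an anchored match that starts at the current position can be mimicked in plain Perl, without RE2, to see how it compares with \G. This is a rough sketch under stated assumptions: `match_at`, `tokens_native` and `tokens_emulated` are hypothetical helpers, and matching against a substring is only an approximation, since assertions such as \b or ^ no longer see the text before the offset.

```perl
use strict;
use warnings;

# Hypothetical helper: try $re at exactly $offset in $text, similar in
# spirit to an anchored-at-start match that begins at that offset.
# Returns the end offset of the match, or undef if nothing matches there.
sub match_at {
    my ($re, $text, $offset) = @_;
    my $rest = substr($text, $offset);
    return undef unless $rest =~ /\A(?:$re)/;
    return $offset + $+[0];    # $+[0] is the end of the match within $rest
}

# Reference behaviour: Perl's own \G with /gc.
sub tokens_native {
    my ($text) = @_;
    my @out;
    while ($text =~ /\G([a-z]+|\d+)/gc) {
        push @out, $1;
    }
    return @out;
}

# Emulation: keep our own position and require each match to start there.
sub tokens_emulated {
    my ($text) = @_;
    my $re = qr/[a-z]+|\d+/;
    my ($pos, @out) = (0);
    while (defined(my $end = match_at($re, $text, $pos))) {
        last if $end == $pos;    # guard against zero-length matches
        push @out, substr($text, $pos, $end - $pos);
        $pos = $end;
    }
    return @out;
}

print join(",", tokens_emulated("abc123def")), "\n";
```

For simple patterns like this one the two loops should agree, including stopping at the first character that does not match at the current position.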
On Sun Apr 03 11:56:55 2011, DGL wrote:
> On 2 Apr 2011, at 21:47, Alex Efros via RT wrote:
> > Everything above is just abstract thought about ways to make this module
> > more useful/usable. As for me, I think 'strict' mode is much easier to
> > implement and much more important to have, than 'magic' mode with a lot of
> > special cases and optimizations.
>
> Yes, definitely, I've started maintaining a TODO list in the git repo
> and strict mode is now on that :)
>
> On 3 Apr 2011, at 11:21, Alex Efros via RT wrote:
> [...]
> > So, looks like to implement \G it's enough to set endpos=startpos.
>
> That gets the resuming matches part (although you could previously do
> that by constructing a StringPiece appropriately albeit with some
> caveats).
>
> The part of \G that needs support from the regexp engine is telling
> you where the zero-width assertion was true. I suspect it would be
> possible to replace \G with an empty capturing group and filter it out
> from the returned match, but this goes back to how fiddly this all is
> (and I really don't fancy implementing a regexp engine on top of re2,
> this should be in if anywhere).
I'm curious: is there functionality that's lacking in the Perl core to make this possible, or is this just a PITA to add?

--
Matthew Horsfall (alh)