Bug #14289 for Regexp-Assemble: Regexp::Assemble can produce invalid regexp (imbalanced parens) out of good parts

Tue Aug 23 23:46:26 2005 Guest - Ticket created

Subject:

Regexp::Assemble can produce invalid regexp (imbalanced parens) out of good parts

Error in Regexp::Assemble 0.15, most likely not fixed in the latest release (0.16). The following code produces a bad regexp out of good parts.

    #! perl -wl
    use Regexp::Assemble;
    my $re = Regexp::Assemble->new;
    while(<DATA>) {
        chomp;
        /\S/ or next;
        tr(/\\)(/)s;
        $_ = quotemeta;
        s((?<!\\/)$)((?=\\/));
        s(\\/)([\\\\/])g;
        print;
        $re->add("^(?i:$_)");
    }
    print $re;
    __DATA__
    Q:\
    C:\WINDOWS\Profiles\Bart\irate
    F:\MP3
    G:\

The output is (displaying each of the regexp parts, as well as the error message):

    Q\:[\\/]
    C\:[\\/]WINDOWS[\\/]Profiles[\\/]Bart[\\/]irate(?=[\\/])
    F\:[\\/]MP3(?=[\\/])
    G\:[\\/]
    Unmatched ( in regex; marked by <-- HERE in m/^( <-- HERE ?:(?:(?i:C\:[\\/]WINDOWS[\\/]Profiles[\\/]Bart[\\/]irate(?=[\\/])|(?i:F\:[\\/]MP3(?=[\\/])))|(?i:G\:[\\/])|(?i:Q\:[\\/]))/ at c:/Perl/site/lib/Regexp/Assemble.pm line 534, <DATA> chunk 1.

The code itself was intended to match recognized root directories on Windows. Just in case you're wondering... :)

Wed Aug 24 04:16:53 2005 dland [...] cpan.org - Taken

Wed Aug 24 04:54:29 2005 dland [...] cpan.org - Correspondence added 20 min

The problem you are encountering is that R::A uses a simple lexer to chop up each pattern, with a (documented :) limitation: it fails to pull apart patterns containing nested parentheses correctly. The patterns you are feeding it do contain nested parens: the trailing zero-width lookahead assertion (?=...) is nested within a (?i:...).

If I change the script a bit we have:

    #! perl -w
    use Regexp::Assemble;
    my $re = Regexp::Assemble->new->debug(1);
    while(<DATA>) {
        chomp;
        /\S/ or next;
        tr(/\\)(/)s;
        $_ = quotemeta;
        s((?<!\\/)$)((?=\\/));
        s(\\/)([\\\\/])g;
        $_ = "^(?i:$_)";
        $re->add( $_ );
    }
    print $re->as_string, "\n";
    __DATA__
    C:\W
    F:\M

This produces:

    _insert_path [^ (?i:C\:[\\/]W(?=[\\/]) )] into []
    at path ()
    added remaining [^ (?i:C\:[\\/]W(?=[\\/]) )]
    _insert_path [^ (?i:F\:[\\/]M(?=[\\/]) )] into [^ (?i:C\:[\\/]W(?=[\\/]) )]
    at path (off=<^> (?i:C\:[\\/]W(?=[\\/]) ))
    at path (^ off=<(?i:C\:[\\/]W(?=[\\/])> ))
    token (?i:F\:[\\/]M(?=[\\/]) not present
    result=^(?:(?i:C\:[\\/]W(?=[\\/])|(?i:F\:[\\/]M(?=[\\/])))

The main thing to note is that the pattern was tokenised as

    ^ (?i:C\:[\\/]W(?=[\\/]) )

Both patterns will reduce and share the ^ and trailing ), leaving the two inner fragments with unbalanced parens. Hence the error.

The main problem comes with the wrapping of the patterns in an all-encompassing (?i:...). If you could munge the strings so as to arrive at, e.g.:

    ^(?i:C\:[\\/]W)(?=[\\/])

it would be tokenised correctly, since the parens are no longer nested:

    ^ (?i:C\:[\\/]W) (?=[\\/])

Now the trailing (?=[\\/]) would be shared among all the source patterns, which will make for a smaller regexp. You should be able to use the flags('i') method to set the /i flag globally for the whole pattern (although it is true that the flags method was ignored for tracked patterns in all versions prior to 0.16).

Getting rid of the (?i:...) wrapper has another benefit. With the wrapper in place, two paths C:/X and C:/Y (keeping path separators out of the picture to simplify the issue) do not assemble to C:/[XY], but rather to (?i:C:/X)|(?i:C:/Y). When testing a string such as C:/Z, the engine has to inspect both alternatives before concluding that a match is impossible, instead of walking down a single pattern and failing at the character class.
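For illustration, the suggestion above might be applied along these lines. This is only a sketch, not a tested patch for the reporter's script. The helper name path_to_pattern is made up for this example. The plain-alternation fallback is there only so the sketch still runs when Regexp::Assemble is not installed. The fragments are built without any (?i:...) wrapper, so each one contains at most one non-nested group, and case-insensitivity is requested once through flags('i').

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: same munging as the original script, but the
# result is NOT wrapped in (?i:...), so no parentheses end up nested.
sub path_to_pattern {
    my ($path) = @_;
    (my $p = $path) =~ tr{/\\}{/}s;          # fold runs of / and \ into one /
    $p = quotemeta $p;                       # "/" becomes "\/", ":" becomes "\:"
    $p .= '(?=\\/)' unless $p =~ m{\\/\z};   # lookahead only if no trailing separator
    $p =~ s{\\/}{[\\\\/]}g;                  # accept either path separator
    return "^$p";
}

my @paths    = ('Q:\\', 'C:\\WINDOWS\\Profiles\\Bart\\irate', 'F:\\MP3', 'G:\\');
my @patterns = map { path_to_pattern($_) } @paths;

# Each fragment must be a valid regexp on its own (qr// dies otherwise).
my @compiled = map { qr/$_/i } @patterns;

my $re;
if ( eval { require Regexp::Assemble; 1 } ) {
    my $ra = Regexp::Assemble->new;
    $ra->flags('i');                         # /i for the whole assembled pattern
    $ra->add($_) for @patterns;
    $re = $ra->re;
}
else {
    # Assumption: module not available, fall back to a plain alternation.
    my $alt = join '|', @patterns;
    $re = qr/(?:$alt)/i;
}

print "$_\n" for @patterns;
print "$re\n";
for my $candidate ('c:\\windows\\profiles\\bart\\irate\\x', 'F:/MP3/', 'F:\\MP3s', 'Z:\\') {
    printf "%-40s %s\n", $candidate, ( $candidate =~ $re ? 'match' : 'no match' );
}
```

Whether the assembled form actually comes out smaller than the plain alternation depends on the installed version of the module, so it is worth comparing the printed pattern against the input fragments.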

Wed Aug 24 04:54:29 2005 dland [...] cpan.org - Status changed from 'new' to 'resolved'