Subject: \s and \S should not collapse into . (nor should \w / \W or \d / \D)
Hi David,
While reading the Regexp::Assemble documentation, I saw the following
paragraph:
    It also knows about meta-characters than can "absorb" regular
    characters. For instance, given "X\d" and "X5", it knows that 5 can be
    represented by "\d" and so the assembly is just "X\d". The "absorbent"
    meta-characters it deals with are ".", "\d", "\s" and "\W" and their
    complements. It will replace "\d"/"\D", "\s"/"\S" and "\w"/"\W" by "."
    (dot), and it will drop "\d" if "\w" is also present (as will "\D" in
    the presence of "\W").
The fact that '\s' and '\S' are merged into '.' looks like a bug to me,
as the following test script shows:
    use strict;
    use warnings;
    use Test::More;
    use Regexp::Assemble;

    # given a list of strings
    my @str = ( 'a b', 'awb', 'a1b', 'bar', "a\nb" );
    plan tests => 3 * @str;

    for my $meta (qw( s w d )) {

        # given a list of patterns
        my @re = ( "a\\${meta}b", "a\\@{[uc$meta]}b" );

        # produce an assembled pattern
        my $re = Regexp::Assemble->new()->add(@re)->re();

        # test it against the strings
        for my $str (@str) {

            # any match?
            my $ok = '';
            $str =~ $_ && ( $ok = 1 ) for @re;

            # does the assembled regexp match as well?
            my $ptr = $str;
            $ptr =~ s/\\/\\\\/;
            $ptr =~ s/\n/\\n/;
            is( $str =~ $re,
                $ok, "Assembled regexp behaves as the list for \\$meta ($ptr)" );
        }
    }
The execution produces (under Win32 and Linux):
    1..15
    ok 1 - Assembled regexp behaves as the list for \s (a b)
    ok 2 - Assembled regexp behaves as the list for \s (awb)
    ok 3 - Assembled regexp behaves as the list for \s (a1b)
    ok 4 - Assembled regexp behaves as the list for \s (bar)
    not ok 5 - Assembled regexp behaves as the list for \s (a\nb)
    #     Failed test (ra.pl at line 30)
    #          got: ''
    #     expected: '1'
    ok 6 - Assembled regexp behaves as the list for \w (a b)
    ok 7 - Assembled regexp behaves as the list for \w (awb)
    ok 8 - Assembled regexp behaves as the list for \w (a1b)
    ok 9 - Assembled regexp behaves as the list for \w (bar)
    not ok 10 - Assembled regexp behaves as the list for \w (a\nb)
    #     Failed test (ra.pl at line 30)
    #          got: ''
    #     expected: '1'
    ok 11 - Assembled regexp behaves as the list for \d (a b)
    ok 12 - Assembled regexp behaves as the list for \d (awb)
    ok 13 - Assembled regexp behaves as the list for \d (a1b)
    ok 14 - Assembled regexp behaves as the list for \d (bar)
    not ok 15 - Assembled regexp behaves as the list for \d (a\nb)
    #     Failed test (ra.pl at line 30)
    #          got: ''
    #     expected: '1'
    # Looks like you failed 3 tests of 15.
This shows that '.' is not equivalent to the assembly of \s and \S
(nor of \w and \W, nor of \d and \D) when the /s flag is not in effect,
because '.' then does not match "\n".
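The difference can also be seen without Regexp::Assemble at all. The
following is only a minimal sketch in core Perl; the helper name
matches() is made up for the illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $str matches $re, 0 otherwise.
sub matches {
    my ( $str, $re ) = @_;
    return $str =~ $re ? 1 : 0;
}

my $str = "a\nb";

# Without /s, '.' matches any character except a newline.
print 'a.b         => ', matches( $str, qr/a.b/ ), "\n";

# \s matches the newline, so the explicit alternation succeeds.
print 'a(?:\s|\S)b => ', matches( $str, qr/a(?:\s|\S)b/ ), "\n";

# With /s, '.' matches the newline as well.
print 'a.b with /s => ', matches( $str, qr/a.b/s ), "\n";
```

Only the first pattern fails on "a\nb", which is exactly the string
that fails in the test output.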
The fix is to produce '(?:.|\n)' instead of '.' whenever you cannot be
sure that the /s flag is enabled in the resulting regexp. In my opinion,
the only case where such a combination can safely be replaced with a
plain '.' is when /s is explicitly set with the (?s:) construct. I don't
know how this works on platforms that interpret "\n" differently.
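To check the two forms mentioned above against the same strings as in
the test script, something like the following could be used. It is a
sketch only: the hash keys and the agrees_with_pair() helper are names
invented for the example, and it only covers the \s / \S pair.

```perl
use strict;
use warnings;

# Candidate replacements for a merged \s / \S pair (labels are made up).
my %candidate = (
    dot       => qr/a.b/,
    dot_or_nl => qr/a(?:.|\n)b/,
    scoped_s  => qr/a(?s:.)b/,
);

# Hypothetical helper: returns 1 if $re gives the same answer as
# "a\sb or a\Sb" on every sample string, 0 otherwise.
sub agrees_with_pair {
    my ($re) = @_;
    for my $str ( 'a b', 'awb', 'a1b', 'bar', "a\nb" ) {
        my $expected = ( $str =~ /a\sb/ || $str =~ /a\Sb/ ) ? 1 : 0;
        my $got      = ( $str =~ $re ) ? 1 : 0;
        return 0 if $got != $expected;
    }
    return 1;
}

for my $name ( sort keys %candidate ) {
    printf "%-10s %s\n", $name,
        agrees_with_pair( $candidate{$name} ) ? 'ok' : 'differs';
}
```

The plain dot differs on "a\nb", while both '(?:.|\n)' and '(?s:.)'
agree with the original pair on all five strings.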
Regards (and happy new year nonetheless),
-- BooK