Bug #58256 for Regexp-Assemble: \d matches more than [0-9] (unicode)

Tue Jun 08 21:58:32 2010 dolmen [...] cpan.org - Ticket created

Subject: \d matches more than [0-9] (unicode)

Here is a test case:

    $ perl -MRegexp::Assemble -C -E 'say Regexp::Assemble->new->add(qw(0 1 2 3 4 5 6 7 8 9))->as_string'

Output of R::A 0.34:

    \d

This is wrong because \d matches more than [0-9]: it matches any Unicode digit, including digits in scripts other than Latin.

For example, \x{0966} is matched by \d:

    $ perl -C -E 'say "Matched! \x{0966}" if "\x{0966}" =~ /^\d$/'

The Java API documentation has a list of ranges of Unicode digits:
http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Character.html#isDigit%28char%29

-- 
Olivier Mengué - http://o.mengue.free.fr/
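The difference described in the report can be checked with core Perl alone, without Regexp::Assemble. The sketch below is illustrative only: the hash name and its keys are made up for this example, and the /a modifier it uses assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# DEVANAGARI DIGIT ZERO (U+0966), the example character from the report.
my $devanagari_zero = "\x{0966}";

# Illustrative names only: each entry records whether one pattern matches.
my %matched = (
    unicode_d => ( $devanagari_zero =~ /^\d$/    ? 1 : 0 ),  # \d with Unicode rules
    ascii_set => ( $devanagari_zero =~ /^[0-9]$/ ? 1 : 0 ),  # explicit ASCII range
    ascii_d   => ( $devanagari_zero =~ /^\d$/a   ? 1 : 0 ),  # \d restricted by /a
);

for my $name (sort keys %matched) {
    printf "%-10s %s\n", $name, $matched{$name} ? "matches" : "does not match";
}
```

Only the plain \d pattern should match here, which is why folding 0..9 into \d changes what the assembled pattern accepts.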

Wed Jun 16 10:47:46 2010 dland [...] cpan.org - Correspondence added

On Tue Jun 08 21:58:32 2010, DOLMEN wrote:

> Here is a test case:
> $ perl -MRegexp::Assemble -C -E 'say Regexp::Assemble->new->add(qw(0 1 2 3 4 5 6 7 8 9))->as_string'
>
> Output of R::A 0.34:
> \d
>
> This is wrong because \d matches more than [0-9]: it matches any
> Unicode digit, including digits in scripts other than Latin.
>
> For example, \x{0966} is matched by \d:
> $ perl -C -E 'say "Matched! \x{0966}" if "\x{0966}" =~ /^\d$/'
>
> The Java API documentation has a list of ranges of Unicode digits:
> http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Character.html#isDigit%28char%29

Salut Olivier,

you are entirely correct in your analysis. Unfortunately, I wrote the module before I discovered the joys of Unicode.

To deactivate this behaviour and have your pattern work as you expect, you should call the fold_meta_pairs(0) method (feeding it a false value). This may also be passed in as an attribute to the constructor. E.g.:

    $r = Regexp::Assemble->new(fold_meta_pairs=>0);

This will eventually become the default behaviour, probably when I finish writing a faster lexer to pull apart the input patterns.

Thanks for your report,
David

Wed Jun 16 10:47:47 2010 The RT System itself - Status changed from 'new' to 'open'

Wed Jun 16 10:47:47 2010 dland [...] cpan.org - Status changed from 'open' to 'resolved'

Thu Jul 21 05:26:51 2011 dolmen [...] cpan.org - Correspondence added


> To deactivate this behaviour and have your pattern work as you expect,
> you should call the fold_meta_pairs(0) method (feeding it a false
> value). This may also be passed in as an attribute to the constructor.

That does not seem to work:

    perl -MRegexp::Assemble -E "say Regexp::Assemble->new(fold_meta_pairs=>0)->add(0..9)->as_string"
    \d

This is Regexp::Assemble 0.35.

-- 
Olivier Mengué - http://search.cpan.org/~dolmen/ http://github.com/dolmen/
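Until the folding can be switched off, one possible stopgap is to compile the assembled string inside an ASCII-restricted group. This is only a sketch and not something proposed in this ticket: compile_ascii_only is a hypothetical helper, the string '\d' stands in for the as_string output reported above, and (?a:...) assumes Perl 5.14 or later.

```perl
use strict;
use warnings;
use 5.014;    # the (?a:...) construct needs Perl 5.14 or later

# Stand-in for the string Regexp::Assemble returned in the report above.
my $assembled = '\d';

# Hypothetical helper: anchor the pattern and compile it with ASCII-only
# semantics for \d, \s, \w and the POSIX classes.
sub compile_ascii_only {
    my ($pattern) = @_;
    return qr/^(?a:$pattern)$/;
}

my $re = compile_ascii_only($assembled);

# Try an ASCII digit and the Devanagari digit zero from the report.
for my $candidate ( "7", "\x{0966}" ) {
    printf "U+%04X %s\n", ord($candidate),
        ( $candidate =~ $re ? "matches" : "does not match" );
}
```

Note that the ASCII restriction also narrows \s, \w and the POSIX classes inside the group, so this is suitable only when none of the input patterns rely on Unicode matching.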

Thu Jul 21 05:26:53 2011 The RT System itself - Status changed from 'resolved' to 'open'

Wed Feb 07 21:47:36 2018 RSAVAGE [...] cpan.org - Correspondence added

Perhaps https://metacpan.org/release/Regexp-Parsertron will help in some way.