
This queue is for tickets about the Regexp-Assemble CPAN distribution.

Report information
The Basics
Id: 58256
Status: open
Priority: 0/
Queue: Regexp-Assemble

People
Owner: Nobody in particular
Requestors: dolmen [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.34
Fixed in: (no value)



Subject: \d matches more than [0-9] (unicode)
Here is a test case:

    $ perl -MRegexp::Assemble -C -E 'say Regexp::Assemble->new->add(qw(0 1 2 3 4 5 6 7 8 9))->as_string'

Output of R::A 0.34:

    \d

This is wrong because \d matches more than [0-9]: it matches any Unicode digit, including digits in scripts other than Latin.

For example, \x{0966} is matched by \d:

    $ perl -C -E 'say "Matched! \x{0966}" if "\x{0966}" =~ /^\d$/'

The Java API documentation has a list of ranges of Unicode digits:
http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Character.html#isDigit%28char%29

--
Olivier Mengué - http://o.mengue.free.fr/
On Tue Jun 08 21:58:32 2010, DOLMEN wrote:
> Here is a test case:
> $ perl -MRegexp::Assemble -C -E 'say Regexp::Assemble->new->add(qw(0 1 2 3 4 5 6 7 8 9))->as_string'
>
> Output of R::A 0.34:
> \d
>
> This is wrong because \d matches more than [0-9]: it matches any
> unicode digit, including digits in other writings than latin.
>
> For example, \x{0966} is matched by \d:
> $ perl -C -E 'say "Matched! \x{0966}" if "\x{0966}" =~ /^\d$/'
>
> The Java API documentation has a list of ranges of unicode digits:
> http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Character.html#isDigit%28char%29

Salut Olivier,

you are entirely correct in your analysis. Unfortunately, I wrote the module before I discovered the joys of Unicode.

To deactivate this behaviour and have your pattern work as you expect, you should call the fold_meta_pairs(0) method (feeding it a false value). This may also be passed in as an attribute to the constructor. E.g.:

    $r = Regexp::Assemble->new(fold_meta_pairs=>0);

This will eventually become the default behaviour, probably when I finish writing a faster lexer to pull apart the input patterns.

Thanks for your report,
David
> To deactivate this behaviour and have your pattern work as you expect,
> you should call the fold_meta_pairs(0) method (feeding it a false
> value). This may also be passed in as an attribute to the constructor.

That does not seem to work:

    $ perl -MRegexp::Assemble -E "say Regexp::Assemble->new(fold_meta_pairs=>0)->add(0..9)->as_string"
    \d

This is Regexp::Assemble 0.35.

--
Olivier Mengué - http://search.cpan.org/~dolmen/
http://github.com/dolmen/
Perhaps https://metacpan.org/release/Regexp-Parsertron will help in some way.