Bug #91798 for PPIx-Regexp: perl_version_introduced bug

Thu Jan 02 12:45:33 2014 monmon [...] cpan.org - Ticket created

CC:	lesamoureuses [...] gmail.com
Subject:	perl_version_introduced bug

Japanese Katakana "ム" is represented by octal code "\343\203\240". Using it and /x, perl_version_introduced value is not right.

===
my $re = PPIx::Regexp->new("qr/\343\203\240/x");
print $re->perl_version_introduced; #=> 5.017009
# If it's PPIx-Regexp-0.032, the result is 5.005.
===

"\343\203\240" is represented by hexadecimal code "\x{E3}\x{83}\x{A0}". The last word "\x{A0}" is interpreted as Whitespace.

===
# use Data::Dumper
'children' => [
  bless( {
    'content' => '
  }, 'PPIx::Regexp::Token::Literal' ),
  bless( {
    'content' => '�'
  }, 'PPIx::Regexp::Token::Literal' ),
  bless( {
    'perl_version_introduced' => '5.017009',
    'content' => '�'
  }, 'PPIx::Regexp::Token::Whitespace' )
],
===
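The byte sequence in the report can be checked independently of PPIx::Regexp. The following is a minimal sketch, assuming only the core Encode module; the variable names are illustrative and not part of the original report. It encodes U+30E0 (KATAKANA LETTER MU, "ム") as UTF-8 and prints each byte in the octal and hexadecimal notations used above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 encode KATAKANA LETTER MU (U+30E0); the result is a 3-byte string.
my $bytes = encode('UTF-8', "\x{30E0}");

# Render every byte as an octal escape and as a braced hex escape.
my @octal = map { sprintf '\\%03o',    ord } split //, $bytes;
my @hex   = map { sprintf '\\x{%02X}', ord } split //, $bytes;

print "octal: @octal\n";   # octal: \343 \203 \240
print "hex:   @hex\n";     # hex:   \x{E3} \x{83} \x{A0}
```

The last byte, 0xA0, is the one that matters here: read as a code point on its own, it is U+00A0 NO-BREAK SPACE.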

Fri Jan 03 13:18:20 2014 wyant [...] cpan.org - Correspondence added

On Thu Jan 02 12:45:33 2014, MONMON wrote:

> Japanese Katakana "ム" is represented by octal code "\343\203\240".
> Using it and /x, perl_version_introduced value is not right.
>
> ===
> my $re = PPIx::Regexp->new("qr/\343\203\240/x");
> print $re->perl_version_introduced; #=> 5.017009
> # If it's PPIx-Regexp-0.032, the result is 5.005.
> ===
>
> "\343\203\240" is represented by hexadecimal code
> "\x{E3}\x{83}\x{A0}".
> The last word "\x{A0}" is interpreted as Whitespace.
>
> ===
> # use Data::Dumper
> 'children' => [
>   bless( {
>     'content' => '
>   }, 'PPIx::Regexp::Token::Literal' ),
>   bless( {
>     'content' => '�'
>   }, 'PPIx::Regexp::Token::Literal' ),
>   bless( {
>     'perl_version_introduced' => '5.017009',
>     'content' => '�'
>   }, 'PPIx::Regexp::Token::Whitespace' )
> ],
> ===

Thank you for your report. It seems to contain all sorts of interesting things.

The first is the perl_version_introduced thing. I believe the correct response is 5.005, because that is when 'qr{}' was introduced. And that is the result produced by demonstration program eg/predump, which I rely on heavily for troubleshooting. But when I cut-and-paste your code into a stand-alone Perl script, I get 5.017009, just as you do, and it is far from obvious to me why. The information that it worked correctly in 0.032 is valuable, because it means I can investigate based on the changes between the two versions.

The second thing is that it looks to me to be desirable for PPIx::Regexp to parse the content of the regexp as a single Unicode character, rather than as three escape sequences. I am not sure how to make that happen, since one of the requirements for the module is that it NOT eval() strings. For the moment it will have to just go on the wish list.

Fri Jan 03 13:18:21 2014 The RT System itself - Status changed from 'new' to 'open'

Sat Jan 04 18:22:54 2014 wyant [...] cpan.org - Correspondence added

It took me a while to understand what was going on, but eventually I got it, I think.

The basic problem was that I should not have implemented the change described as "Allow non-ASCII white space under /x." But I misunderstood what perl5179delta said was happening with non-ASCII white space.

Also, I was using \s to detect white space for the purpose of blessing tokens into PPIx::Regexp::Token::Whitespace rather than PPIx::Regexp::Token::Literal. But \s matches too much. In fact, in the code installed in version 0.033, it was the \s that was matching "\240". So the \s has been replaced by an explicit character class. The contents of this class were verified both by the docs and by actually reading regcomp.c.

These changes are in version 0.036, which just went to PAUSE, and should be appearing on CPAN mirrors in a few hours. I will leave the RT ticket open for a week or so, and then close it if there are no further problems.
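The difference between \s and an explicit class can be illustrated in isolation. This is only a sketch: the class below is a hypothetical ASCII-only whitespace class chosen for illustration, not necessarily the one that went into PPIx::Regexp 0.036, and the /u modifier (Perl 5.14 or later) is used to force the Unicode rules under which \s matches "\240".

```perl
use strict;
use warnings;

# "\xA0" is the same byte as "\240": U+00A0 NO-BREAK SPACE.
my $nbsp = "\xA0";

# Under Unicode rules (/u, Perl 5.14+), \s matches NO-BREAK SPACE.
my $matches_s = $nbsp =~ /\A\s\z/u ? 1 : 0;

# Hypothetical explicit class (an assumption, not the module's actual class):
# ASCII space, tab, newline, carriage return, form feed only.
my $ascii_ws = qr/[ \t\n\r\f]/;
my $matches_class = $nbsp =~ /\A$ascii_ws\z/ ? 1 : 0;

print "\\s under /u:   $matches_s\n";      # \s under /u:   1
print "explicit class: $matches_class\n";  # explicit class: 0
```

An explicit class trades generality for predictability: what counts as white space no longer depends on the Perl version or on which matching rules happen to be in effect for the string.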

Sat Jan 04 18:22:54 2014 wyant [...] cpan.org - Status changed from 'open' to 'patched'

Sat Jan 04 20:22:52 2014 monmon [...] cpan.org - Correspondence added

RT-Send-CC:	lesamoureuses [...] gmail.com

Thank you so much for looking into this issue! It has been resolved!

On Sat Jan 04 18:22:54 2014, WYANT wrote:

> It took me a while to understand what was going on, but eventually I
> got it, I think.
>
> The basic problem was that I should not have implemented the change
> described as "Allow non-ASCII white space under /x." But I
> misunderstood what perl5179delta said was happening with non-ASCII
> white space.
>
> Also, I was using \s to detect white space for the purpose of blessing
> tokens into PPIx::Regexp::Token::Whitespace rather than
> PPIx::Regexp::Token::Literal. But \s matches too much. In fact, in the
> code installed in version 0.033, it was the \s that was matching
> "\240". So the \s has been replaced by an explicit character class.
> The contents of this class were verified both by the docs and by
> actually reading regcomp.c.
>
> These changes are in version 0.036, which just went to PAUSE, and
> should be appearing on CPAN mirrors in a few hours. I will leave the
> RT ticket open for a week or so, and then close it if there are no
> further problems.

Sat Jan 11 18:29:15 2014 wyant [...] cpan.org - Status changed from 'patched' to 'resolved'

Sat Jan 11 18:29:16 2014 wyant [...] cpan.org - Fixed in 0.036 added

Sat Jan 11 18:29:16 2014 wyant [...] cpan.org - Fixed in 0.032 deleted
