This queue is for tickets about the PPI CPAN distribution.

Report information
The Basics
Id: 48265
Status: open
Priority: 0/
Queue: PPI

People
Owner: Nobody in particular
Requestors: JSTENZEL [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 1.203
Fixed in: (no value)



Subject: Special Characters in Formats Make PPI::Document::new() fail
Special characters in Perl formats make PPI::Document::new() fail. The
following script defines a simple format:

 format STDOUT =
 ä@<<<<<<<
   'Name'
 .

 write STDOUT;

When trying to build a PPI document for this script by

 use PPI;
 use PPI::Dumper;

 my $module=new PPI::Document('format.pl');

 my $dumper=PPI::Dumper->new($module)
  or die PPI::Document->errstr;

the constructor fails with

 Fatal error... regex failed to match in 'ä@<<<<<<<
 ' when expected at /.../site_perl/5.10.0/PPI/Token/Word.pm line 178.

The failure is caused by the special character (German umlaut "ä") in
the picture line: PPI::Document::new() succeeds when this character is
removed.

According to perlform, all literal characters are valid in format
definition picture lines: "Picture lines contain output field
definitions, intermingled with literal text." So it would be good if
PPI (and tools based on it) could handle the special character.

Looking at the PPI dump for a variation of the format without the special
character, it seems to me that PPI is not aware of the special format
definition "context". Instead, it seems to treat the tokens as if they
were pure code, interpreting "<<" as an operator, for example:

 PPI::Document
   PPI::Statement
     PPI::Token::Word    'format'
     PPI::Token::Whitespace      ' '
     PPI::Token::Word    'STDOUT'
     PPI::Token::Whitespace      ' '
     PPI::Token::Operator        '='
     PPI::Token::Whitespace      '\n'
     PPI::Token::Cast    '@'
     PPI::Token::Operator        '<<'
     PPI::Token::Operator        '<<'
     PPI::Token::Operator        '<<'
     PPI::Token::Operator        '<'
     PPI::Token::Whitespace      '\n'
     PPI::Token::Whitespace      '  '
     PPI::Token::Quote::Single   ''Name''
     PPI::Token::Whitespace      '\n'
     PPI::Token::Operator        '.'
     PPI::Token::Whitespace      '\n'
     PPI::Token::Whitespace      '\n'
     PPI::Token::Word    'write'
     PPI::Token::Whitespace      ' '
     PPI::Token::Word    'STDOUT'
     PPI::Token::Structure       ';'
   PPI::Token::Whitespace        '\n'

The theory that format definitions are tokenized without treating them
specially is supported by the fact that when the special character is
embedded in quotes, PPI can handle it without problems. (Unfortunately,
this is no workaround, as the quotes are literal characters from the
format's point of view.)

 format STDOUT =
 'ä'@<<<<<<<
   'Name'
 .

 write STDOUT;

Here is the PPI dump of this script:

 PPI::Document
   PPI::Statement
     PPI::Token::Word    'format'
     PPI::Token::Whitespace      ' '
     PPI::Token::Word    'STDOUT'
     PPI::Token::Whitespace      ' '
     PPI::Token::Operator        '='
     PPI::Token::Whitespace      '\n'
     PPI::Token::Quote::Single   ''ä''
     PPI::Token::Cast    '@'
     PPI::Token::Operator        '<<'
     PPI::Token::Operator        '<<'
     PPI::Token::Operator        '<<'
     PPI::Token::Operator        '<'
     PPI::Token::Whitespace      '\n'
     PPI::Token::Whitespace      '  '
     PPI::Token::Quote::Single   ''Name''
     PPI::Token::Whitespace      '\n'
     PPI::Token::Operator        '.'
     PPI::Token::Whitespace      '\n'
     PPI::Token::Whitespace      '\n'
     PPI::Token::Word    'write'
     PPI::Token::Whitespace      ' '
     PPI::Token::Word    'STDOUT'
     PPI::Token::Structure       ';'
   PPI::Token::Whitespace        '\n'

I am using PPI 1.203 with a non-threading perl 5.10.0 under Linux.

Thanks in advance!
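To illustrate what treating a format definition as its own "context" could look like, here is a minimal sketch in plain Perl. It is not PPI code and not a proposed patch: `split_format_blocks` is a hypothetical helper, the header regex is deliberately simplified, and it assumes the format body ends at a line consisting only of a period, as perlform describes. The point is only that the picture and argument lines can be set aside as opaque text before any code tokenizing happens, so literal characters such as "ä" never reach the word matcher.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PPI): split source text into segments,
# marking everything after a "format NAME =" line, up to and including the
# terminating "." line, as opaque format text rather than code.
sub split_format_blocks {
    my ($source) = @_;
    my @segments;        # list of [ kind, text ], kind is 'code' or 'format'
    my $in_format = 0;

    for my $line (split /^/m, $source) {    # keeps the trailing newlines
        my $kind;
        if ($in_format) {
            $kind = 'format';
            # Assumption: a line holding only a period ends the format body.
            $in_format = 0 if $line =~ /^\.[ \t]*$/;
        }
        else {
            $kind = 'code';
            # Simplified header match: "format", optional name, then "=".
            $in_format = 1 if $line =~ /^\s*format(?:\s+[\w:]+)?\s*=\s*$/;
        }

        # Merge consecutive lines of the same kind into one segment.
        if (@segments && $segments[-1][0] eq $kind) {
            $segments[-1][1] .= $line;
        }
        else {
            push @segments, [ $kind, $line ];
        }
    }
    return @segments;
}

# Usage: the failing script from this report ("\x{e4}" is the umlaut).
my $script = "format STDOUT =\n\x{e4}\@<<<<<<<\n  'Name'\n.\n\nwrite STDOUT;\n";
print "$_->[0]\n" for split_format_blocks($script);
```

For the script from this report, the sketch yields three segments: the `format STDOUT =` header as code, the picture line, argument line and terminating period as one format segment, and the `write STDOUT;` statement as code.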
Subject: format_with_special_character.pl
format STDOUT =
ä@<<<<<<<
  'Name'
.

write STDOUT;
Subject: format_without_special_character.pl
format STDOUT =
@<<<<<<<
  'Name'
.

write STDOUT;
Subject: ppiDumper.pl
use strict;
use warnings;

use PPI;
use PPI::Dumper;

my $module=new PPI::Document($ARGV[0]);

my $dumper=PPI::Dumper->new($module)
 or die PPI::Document->errstr;

$dumper->print;
Subject: format_with_quoted_special_character.pl
format STDOUT =
'ä'@<<<<<<<
  'Name'
.

write STDOUT;
Subject: Re: [rt.cpan.org #48265] Special Characters in Formats Make PPI::Document::new() fail
Date: Wed, 29 Jul 2009 02:36:44 +1000
To: bug-PPI [...] rt.cpan.org
From: Adam Kennedy <adamkennedybackup [...] gmail.com>
Thanks for the report.

I plan to take another look at Unicode after the upcoming release is out.

Adam K

2009/7/28 JSTENZEL via RT <bug-PPI@rt.cpan.org>:
> Tue Jul 28 04:52:47 2009: Request 48265 was acted upon.
> Transaction: Ticket created by JSTENZEL
>       Queue: PPI
>     Subject: Special Characters in Formats Make PPI::Document::new() fail
>   Broken in: 1.203
>    Severity: Important
>       Owner: Nobody
>  Requestors: JSTENZEL@cpan.org
>      Status: new
>  Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=48265 >