Bug #15353 for PPI: Failure on Unicode byte order mark

Fri Oct 28 01:15:22 2005 Guest - Ticket created

Subject: Failure on Unicode byte order mark

PPI parsing fails if a .pm file starts with the Unicode byte-order mark (BOM -- http://www.unicode.org/faq/utf_bom.html#BOM)

Attached is a simplified Japanese UTF-8 module that uses Locale::Maketext. That file has a BOM that looks like 0xefbbbf, namely the UTF-8 BOM. Note: I gzipped the attachment to prevent RT and/or browsers from mangling the BOM.

If you try to parse that document as follows, you get an error message:

    perl -MPPI::Document -e 'PPI::Document->new("ja.pm")||print"$PPI::Document::errstr\n"'

    Error at line 1, character 0

Perl 5.8.6 handles this file just fine.

-- Chris
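The condition described in the report, a file whose first three bytes are EF BB BF, can be checked with core Perl alone. The following is a minimal sketch, not code from PPI; the helper names `has_utf8_bom` and `slurp_raw` are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PPI): returns 1 if the raw bytes in
# $bytes begin with the UTF-8 byte-order mark EF BB BF, otherwise 0.
sub has_utf8_bom {
    my ($bytes) = @_;
    return substr($bytes, 0, 3) eq "\xEF\xBB\xBF" ? 1 : 0;
}

# Hypothetical helper: read a whole file in raw mode, so that no I/O
# layer decodes or strips the leading bytes before we inspect them.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                # slurp mode: read everything at once
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}
```

With the attachment unpacked, `has_utf8_bom(slurp_raw("ja.pm"))` would be expected to return 1, assuming the file really does start with the UTF-8 BOM as the report states.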

Attachment: ja.pm.gz (application/x-gzip, 197b)


Fri Oct 28 02:49:47 2005 adam [...] phase-n.com - Correspondence added

Date:	Fri, 28 Oct 2005 16:49:20 +1000
From:	Adam Kennedy <adam [...] phase-n.com>
To:	bug-PPI [...] rt.cpan.org
Subject:	Re: [cpan #15353] Failure on Unicode byte order mark
RT-Send-Cc:

PPI does not support Unicode, only the non-English characters from the Latin-1 character set.

Adam K


Mon Oct 31 11:20:44 2005 cpan [...] clotho.com - Correspondence added

[adam@phase-n.com - Fri Oct 28 02:49:47 2005]:

> PPI does not support unicode, only the non-English characters from the
> latin-1 characterset.

Thanks for the clarification, Adam. I've been thinking about this for a couple of days. How about a new token class called PPI::Token::BOM which is a subclass of ::Whitespace? The document would start with its initial state set to ::BOM instead of ::Whitespace. If no BOM was present, it would go on parsing as usual, switching the type to ::Whitespace. In the first version it could accept the UTF-8 BOM and choke on other BOMs.

My reasoning behind this is that most Unicode Perl is only Unicode because the strings contain Unicode. With the exception of the BOM, most UTF-8 documents are PPI-friendly because they use only ASCII outside of strings.

If you think this is a good idea, I'd be happy to write a first-draft patch and test. I've read the code of ::Whitespace, so I do understand the magnitude of this proposed change.

-- Chris
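The proposed first-version behaviour (accept a UTF-8 BOM, choke on any other BOM, otherwise carry on as usual) can be sketched outside of PPI as a small pre-tokenizing step. This is only an illustration under stated assumptions: `split_leading_bom` is a hypothetical function, not PPI's API, and it does not model the token classes themselves.

```perl
use strict;
use warnings;

# Known byte-order marks, longest first so that UTF-32LE (FF FE 00 00)
# is not mistaken for UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical function (not PPI's API): split a leading BOM off the
# raw source bytes. Returns ($bom, $rest); $bom is '' when the source
# has no BOM. Dies on any BOM other than UTF-8, mirroring the
# "accept the UTF-8 BOM and choke on other BOMs" proposal.
sub split_leading_bom {
    my ($source) = @_;
    for my $entry (@BOMS) {
        my ($bytes, $name) = @$entry;
        next unless index($source, $bytes) == 0;
        die "Unsupported BOM: $name\n" unless $name eq 'UTF-8';
        return ($bytes, substr($source, length $bytes));
    }
    return ('', $source);    # no BOM: parse as usual
}
```

In this sketch the returned `$bom` plays the role the proposal gives to a leading BOM token, and `$rest` is what the ordinary whitespace-first parsing would then see.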

Mon Oct 31 11:32:11 2005 adam [...] phase-n.com - Correspondence added

Date:	Tue, 01 Nov 2005 03:31:07 +1100
From:	Adam Kennedy <adam [...] phase-n.com>
To:	bug-PPI [...] rt.cpan.org
Subject:	Re: [cpan #15353] Failure on Unicode byte order mark
RT-Send-Cc:

The main problem here is that there's not much point in supporting one particular character from Unicode if we don't support a more complete subset... or is there?

I'm afraid some of the specifics of the Unicode issues escape me, but that's my main issue... what's the point of just adding BOM?

Adam K


Mon Oct 31 11:57:18 2005 cpan [...] clotho.com - Correspondence added

From: cdolan [...] cpan.org

[adam@phase-n.com - Mon Oct 31 11:32:11 2005]:

> The main problem here is that there's not much point in supporting one
> particular character from unicode if we don't support a more complete
> subset... or is there?
>
> I'm afraid some of the specifics of the unicode issues escape me, but
> that's my main issue... what's the point of just adding BOM?
>
> Adam K

Hi Adam,

BOM support would make Locale::Maketext-based modules parseable. Those contain many L10N strings, but minimal Perl. The .pm file needs to be non-Latin-1 to support the strings, and many editors add the BOM automatically.

Another potential case of Unicode documents that are nearly PPI-parseable is those with Unicode in the POD, but just ASCII in the code. For example, if the author's name is not representable in ASCII.

Looking at PPI::Token::Pod, PPI::Token::Quote::* and PPI::Token::_QuoteEngine*, it looks like they are already as Unicode-friendly as Perl is, since they only scan for special characters instead of validating every character. So, in the simple case of a UTF-8 document that uses the ASCII subset for all code, BOM support is the sole limiting factor for PPI.

Note that you may not see many UTF-8 documents with BOMs on CPAN because localization is usually relegated to the application, not the libraries. So if there is a lack of BOM errors for PPI on CPAN, that may be a selection bias.

Thanks,
-- Chris

Mon Oct 31 12:25:35 2005 adam [...] phase-n.com - Correspondence added

Date:	Tue, 01 Nov 2005 04:24:38 +1100
From:	Adam Kennedy <adam [...] phase-n.com>
To:	bug-PPI [...] rt.cpan.org
Subject:	Re: [cpan #15353] Failure on Unicode byte order mark
RT-Send-Cc:

Not really, I haven't run the tinderbox in a while, but I purged I10N errors from the tinderbox process.

And yeah, a German guy pointed out that for Latin-1 support it only needs to be supported in POD, comments and the quote engine for strings.

He wrote up the Latin-1 unit test scripts.

If you think that the BOM stuff is the only thing stopping the majority of Unicode, then go ahead and try for a patch to it.

If you want, I can add you to the developer list for the parseperl repository and you can just work it up in a branch on the live module?

Adam K


Mon Oct 31 12:29:05 2005 cpan [...] clotho.com - Correspondence added

From: cdolan [...] cpan.org

[adam@phase-n.com - Mon Oct 31 12:25:35 2005]:

> Not really, I haven't run the tinderbox in a while, but I purged I10N
> errors from the tinderbox process.
>
> And yeah, a German guy pointed out that for latin-1 support it only need
> to be supported in POD, comments and the quote engine for strings.
>
> He wrote up the latin-1 unit test scripts.
>
> If you think that the BOM stuff is the only thing stopping the majority
> of Unicode, then go ahead and try for a patch to it.
>
> If you want I can add you to the developer list for the parseperl
> repository and you can just work it up in a branch on the live module?
>
> Adam K

Sounds good to me. For reference, I'm usually chris @ chrisdolan.net. I make no predictions on an ETA for the patch, but I'll try to work on it soon.

Thanks!
-- Chris

Mon Oct 31 12:47:32 2005 adam [...] phase-n.com - Correspondence added

Date:	Tue, 01 Nov 2005 04:46:31 +1100
From:	Adam Kennedy <adam [...] phase-n.com>
To:	bug-PPI [...] rt.cpan.org
Subject:	Re: [cpan #15353] Failure on Unicode byte order mark
RT-Send-Cc:

Timeline is fine. If you contain it in a branch and work at your own pace, however long it takes is totally fine by me.

What is your SourceForge account, and I'll add it to CVS permissions?

Adam K


Mon Oct 31 13:04:01 2005 cpan [...] clotho.com - Correspondence added

From: cdolan [...] cpan.org

[adam@phase-n.com - Mon Oct 31 12:47:32 2005]:

> Timeline is fine, if you contain it in a branch and work at your own
> pace, however long it takes is totally fine by me.
>
> What is your SourceForge account, and I'll add it to CVS permissions?
>
> Adam K

I'm chrisdolan @ SF.

-- Chris

Mon Oct 31 13:09:46 2005 adam [...] phase-n.com - Correspondence added

Date:	Tue, 01 Nov 2005 05:08:51 +1100
From:	Adam Kennedy <adam [...] phase-n.com>
To:	bug-PPI [...] rt.cpan.org
Subject:	Re: [cpan #15353] Failure on Unicode byte order mark
RT-Send-Cc:

OK, added. Go for your life.

Adam K


Mon Oct 31 15:14:03 2005 cpan [...] clotho.com - Correspondence added

From: cdolan [...] cpan.org

I couldn't stop thinking about this, so I implemented it. I committed it on a CVS branch called "Branch_unicode_support".

I tested my patch using Perl::Critic, and it now successfully parses basic UTF-8 files that were failing before. So, if this patch reaches mainline PPI, I consider this bug closed.

Note that my patch unexpectedly caused one test to pass: UTF-8 characters in the middle of barewords. UTF-8 characters at the beginning of barewords still fail. I think that's because \w is already Unicode-friendly in PPI::Token::Word.

-- Chris
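One plausible way such a middle-versus-beginning asymmetry can arise is when a scanner requires an ASCII character to start a word but lets `\w` decide how far the word continues. The sketch below is an assumption made for illustration, not a reading of PPI's source; `scan_word` is a hypothetical function. It relies only on the fact that Perl's `\w` matches non-ASCII word characters in character strings containing code points above 255.

```perl
use strict;
use warnings;

# Hypothetical model of a word scanner, NOT PPI's actual code:
# a word may only START on an ASCII letter or underscore, but once
# started it CONTINUES for as long as \w matches. Returns the scanned
# word, or undef if no word can start at the beginning of $chars.
sub scan_word {
    my ($chars) = @_;
    return $chars =~ /\A([A-Za-z_]\w*)/ ? $1 : undef;
}

my $kanji = "\x{65E5}";    # U+65E5, a non-ASCII word character

# Non-ASCII character in the middle: the whole bareword is consumed.
my $middle = scan_word("foo${kanji}bar");

# Non-ASCII character at the beginning: no word can start here.
my $start = scan_word("${kanji}foo");
```

Under this model, `$middle` holds the full six-character bareword while `$start` is undef, which matches the pattern of one test passing and the other still failing.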

Mon Dec 12 13:24:00 2005 cpan [...] clotho.com - Correspondence added

Hi Adam,

Just a reminder that the BOM code is still in the CVS branch mentioned below. In light of the Unicode improvements that A.Tang has been pushing, I think the BOM code has more relevance.

Best wishes,
Chris

[CLOTHO - Mon Oct 31 15:14:03 2005]:

> I couldn't stop thinking about this, so I implemented it. I committed
> it on a CVS branch called "Branch_unicode_support".
>
> I tested my patch using Perl::Critic and it now succeeds to parse basic
> UTF-8 files that were failing before. So, if this patch reaches
> mainline PPI, I consider this bug closed.
>
> Note that my patch unexpectedly caused one test to pass: UTF-8
> characters in the middle of barewords. UTF-8 characters at the
> beginning of barewords still fails. I think that's because \w is
> already Unicode-friendly in PPI::Token::Words
>
> -- Chris

Tue Dec 13 07:12:35 2005 adam [...] phase-n.com - Correspondence added

Date:	Tue, 13 Dec 2005 23:09:28 +1100
From:	Adam Kennedy <adam [...] phase-n.com>
To:	bug-PPI [...] rt.cpan.org
Subject:	Re: [cpan #15353] Failure on Unicode byte order mark
RT-Send-Cc:

According to Audrey (formerly Autrijus, as of 1 week ago) the code didn't work... so she added her Unicode stuff to the main branch rather than to the branch.

We might want to talk in Freenode #perl6 about this? Any comments?

Adam K


Sat Feb 20 23:01:14 2010 adamk [...] cpan.org - Correspondence added

Confirming this case appears to be resolved.

Sat Feb 20 23:01:19 2010 The RT System itself - Status changed from 'new' to 'open'

Sat Feb 20 23:01:21 2010 adamk [...] cpan.org - Status changed from 'open' to 'resolved'
