Bug #78434 for Module-Metadata: Rare BOM will mess up package detection

Tue Jul 17 14:19:55 2012 BBYRD [...] cpan.org - Ticket created

Subject:

Rare BOM will mess up package detection

M:M ended up not auto-detecting the package name because of a certain set of circumstances:

1. Notepad++'s "UTF-8" encoding defaults to putting a BOM at the front of the file.
2. My package line is the very first line of the file.
3. I use OurVersion, so the version doesn't have the package name built in.

So, it looks like the RE just needs to detect a 0xEFBBBF at the beginning of the line, or M:M should look for it when the first line is read and strip it out.
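The stripping approach proposed above can be sketched as follows. This is only an illustration: `strip_utf8_bom` is a hypothetical helper, and `$PKG_RE` is a simplified stand-in for a line-anchored package regex, not Module::Metadata's actual pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading UTF-8 BOM (octets EF BB BF) from a
# line that was read as raw octets.
sub strip_utf8_bom {
    my ($line) = @_;
    $line =~ s/\A\xEF\xBB\xBF//;
    return $line;
}

# Simplified stand-in for a package-detection regex anchored at the start of
# the line (an assumption for illustration only).
my $PKG_RE = qr/\A\s*package\s+([\w:]+)\s*;/;

# First line of a file saved with a UTF-8 BOM, read without any decoding.
my $first_line = "\xEF\xBB\xBFpackage My::Module;\n";

my ($before) = $first_line =~ $PKG_RE;                  # no match: BOM is in the way
my ($after)  = strip_utf8_bom($first_line) =~ $PKG_RE;  # matches once the BOM is gone

print defined $before ? $before : '(no match)', "\n";
print "$after\n";
```

The three BOM octets are neither whitespace nor part of the `package` keyword, so an anchored pattern cannot match until they are removed.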

Sun Jul 29 18:51:44 2012 vpit [...] cpan.org - Correspondence added

Thanks for your report.

From perlunicode, perl is supposed to recognize UTF8, UTF16-LE and UTF16-BE BOMs at the beginning of a Perl source file, so I think Module::Metadata should decode the source file appropriately when it sees the BOM.

Thoughts?

Vincent

Sun Jul 29 18:51:45 2012 The RT System itself - Status changed from 'new' to 'open'

Tue Jul 31 14:26:48 2012 BBYRD [...] cpan.org - Correspondence added

On Sun Jul 29 18:51:44 2012, VPIT wrote:

> Thanks for your report.
>
> From perlunicode, perl is supposed to recognize UTF8, UTF16-LE and
> UTF16-BE BOMs at the beginning of a Perl source file, so I think
> Module::Metadata should decode the source file appropriately when it
> sees the BOM.

Nope. I talked with some of the guys on IRC about it, including doy, and there's an important distinction: Perl will decode a source file that it's actually reading/parsing, but reading a file that happens to be Perl source is a different matter. In the latter case, Perl will merely follow what binmode is doing. In the case of Module::Metadata, I would say to detect the BOM at the beginning, and if it exists, remove it. Not even Encode::Guess seems to remove BOMs if they appear in UTF-8 code.

Tue Jul 31 14:48:55 2012 dgl [...] dgl.cx - Correspondence added

Subject:	Re: [rt.cpan.org #78434] Rare BOM will mess up package detection
Date:	Tue, 31 Jul 2012 19:48:43 +0100
To:	bug-Module-Metadata [...] rt.cpan.org
From:	David Leadbeater <dgl [...] dgl.cx>

On 31 July 2012 19:26, Brendan Byrd via RT <bug-Module-Metadata@rt.cpan.org> wrote:

> In the case of Module::Metadata, I would say to detect the BOM at the
> beginning, and if it exists, remove it. Not even Encode::Guess seems to
> remove BOMs if they appear in UTF-8 code.

I think to be correct it would have to decode UTF-16. Note how this does actually work:

    echo "print 'hello world'" | iconv -t utf16 | perl -

However, just stripping the BOM would solve the reported issue.

(Aside: I don't handle UTF-16 in cpangrep; maybe I should, so it's possible to determine if anyone actually is insane enough to use UTF-16 to encode Perl source.)
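To illustrate why UTF-16 input needs decoding rather than only BOM removal, here is a minimal sketch using the core Encode module. The sample source line and the package regex are assumptions made for the example; this is not code from Module::Metadata.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Build UTF-16 octets with a leading BOM, similar in spirit to what
# `iconv -t utf16` produces.
my $octets = encode('UTF-16', "package My::Module;\n");

# Decoding with the BOM-aware 'UTF-16' encoding consumes the BOM and uses it
# to pick the byte order of the remaining octets.
my $text = decode('UTF-16', $octets);

# The decoded text can now be matched like any other source line.
my ($package) = $text =~ /\A\s*package\s+([\w:]+)\s*;/;
print "$package\n";
```

Removing the two BOM octets alone would still leave every character stored as two octets, so a pattern written for plain text would not match.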

Tue Aug 21 14:02:09 2012 vpit [...] cpan.org - Correspondence added

On Tue Jul 31 14:26:48 2012, BBYRD wrote:

> On Sun Jul 29 18:51:44 2012, VPIT wrote:
>
> > Thanks for your report.
> >
> > From perlunicode, perl is supposed to recognize UTF8, UTF16-LE and
> > UTF16-BE BOMs at the beginning of a Perl source file, so I think
> > Module::Metadata should decode the source file appropriately when it
> > sees the BOM.
>
> Nope. I talked with some of the guys on IRC about it, including doy,
> and there's an important distinction: Perl will decode a source file
> that it's actually reading/parsing, but reading a file that happens to
> be Perl source is a different matter. In the latter case, Perl will
> merely follow what binmode is doing.

Except that Module::Metadata is also supposed to be able to extract POD, and handing back octet POD strings to the user is not really useful. For that reason, I think that Module::Metadata should also honour "use utf8" and "=encoding", but that's another matter.

> In the case of Module::Metadata, I would say to detect the BOM at the
> beginning, and if it exists, remove it. Not even Encode::Guess seems to
> remove BOMs if they appear in UTF-8 code.

Starting from version 1.000011, Module::Metadata->new_from_file and ->new_from_module look for a UTF-8/UTF-16LE/UTF-16BE BOM at the beginning of the file, skip it, then decode the rest of the input with the matching encoding. Module::Metadata->new_from_handle is untouched. The decoding part is easily removable if deemed harmful.
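The detect, skip and decode behaviour described above might look roughly like the sketch below. It is not the code shipped in 1.000011: `decode_after_bom` and the `@BOMS` table are hypothetical names, and the sketch works on an in-memory string of octets rather than on a filehandle.

```perl
use strict;
use warnings;
use Encode qw(decode);

# BOM octet sequences and the Encode names used to decode what follows them.
my @BOMS = (
    [ "\xEF\xBB\xBF", 'UTF-8'    ],
    [ "\xFF\xFE",     'UTF-16LE' ],
    [ "\xFE\xFF",     'UTF-16BE' ],
);

# Hypothetical helper: given raw octets, skip a leading BOM and decode the
# remainder accordingly. Input without a BOM is returned unchanged.
sub decode_after_bom {
    my ($octets) = @_;
    for my $entry (@BOMS) {
        my ($bom, $encoding) = @$entry;
        if (substr($octets, 0, length $bom) eq $bom) {
            my $rest = substr($octets, length $bom);
            return decode($encoding, $rest);
        }
    }
    return $octets;
}

# Usage: a UTF-16LE encoded fragment preceded by its BOM.
my $src = decode_after_bom("\xFF\xFEp\x00k\x00g\x00");
print "$src\n";
```

Returning BOM-less input untouched mirrors the point that only files carrying a BOM are treated differently; whether such input should also be decoded is left open in this thread.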

Tue Aug 21 14:44:35 2012 dagolden [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #78434] Rare BOM will mess up package detection
Date:	Tue, 21 Aug 2012 14:43:55 -0400
To:	bug-Module-Metadata [...] rt.cpan.org
From:	David Golden <dagolden [...] cpan.org>

On Tue, Aug 21, 2012 at 2:02 PM, Vincent Pit via RT <bug-Module-Metadata@rt.cpan.org> wrote:

> Except that Module::Metadata is also supposed to be able to extract POD,
> and handing back octet POD strings to the user is not really useful. For
> that reason, I think that Module::Metadata should also honour "use utf8"
> and "=encoding", but that's another matter.

+1 for =encoding. I'm not sure about "use utf8". What does 'perldoc' do?

-- David

Sun Mar 16 03:46:15 2014 ether [...] cpan.org - Correspondence added

Issue open on github for figuring out what to do here: https://github.com/Perl-Toolchain-Gang/Module-Metadata/issues/2
