
This queue is for tickets about the Config-General CPAN distribution.

Report information
The Basics
Id: 113671
Status: resolved
Priority: 0/
Queue: Config-General

People
Owner: Nobody in particular
Requestors: DAMI [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: 2.60



Subject: recognize BOM at start of a utf8 file
Even with option -UTF8, the parser does not accept a UTF8 file that starts with a Byte Order Mark (BOM) = 0xefbbbf. The parser should ignore the BOM, and ideally, option -UTF8 should be automatically turned on if a BOM is found at the start of the file.
hm .. works for me:

    % hexdump x
    0000000 bbef 0abf 6176 2072 203d 4554 5458 000a
    000000f
    % perl -MConfig::General -MData::Dumper -e '%c = Config::General::ParseConfig(-ConfigFile=>'x'); print Dumper(\%c);'
    $VAR1 = { 'ÿ' => undef, 'var' => 'TEXT' };

well, it parses the BOM as content (which it shouldn't), but at least it doesn't croak() or something.

So, please, could you post an example (including a config file, e.g. by uploading it somewhere)?

Thanks,
Tom
Subject: Re: [rt.cpan.org #113671] recognize BOM at start of a utf8 file
Date: Mon, 11 Apr 2016 21:36:07 +0200
To: bug-Config-General [...] rt.cpan.org
From: Laurent Dami <ldami [...] bluewin.ch>
On 11.04.2016 10:49, T. Linden via RT wrote:
> So, please, could you post an example (including a config file, e.g. by uploading it somewhere)?
Here is an example. Parsing $plain works fine, but parsing $with_utf8_bom dies with error:

    Config::General: EndBlock "</foo>" has no StartBlock statement (level: 1, chunk 3)!

==========================
use Config::General;
use Data::Dumper;

my $plain = <<__EOCONF__;
<foo>
\x{e9}=\x{ef}
</foo>
__EOCONF__

my $with_utf8_bom = "\x{ef}\x{bb}\x{bf}" . $plain;

for my $content ($plain, $with_utf8_bom) {
  open my $fh, "<", \$content;
  my $parser = Config::General->new(-ConfigFile => $fh, -UTF8 => 1);
  my %conf = $parser->getall;
  print STDERR Dumper \%conf;
}
ok, I see. I added a fix, could you try the latest SVN if it works for you as well? - Tom
Subject: Re: [rt.cpan.org #113671] recognize BOM at start of a utf8 file
Date: Wed, 13 Apr 2016 22:55:04 +0200
To: bug-Config-General [...] rt.cpan.org
From: Laurent Dami <ldami [...] bluewin.ch>
On 12.04.2016 09:22, T. Linden via RT wrote:
> I added a fix, could you try the latest SVN if it works for you as well?
The fix works for a fake file in a string, as supplied in my previous example. It also works for a real file *without* the -UTF8 => 1 option: the BOM is properly removed ... but then the file isn't parsed as UTF8, which of course is not what we want.

When opening a real file *with* -UTF8 => 1 turned on, the fix doesn't work. This is because the file is first opened with the "<:utf8" layer, and only *then* is the first line read; by that point the :utf8 PerlIO layer has already decoded the BOM into \x{FEFF}.

Furthermore, the proposed fix could possibly alter other lines after the initial line. It's probably harmless, but nevertheless there is a small risk of corrupting the data. So I think the BOM detection should rather happen in _open() instead of _parse().

I just found the module File::BOM on CPAN. I have no experience with it, but from reading the doc it looks like it would be quite helpful here.

Hope this helps ... and thanks for your quick response to the bug report :-)

Laurent D.
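The decoding behaviour described in this message can be reproduced with core Perl alone. The following is a minimal sketch, not code from Config::General: it reads the same BOM-prefixed bytes once through a plain handle and once through a decoding layer (here ":encoding(UTF-8)", assumed to behave like "<:utf8" for well-formed input), using an in-memory filehandle in place of a real file.

```perl
use strict;
use warnings;

# Raw bytes of a UTF-8 file that starts with a BOM (EF BB BF).
my $bytes = "\x{ef}\x{bb}\x{bf}<foo>\n";

# 1) No decoding layer: the BOM arrives as three separate bytes.
open my $raw, '<', \$bytes or die "cannot open in-memory file: $!";
my $raw_line = <$raw>;
close $raw;

# 2) Decoding layer: the three bytes are decoded into one character, U+FEFF.
open my $dec, '<:encoding(UTF-8)', \$bytes or die "cannot open in-memory file: $!";
my $dec_line = <$dec>;
close $dec;

# A byte-oriented check only matches the raw line ...
my $raw_has_bom_bytes = $raw_line =~ /^\xEF\xBB\xBF/ ? 1 : 0;
my $dec_has_bom_bytes = $dec_line =~ /^\xEF\xBB\xBF/ ? 1 : 0;

# ... while the decoded line has to be tested against U+FEFF instead.
my $dec_has_feff = $dec_line =~ /^\x{FEFF}/ ? 1 : 0;

printf "raw: %d chars, BOM bytes seen: %d\n", length $raw_line, $raw_has_bom_bytes;
printf "decoded: %d chars, BOM bytes seen: %d, U+FEFF seen: %d\n",
    length $dec_line, $dec_has_bom_bytes, $dec_has_feff;
```

This is why a substitution written against the three raw bytes stops matching once the handle already decodes UTF-8.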
What a mess.

From my point of view the best would be to revert the proposed fix and keep Config::General as is. Instead you should handle it. I'm with perl monks (http://www.perlmonks.org/?node_id=599772):

> "!" in an ASCII file is also valid. But if you place a "!" at
> the start of your Perl program, it probably will not compile.
> It is a malformed file, not from a UNICODE perspective, but
> from your parser's perspective.

In our case, the parser doesn't handle a BOM, since it is not required. It would also go a little bit too far to introduce a new dependency just for such a rare case. So, this is how you could do it:

    use File::BOM qw(open_bom);
    use Config::General;
    use FileHandle;

    my $fd = FileHandle->new();
    open_bom($fd, $file, ':utf8');
    my $cfg = Config::General->new(-ConfigFile => $fd, [..]);

This works because Config::General supports filehandles as parameter to -ConfigFile. So, File::BOM can handle the issue while Config::General still sees what it expects.

- Tom
Subject: Re: [rt.cpan.org #113671] recognize BOM at start of a utf8 file
Date: Sat, 16 Apr 2016 23:09:30 +0200
To: bug-Config-General [...] rt.cpan.org
From: Laurent Dami <laurent.dami [...] free.fr>
On 14.04.2016 20:27, T. Linden via RT wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=113671 >
>
> What a mess.
>
> From my point of view the best would be to revert the proposed fix and keep Config::General as is. Instead you should handle it. I'm with perl monks (http://www.perlmonks.org/?node_id=599772):

Hi,

I did some research to see what other modules do:

- YAML::Tiny removes the BOM _after_ reading with <:utf8 by doing
      $string =~ s/^\x{FEFF}//;
  where \x{FEFF} is the utf8-decoded version of "\x{ef}\x{bb}\x{bf}"
- YAML.pm does not handle the BOM
- YAML::XS (file "reader.c") checks the BOM; if found it adds an offset to the raw buffer pointer and sets the proper encoding for the parser
- YAML::Syck ignores the BOM
- JSON::XS ignores the BOM
- Template::Provider slurps the entire file, looks at the BOM, removes it if found, and then calls Encode::decode($encoding ...)
- AppConfig::File has no support for Unicode

So I agree, it's a mess! Nevertheless some of those do handle the BOM and it would be nice if Config::General could do it too.

Regarding your answer: leaving the responsibility to the client would only work with one single file; if there are some <<include>> statements, then the client has no control over how those included files will be opened, so it can only be handled properly by the module itself.

I rolled up my sleeves and here is a proposed patch, with some test cases as well. Let me know what you think.

Best regards,
Laurent D.
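Two of the strategies listed in this message can be sketched in a few lines of core Perl. This is an illustration only, not the attached patch and not code taken from any of the modules named: the helper names strip_decoded_bom and decode_utf8_without_bom are hypothetical, and the second helper assumes the input is UTF-8 rather than looking at the BOM to pick an encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Strategy 1 (the substitution quoted for YAML::Tiny): the text has already
# been decoded, so a leading BOM shows up as the single character U+FEFF.
sub strip_decoded_bom {
    my ($string) = @_;
    $string =~ s/^\x{FEFF}//;
    return $string;
}

# Strategy 2 (slurp, remove the BOM, then decode): work on the raw bytes,
# drop a leading EF BB BF if present, and decode what is left.
sub decode_utf8_without_bom {
    my ($bytes) = @_;
    $bytes =~ s/^\xEF\xBB\xBF//;
    return decode('UTF-8', $bytes);
}

# The config from the earlier example, as decoded text and as UTF-8 bytes.
my $expected   = "<foo>\n\x{e9}=\x{ef}\n</foo>\n";
my $decoded_in = "\x{FEFF}" . $expected;
my $raw_in     = "\xEF\xBB\xBF<foo>\n\xC3\xA9=\xC3\xAF\n</foo>\n";

my $from_decoded = strip_decoded_bom($decoded_in);
my $from_raw     = decode_utf8_without_bom($raw_in);

print "decoded-text strategy ok\n" if $from_decoded eq $expected;
print "raw-bytes strategy ok\n"    if $from_raw eq $expected;
```

Both helpers leave input without a BOM untouched, so either one could in principle be applied to every file a parser opens, including files pulled in through include statements.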
Download utf8_bom_t.zip
application/octet-stream 680b

Message body not shown because it is not plain text.

Message body is not shown because sender requested not to inline it.

On Sat Apr 16 17:09:47 2016, laurent.dami@free.fr wrote:
> So I agree, it's a mess! Nevertheless some of those do handle the BOM
> and it would be nice if Config::General could do it too.

Agreed.

> Regarding your answer: leaving the responsibility to the client would
> only work with one single file; if there are some <<include>>
> statements, then the client has no control over how those included files
> will be opened, so it can only be handled properly by the module
> itself.

Indeed, I didn't think of includes :)

> I rolled up my sleeves and here is a proposed patch, with some
> test cases as well. Let me know what you think.

Looks good, but those are not included in the patch:

+t/utf8_bom.t
+t/utf8_bom/bar.cfg
+t/utf8_bom/foo.cfg

Thanks,
Tom

On Mon Apr 18 04:48:35 2016, TLINDEN wrote:
> Looks good, but those are not included in the patch:
My bad, they are contained in the zip file. - Tom
Ok then. I applied the patch (slightly modified), since it works like a charm. Thanks a lot! - Tom