Bug #130048 for Text-CSV-Hashify: Text::CSV::Hashify feature request

Thu Jul 11 09:47:59 2019 guido [...] gvr.org - Ticket created

Subject:	Text::CSV::Hashify feature request
Date:	Thu, 11 Jul 2019 15:38:49 +0200
To:	bug-Text-CSV-Hashify [...] rt.cpan.org
From:	Guido van Rooij <guido [...] gvr.org>

Hi, First of all, thanks for a nice package! I do have a request. Currently it is not possible to process a CSV file with a Byte Order Marker (BOM), at least that is how I understand it. In Text::CSV this is possible but only in the header method which is not used in Text::CSV::Hashify. So my request is to add an extra attribute to the new() method (detect_bom) that does nothing more than check after the first call to getline() to get the headers, if a BOM is present in the first element of @{$header_ref}. Does that seem a good addition? If so, I can come up with a patch. If not: what would be your solution? Thanks, -Guido

Thu Jul 11 12:17:28 2019 jkeenan [...] cpan.org - Correspondence added

On Thu Jul 11 09:47:59 2019, guido@gvr.org wrote: Show quoted text

> Hi, > > First of all, thanks for a nice package! > > I do have a request. Currently it is not possible to process a CSV > file with a Byte Order Marker (BOM), at least that is how I understand > it. > In Text::CSV this is possible but only in the header method which is > not used in Text::CSV::Hashify. > > So my request is to add an extra attribute to the new() method > (detect_bom) that does nothing more than check after the first call to > getline() to get the headers, if a BOM is present in the first element of > @{$header_ref}. > > Does that seem a good addition? If so, I can come up with a patch. > If not: what would be your solution? > > Thanks, > > -Guido

Would you be able to *attach* to this ticket a CSV file (containing non-confidential data) that begins with a BOM? We would need something like this to write a regression test. Thank you very much. Jim Keenan

Thu Jul 11 12:17:29 2019 The RT System itself - Status changed from 'new' to 'open'

Fri Jul 12 03:27:25 2019 guido [...] gvr.org - Correspondence added

Subject:	Re: [rt.cpan.org #130048] Text::CSV::Hashify feature request
Date:	Fri, 12 Jul 2019 09:27:13 +0200
To:	James E Keenan via RT <bug-Text-CSV-Hashify [...] rt.cpan.org>
From:	Guido van Rooij <guido [...] gvr.org>

On Thu, Jul 11, 2019 at 12:17:29PM -0400, James E Keenan via RT wrote: Show quoted text

> > Would you be able to *attach* to this ticket a CSV file (containing non-confidential data) that begins with a BOM? We would need something like this to write a regression test.

I've attached it to this message. The byte sequence is 0xef,0xbb,0xbf which is UTF-8 for UTF-16 U+FEFF. When read by Text::CSV::Hashify, the first column column name is prepended with the UTF-16 sequence. -Guido

Message body is not shown because sender requested not to inline it.

Tue Jul 16 13:57:40 2019 jkeenan [...] pobox.com - Correspondence added

CC:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>, Ishigaki [...] cpan.org, makamaka [...] cpan.org
Subject:	Re: [rt.cpan.org #130048] Text::CSV::Hashify feature request
Date:	Tue, 16 Jul 2019 13:57:27 -0400
To:	bug-Text-CSV-Hashify [...] rt.cpan.org
From:	James E Keenan <jkeenan [...] pobox.com>

On 7/11/19 9:48 AM, Guido van Rooij via RT wrote: Show quoted text

> Thu Jul 11 09:47:59 2019: Request 130048 was acted upon. > Transaction: Ticket created by guido@gvr.org > Queue: Text-CSV-Hashify > Subject: Text::CSV::Hashify feature request > Broken in: (no value) > Severity: (no value) > Owner: Nobody > Requestors: guido@gvr.org > Status: new > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=130048 > > > > Hi, > > First of all, thanks for a nice package! > > I do have a request. Currently it is not possible to process a CSV > file with a Byte Order Marker (BOM), at least that is how I understand > it. > In Text::CSV this is possible but only in the header method which is > not used in Text::CSV::Hashify. > > So my request is to add an extra attribute to the new() method > (detect_bom) that does nothing more than check after the first call to > getline() to get the headers, if a BOM is present in the first element of > @{$header_ref}. > > Does that seem a good addition? If so, I can come up with a patch. > If not: what would be your solution? > > Thanks, > > -Guido >

I looked into implementing this feature today and saw that it would not be as straightforward as I had hoped. When I wrote Text::CSV::Hashify several years ago, Text::CSV and Text::CSV_XS apparently did not have BOM-detection functionality. So I never had to write tests for Text::CSV::Hashify to demonstrate the possibility of "passing BOM-detection through" to Text::CSV. But I see now that there has been considerable new functionality added to those two modules which Hashify cannot yet handle. Part of the difficulty is that, AFAICT, 'detect_bom' is implemented as a key-value pair in a hash passed as the second argument to $csv->header(): ##### $csv->header($IN, { detect_bom => 1, other_attribute => 0 }); ##### Moreover, if I read the documentation correctly, 'detect_bom' is set to true by default. Now, Text::CSV::Hashify has never used the header() method internally. Text::CSV::Hashify::new(): * examines all its arguments; * processes, then sets aside, those not needed for Text::CSV_XS::new(); * calls the latter new(); * then uses $csv->getline() to read the first row *without immediately treating it as a "header" row*; * does some QA on the first row; * uses $csv->column_names() to set the column names. Text::CSV::header(), which I don't use, appears to be a multi-purpose method. It can override values passed to Text::CSV::new() for, e.g., 'sep_char' and it turns BOM-detection on unless you specifically request to turn it off. So it is a method with multiple side effects, which makes it potentially dangerous -- or, at the very least, there is a fair chance that it will not DWIM. So I actually like the fact that Text::CSV::Hashify currently does *not* use header() -- even though that's what prevents BOM-detection. I am afraid that if we include a call to $csv->header() somewhere inside Text::CSV::Hashify::new(), then attributes of the Text::CSV_XS object inside the Text::CSV::Hashify object would change in unforeseeable ways. If I were implementing BOM-detection for Text::CSV, I would not have it turned on by default and I would have passed it to new() rather than header(). That would have facilitated subclassing. I am cc-ing the maintainers of Text::CSV and Text::CSV::Hashify for their thoughts on this. Thank you very much. Jim Keenan

Tue Jul 16 15:49:21 2019 h.m.brand [...] xs4all.nl - Correspondence added

CC:	bug-Text-CSV-Hashify [...] rt.cpan.org, Ishigaki [...] cpan.org, makamaka [...] cpan.org
Subject:	Re: [rt.cpan.org #130048] Text::CSV::Hashify feature request
Date:	Tue, 16 Jul 2019 21:48:57 +0200
To:	James E Keenan <jkeenan [...] pobox.com>
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Tue, 16 Jul 2019 13:57:27 -0400, James E Keenan <jkeenan@pobox.com> wrote: Show quoted text

> I am cc-ing the maintainers of Text::CSV and Text::CSV::Hashify for > their thoughts on this.

$csv->header (...) is indeed a method that combines several "features" into a single call, because separating them would only complicate the tasks at hand. In my daily work, most CSV files have a "header" row, which defines the column names. This row usually has the same number of fields as the rest of the data and "most of the time" (TM) these fields do not have newlines. I hope no one in a sane mindset will ever try to add a header line where the fields have embedded newlines, but we will probably meet someone who does just to make life of others more difficult (or it is a manager who thinks that a full description of the column makes more sense, but those people should be forbidden to touch a keyboard. Though Unicode permits Byte Order Marks halfway a stream, it is very uncommon. Where one could see it is if two stream that both have a BOM are catenated, but for each stream, a BOM has to be the first byte sequence of the stream, which coincidentally is nice to deal with on the first (read header-) line. So dealing with BOM, which *also* has impact on the content of the fields in the first row, together with reading this line as header *and* (auto)detect the line ending and (optionally) field separator makes all the sense in the world (at least to me). Note that reading the first row in UTF-16 or UTF-32 and *not* looking at the BOM, may result in bytes running over to the next line/record invalidating it. I hope I chose the defaults to be the most DWIM: • BOM detection: true • Set column names for hash treatment • Map column names to lower case The last one is based on most of (my) CSV files being table-exports or selections from databases, and ANSI tells the field names to be case insensitive by default. This makes the use of Oracle exports on PostgreSQL imports easier. Each of those options can be overruled and/or abbreviated. I personally think the ->header method is now stable and unlikely to change. *if* new options are added, those will have to be backward compatible. Te options that are currently in, are there since the first commit. The only thing that visibly changed is the addition of some abbreviations, specifically "munge" for "munge_column_names". I *do* see your point in ->header not honoring the setting of sep or sep_char in ->new, but if you know that the new call had this attr define, you could force it to be used in ->header my $sep = "\t"; my $csv = Text::CSV_XS->new ({ sep_char => $sep }); my @hdr = $csh->header ($fh, [ $sep ], { munge => "lc" }); as now $sep is the only allowed separator to be detected, it can not be (re)set to anything else. Most of what I read in the Text::CSV::Hashify manual is relatively easy to do in Text::CSV_XS (and Text::CSV). I'd use the csv function use Text::CSV::Hashify; $obj = Text::CSV::Hashify->new ({ file => "/path/to/file.csv", format => "hoh", # hash of hashes, which is default key => "id", # needed except when format is 'aoh' max_rows => 20, # number of records to read; defaults to all # ... other key-value pairs as appropriate from Text::CSV }); $hash_ref = $obj->all; => use Text::CSV_XS "csv"; $hash_ref = csv ( in => "path/to/file", bom => 1, # Implies ->header call for BOM detection and header key => "id", fragment => "row=1-20", # ... other key-value pairs for Text::CSV_XS ); (where I must add the remark that in the *current* implementation, the combination of both key and fragment attribute will cause the key attribute to be ignored, which I consider a bug. Please file a ticket if you agree :) Does this all make sense? -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.29 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

Download (untitled)
application/pgp-signature 488b

Message body not shown because it is not plain text.

Thu Aug 08 07:17:46 2019 guido [...] gvr.org - Correspondence added

Subject:	Re: [rt.cpan.org #130048] Text::CSV::Hashify feature request
Date:	Thu, 8 Aug 2019 13:17:20 +0200
To:	"h.m.brand [...] xs4all.nl via RT" <bug-Text-CSV-Hashify [...] rt.cpan.org>
From:	Guido van Rooij <guido [...] gvr.org>

On Tue, Jul 16, 2019 at 03:49:22PM -0400, h.m.brand@xs4all.nl via RT wrote: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=130048 > > > On Tue, 16 Jul 2019 13:57:27 -0400, James E Keenan <jkeenan@pobox.com> > wrote: >

> > I am cc-ing the maintainers of Text::CSV and Text::CSV::Hashify for > > their thoughts on this.

> > $csv->header (...) is indeed a method that combines several "features" > into a single call, because separating them would only complicate the > tasks at hand. > > In my daily work, most CSV files have a "header" row, which defines > the column names. This row usually has the same number of fields as > the rest of the data and "most of the time" (TM) these fields do not > have newlines. I hope no one in a sane mindset will ever try to add > a header line where the fields have embedded newlines, but we will > probably meet someone who does just to make life of others more > difficult (or it is a manager who thinks that a full description of > the column makes more sense, but those people should be forbidden to > touch a keyboard. > > Though Unicode permits Byte Order Marks halfway a stream, it is very > uncommon. Where one could see it is if two stream that both have a > BOM are catenated, but for each stream, a BOM has to be the first > byte sequence of the stream, which coincidentally is nice to deal > with on the first (read header-) line. > > So dealing with BOM, which *also* has impact on the content of the > fields in the first row, together with reading this line as header > *and* (auto)detect the line ending and (optionally) field separator > makes all the sense in the world (at least to me). Note that reading > the first row in UTF-16 or UTF-32 and *not* looking at the BOM, may > result in bytes running over to the next line/record invalidating it. > > I hope I chose the defaults to be the most DWIM: > > • BOM detection: true > • Set column names for hash treatment > • Map column names to lower case > > The last one is based on most of (my) CSV files being table-exports > or selections from databases, and ANSI tells the field names to be > case insensitive by default. This makes the use of Oracle exports on > PostgreSQL imports easier. Each of those options can be overruled > and/or abbreviated. > > I personally think the ->header method is now stable and unlikely to > change. *if* new options are added, those will have to be backward > compatible. Te options that are currently in, are there since the > first commit. The only thing that visibly changed is the addition of > some abbreviations, specifically "munge" for "munge_column_names". > > I *do* see your point in ->header not honoring the setting of sep or > sep_char in ->new, but if you know that the new call had this attr > define, you could force it to be used in ->header > > my $sep = "\t"; > my $csv = Text::CSV_XS->new ({ sep_char => $sep }); > my @hdr = $csh->header ($fh, [ $sep ], { munge => "lc" }); > > as now $sep is the only allowed separator to be detected, it can not > be (re)set to anything else. > > Most of what I read in the Text::CSV::Hashify manual is relatively > easy to do in Text::CSV_XS (and Text::CSV). I'd use the csv function > > use Text::CSV::Hashify; > $obj = Text::CSV::Hashify->new ({ > file => "/path/to/file.csv", > format => "hoh", # hash of hashes, which is default > key => "id", # needed except when format is 'aoh' > max_rows => 20, # number of records to read; defaults to all > # ... other key-value pairs as appropriate from Text::CSV > }); > $hash_ref = $obj->all; > > => > > use Text::CSV_XS "csv"; > $hash_ref = csv ( > in => "path/to/file", > bom => 1, # Implies ->header call for BOM detection and header > key => "id", > fragment => "row=1-20", > # ... other key-value pairs for Text::CSV_XS > ); > > (where I must add the remark that in the *current* implementation, the > combination of both key and fragment attribute will cause the key > attribute to be ignored, which I consider a bug. Please file a ticket > if you agree :) > > Does this all make sense?

Hi, Sorry for the delay (holidays!). This makes sense to me, but I cannot get it to work. I had the following working code (working when no BOM is present: $obj = Text::CSV::Hashify->new( { file => $csv, format => 'aoh' # array of hashes } ); return $obj->all; and changed it to: return csv( { in => $csv, detect_bom => 1, headers => "auto" } ); But I get the following error: INI - the header contains an empty field at <somehwere> This was triggered by the following code: my $ref = ref $hdrs ? # aoh do { my @h = $csv->column_names ($hdrs); my %h; $h{$_}++ for @h; exists $h{""} and croak ($csv->SetDiag (1012)); Here is a smaple csv that will trigger the error: Header1,Header2,Header3 val1,,val3 I tried to see what's going on but I guess I don't really understand the code. -Guido

Fri Aug 09 03:06:24 2019 guido [...] gvr.org - Correspondence added

Subject:	Re: [rt.cpan.org #130048] Text::CSV::Hashify feature request
Date:	Fri, 9 Aug 2019 09:06:15 +0200
To:	"h.m.brand [...] xs4all.nl via RT" <bug-Text-CSV-Hashify [...] rt.cpan.org>
From:	Guido van Rooij <guido [...] gvr.org>

On Thu, Aug 08, 2019 at 01:17:21PM +0200, Guido van Rooij wrote: Show quoted text

> > I tried to see what's going on but I guess I don't really understand the code. >

The problem seems to be the fact that when the bom_detect option is set, the following code is executed: @row1 = $csv->header ($fh, \%harg); my @hdr = $csv->column_names; @hdr and $hdrs ||= \@hdr; so in this piece of code, the header is read in. Yet in this code: elsif ($hdrs eq "auto") { my $h = $csv->getline ($fh) or return; $hdrs = [ map { $hdr{$_} || $_ } @$h ]; } anothre attempt is done to read in the headers. In this case, the getline call actually reads in the second line of the csv file. Note that this code: defined $c->{hd_d} and $harg{detect_bom} = $c->{hd_b}; is wrong and should read defined $c->{hd_b} and $harg{detect_bom} = $c->{hd_b}; -Guido