On Tue, 16 Jul 2019 13:57:27 -0400, James E Keenan <jkeenan@pobox.com>
wrote:
> I am cc-ing the maintainers of Text::CSV and Text::CSV::Hashify for
> their thoughts on this.
$csv->header (...) is indeed a method that combines several "features"
into a single call, because separating them would only complicate the
tasks at hand.
In my daily work, most CSV files have a "header" row, which defines
the column names. This row usually has the same number of fields as
the rest of the data and "most of the time" (TM) these fields do not
have newlines. I hope no one in a sane mindset will ever try to add
a header line where the fields have embedded newlines, but we will
probably meet someone who does, just to make the life of others more
difficult (or it is a manager who thinks that a full description of
the column makes more sense, but those people should be forbidden to
touch a keyboard).
Though Unicode permits Byte Order Marks halfway through a stream, it is
very uncommon. Where one could see it is if two streams that both have
a BOM are concatenated, but for each stream, a BOM has to be the first
byte sequence of the stream, which coincidentally is nice to deal
with on the first (read: header) line.
So dealing with the BOM, which *also* has impact on the content of the
fields in the first row, together with reading this line as header
*and* (auto)detecting the line ending and (optionally) field separator
makes all the sense in the world (at least to me). Note that reading
the first row in UTF-16 or UTF-32 and *not* looking at the BOM may
result in bytes running over to the next line/record, invalidating it.
I hope I chose the defaults to be the most DWIM:
• BOM detection: true
• Set column names for hash treatment
• Map column names to lower case
The last one is based on most of (my) CSV files being table exports
or selections from databases, and ANSI SQL specifies that field names
are case insensitive by default. This makes the use of Oracle exports
on PostgreSQL imports easier. Each of those options can be overruled
and/or abbreviated.
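As a minimal sketch of what those defaults amount to, the call below spells them out explicitly. The sample data and the in-memory file handle are assumptions for illustration only. The option names are the documented long forms of the three defaults listed above.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample data, read through an in-memory file handle
my $data = "Id,Name\n1,Alice\n2,Bob\n";
open my $fh, "<", \$data or die "Cannot open in-memory handle: $!";

my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });

# The three defaults written out; each of them can be overruled here
my @hdr = $csv->header ($fh, {
    detect_bom         => 1,    # look for a BOM on the first line
    set_column_names   => 1,    # use the header for hash treatment
    munge_column_names => "lc", # map column names to lower case
    });

# getline_hr now uses the lower-cased header fields as hash keys
my @rows;
while (my $row = $csv->getline_hr ($fh)) {
    push @rows, $row;
    }
close $fh;
```

Passing munge_column_names => "none" instead would keep the column names as they appear in the file.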
I personally think the ->header method is now stable and unlikely to
change. *If* new options are added, those will have to be backward
compatible. The options that are currently in have been there since
the first commit. The only thing that visibly changed is the addition
of some abbreviations, specifically "munge" for "munge_column_names".
I *do* see your point in ->header not honoring the setting of sep or
sep_char in ->new, but if you know that the new call had this attribute
defined, you could force it to be used in ->header:
    my $sep = "\t";
    my $csv = Text::CSV_XS->new ({ sep_char => $sep });
    my @hdr = $csv->header ($fh, [ $sep ], { munge => "lc" });
As $sep is now the only separator allowed to be detected, it cannot
be (re)set to anything else.
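A self-contained sketch of that approach follows. The tab-separated sample data is made up for illustration, and it deliberately has a comma inside a header field, which is the case where the default separator detection would otherwise pick the comma.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical tab-separated data with a comma inside a header field
my $data = "Id\tLast, First\n1\tSmith, John\n";
open my $fh, "<", \$data or die "Cannot open in-memory handle: $!";

my $sep = "\t";
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, sep_char => $sep });

# Only $sep is offered for detection, so the comma is never considered
my @hdr = $csv->header ($fh, [ $sep ], { munge_column_names => "lc" });

# The data row is split on the same separator
my $row = $csv->getline_hr ($fh);
close $fh;
```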
Most of what I read in the Text::CSV::Hashify manual is relatively
easy to do in Text::CSV_XS (and Text::CSV). I'd use the csv function:
    use Text::CSV::Hashify;
    $obj = Text::CSV::Hashify->new ({
        file     => "/path/to/file.csv",
        format   => "hoh", # hash of hashes, which is default
        key      => "id",  # needed except when format is 'aoh'
        max_rows => 20,    # number of records to read; defaults to all
        # ... other key-value pairs as appropriate from Text::CSV
        });
    $hash_ref = $obj->all;

=>

    use Text::CSV_XS "csv";
    $hash_ref = csv (
        in       => "path/to/file",
        bom      => 1, # Implies ->header call for BOM detection and header
        key      => "id",
        fragment => "row=1-20",
        # ... other key-value pairs for Text::CSV_XS
        );
(where I must add the remark that in the *current* implementation, the
combination of the key and fragment attributes will cause the key
attribute to be ignored, which I consider a bug. Please file a ticket
if you agree :)
Does this all make sense?
--
H.Merijn Brand
http://tux.nl Perl Monger
http://amsterdam.pm.org/
using perl5.00307 .. 5.29 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/