Subject: Spreadsheet::ParseExcel::Utility::xls2csv
Date: Mon, 1 Dec 2008 01:09:39 -0800 (PST)
To: bug-Spreadsheet-ParseExcel [...] rt.cpan.org
From: Fredrik Linde <blueboy.geo [...] yahoo.com>
Hi!
I'm using Spreadsheet::ParseExcel::Utility::xls2csv together with the Text::CSV_XS getline function. I have discovered that the xls2csv implementation only dumps the raw content of each cell between two commas, without quoting or escaping it. It would be more appropriate to follow a CSV grammar, so that the extracted data can be used with other modules such as Text::CSV_XS.
I have quoted the two sources I used for my implementation. I have not looked into the "header = name *(COMMA name)" rule, but I think it will work sufficiently well.
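To illustrate the difference, here is a minimal sketch of the quoting rule in Python. It is not the Perl change itself, and the names `quote_field` and `to_csv_line` are made up for this example. It assumes the cell values are already strings.

```python
def quote_field(value: str) -> str:
    """Quote a field only if it contains a comma, a double quote, or a line break."""
    if any(ch in value for ch in (",", '"', "\r", "\n")):
        # Escape embedded double quotes by doubling them, then wrap the field.
        return '"' + value.replace('"', '""') + '"'
    return value


def to_csv_line(cells):
    """Join string cell values into one CSV record (without a line terminator)."""
    return ",".join(quote_field(c) for c in cells)


cells = ["Linde, Fredrik", 'say "hi"', "plain"]

# Dumping raw cell content between commas: a reader would see four fields.
naive = ",".join(cells)

# Quoting special characters: a reader recovers the original three fields.
quoted = to_csv_line(cells)

print(naive)
print(quoted)
```

With the naive join, the comma inside the first cell is indistinguishable from a field separator, which is the kind of output a strict CSV reader cannot parse back into the original cells.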
Best Regards
/Fredrik Linde
"CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. However, if a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
* Each record is one line terminated by a line feed (ASCII/LF=0x0A) or a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A), however, line-breaks can be embedded.
* Fields are separated by commas.
* Allowable characters within a CSV field include 0x09 (tab) and the inclusive range of 0x20 (space) through 0x7E (tilde). In binary mode all characters are accepted, at least in quoted fields.
* A field within CSV must be surrounded by double-quotes to contain a the separator character (comma)." -http://search.cpan.org/~hmbrand/Text-CSV_XS-0.58/CSV_XS.pm
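Read literally, the third bullet above describes a character-class check. A minimal Python sketch of that reading follows. `is_allowed_non_binary` is a hypothetical helper name, and this is only an interpretation of the quoted text, not how Text::CSV_XS is implemented.

```python
def is_allowed_non_binary(field: str) -> bool:
    """True if every character is a tab (0x09) or lies in 0x20..0x7E inclusive."""
    return all(ch == "\t" or 0x20 <= ord(ch) <= 0x7E for ch in field)


print(is_allowed_non_binary("plain text\twith tab"))  # tab and printable ASCII only
print(is_allowed_non_binary("caf\u00e9"))             # contains U+00E9, outside the range
```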
"2. Definition of the CSV Format
While there are various specifications and implementations for the
CSV format (for ex. [4], [5], [6] and [7]), there is no formal
specification in existence, which allows for a wide variety of
interpretations of CSV files. This section documents the format that
seems to be followed by most implementations:
1. Each record is located on a separate line, delimited by a line
break (CRLF). For example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
2. The last record in the file may or may not have an ending line
break. For example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx
3. There maybe an optional header line appearing as the first line
of the file with the same format as normal record lines. This
header will contain names corresponding to the fields in the file
and should contain the same number of fields as the records in
the rest of the file (the presence or absence of the header line
should be indicated via the optional "header" parameter of this
MIME type). For example:
field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
4. Within the header and each record, there may be one or more
fields, separated by commas. Each line should contain the same
number of fields throughout the file. Spaces are considered part
of a field and should not be ignored. The last field in the
record must not be followed by a comma. For example:
aaa,bbb,ccc
5. Each field may or may not be enclosed in double quotes (however
some programs, such as Microsoft Excel, do not use double quotes
at all). If fields are not enclosed with double quotes, then
double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
6. Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
" -http://tools.ietf.org/html/rfc4180#section-2
Message body is not shown because sender requested not to inline it.
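As a quick check of rules 4, 6 and 7 from the RFC excerpt, the RFC's own example records can be fed to an existing CSV reader. The sketch below uses Python's standard `csv` module purely for illustration. It is not part of Spreadsheet::ParseExcel or Text::CSV_XS, and the variable names are assumptions made for this example.

```python
import csv
import io

# Three records terminated by CRLF, taken from the RFC examples:
# a plain record, one with an embedded line break (rule 6),
# and one with an escaped double quote (rule 7).
data = (
    'aaa,bbb,ccc\r\n'
    '"aaa","b\r\nbb","ccc"\r\n'
    '"aaa","b""bb","ccc"\r\n'
)

# newline="" keeps the CRLF inside the quoted field untouched.
rows = list(csv.reader(io.StringIO(data, newline="")))

# Rule 4: every record should have the same number of fields.
field_counts = {len(row) for row in rows}

for row in rows:
    print(row)
print("consistent field count:", len(field_counts) == 1)
```

Output produced by a converter such as xls2csv would need to follow the same quoting rules for a reader like this to recover the original cell values.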