This queue is for tickets about the Spreadsheet-ParseExcel CPAN distribution.
Maintainer(s)' notes
If you are reporting a bug in Spreadsheet::ParseExcel, here are some pointers:
1) State the issues as clearly and as concisely as possible. A simple program or Excel test file (see below) will often explain the issue better than a lot of text.
2) Provide information on your system, version of perl and module versions. The following program will generate everything that is required. Put this information in your bug report.
#!/usr/bin/perl -w
print "\n Perl version : $]";
print "\n OS name : $^O";
print "\n Module versions: (not all are required)\n";
my @modules = qw(
    Spreadsheet::ParseExcel
    Scalar::Util
    Unicode::Map
    Spreadsheet::WriteExcel
    Parse::RecDescent
    File::Temp
    OLE::Storage_Lite
    IO::Stringy
);
for my $module (@modules) {
    my $version;
    eval "require $module";
    if (not $@) {
        $version = $module->VERSION;
        $version = '(unknown)' if not defined $version;
    }
    else {
        $version = '(not installed)';
    }
    printf "%21s%-24s\t%s\n", "", $module, $version;
}
__END__
3) Upgrade to the latest version of Spreadsheet::ParseExcel (or at least test on a system with an upgraded version). The issue you are reporting may already have been fixed.
4) Create a small example program that demonstrates your problem. The program should be as small as possible. A few lines of code are worth tens of lines of text when trying to describe a bug.
5) Supply an Excel file that demonstrates the problem. This is very important. If the file is big, or contains confidential information, try to reduce it down to the smallest Excel file that represents the issue. If you don't wish to post a file here then send it to me directly: jmcnamara@cpan.org
6) Say if the test file was created by Excel, OpenOffice, Gnumeric or something else. Say which version of that application you used.
7) If you are submitting a patch you should check with the maintainer whether the issue has already been patched or if a fix is in the works. Patches should be accompanied by test cases.
Asking a question
If you would like to ask a more general question there is the Spreadsheet::ParseExcel Google Group.
Owner: Nobody in particular
Requestors: ovid [...] cpan.org
Cc:
AdminCc:
Severity: Normal
Broken in: (no value)
Fixed in: (no value)
Thu Dec 06 04:19:36 2012
ovid [...] cpan.org - Ticket created
Problem:
$cell->value() correctly handles UTF-8 data but $cell->unformatted()
does not.
Steps to reproduce:
1. Create a spreadsheet and in cell A1 enter the following text: "мой
первый медиаплана" (without the quotes). Save it as utf8.xls
2. Read this spreadsheet with the following program:
use 5.10.0;
use warnings;
binmode STDOUT, ':encoding(UTF-8)'; # or use utf8::all
use Spreadsheet::ParseExcel;
my $workbook = Spreadsheet::ParseExcel->new->parse('utf8.xls');
my @worksheets = $workbook->worksheets;
my $cell = $worksheets[0]->get_cell( 0, 0 );
say "Value = ", $cell->value();
say "Unformatted = ", $cell->unformatted();
The output on my machine is as follows:
Value = мой первый медиаплана
Unformatted = <>9 ?5@2K9 <5480?;0=0
Extra information:
I have a workaround for this, but I've attached a test script and an
Excel file which demonstrates the problem. The Excel file was created
with LibreOffice Calc, but I've observed this behavior with spreadsheets
created with Microsoft Excel.
Also:
Perl version : 5.012002
OS name : linux
Module versions:
Spreadsheet::ParseExcel 0.59
Scalar::Util 1.23
Unicode::Map 0.112
Spreadsheet::WriteExcel 2.37
Parse::RecDescent 1.967006
File::Temp 0.22
OLE::Storage_Lite 0.19
IO::Stringy 2.110
Cheers,
Ovid
use 5.10.0;
use warnings;
binmode STDOUT, ':encoding(UTF-8)'; # or use utf8::all
use Spreadsheet::ParseExcel;
my $workbook = Spreadsheet::ParseExcel->new->parse('utf8.xls');
my @worksheets = $workbook->worksheets;
my $cell = $worksheets[0]->get_cell( 0, 0 );
say "Value = ", $cell->value();
say "Unformatted = ", $cell->unformatted();
say "Perl version : $]";
say "OS name : $^O";
say "Module versions: (not all are required)\n";
my @modules = qw(
Spreadsheet::ParseExcel
Scalar::Util
Unicode::Map
Spreadsheet::WriteExcel
Parse::RecDescent
File::Temp
OLE::Storage_Lite
IO::Stringy
);
for my $module (@modules) {
my $version;
eval "require $module";
if ( not $@ ) {
$version = $module->VERSION;
$version = '(unknown)' if not defined $version;
}
else {
$version = '(not installed)';
}
printf "%21s%-24s\t%s\n", "", $module, $version;
}
Thu Dec 06 05:01:34 2012
jmcnamara [...] cpan.org - Correspondence added
On Thu Dec 06 04:19:36 2012, OVID wrote:
> $cell->value() correctly handles UTF-8 data but $cell->unformatted()
> does not.
Hi Ovid,
Thanks for the detailed bug report.
This is expected behaviour (although clearly you didn't expect it).
The unformatted function returns the raw data stored in Excel. It is
used 99% of the time to get unformatted numeric data but for strings it
returns the raw byte stream. In your case that is most likely UTF-16LE
but there are also some other, rarer, far-east encodings that the
original author was interested in.
I should probably update the docs on the unformatted method to explain
the behaviour with strings.
If I've missed the issue here or if you have any other issues let me
know.
Regards,
John.
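The raw-bytes behaviour described above can be illustrated with a small standalone sketch. It does not call Spreadsheet::ParseExcel at all: it builds UTF-16 bytes for the first word of the test string with the core Encode module, shows why the low bytes print as "<>9", and decodes them back to characters. The helper name decode_raw_utf16 is hypothetical, and the byte order is an assumption taken from the reply above; if 'UTF-16LE' produces garbage on real data, 'UTF-16BE' is the other candidate.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of Spreadsheet::ParseExcel): turn raw
# UTF-16 bytes into a Perl character string.
# ASSUMPTION: the default byte order is little-endian, as suggested in the
# reply above. Pass 'UTF-16BE' as the second argument to try the other order.
sub decode_raw_utf16 {
    my ( $bytes, $encoding ) = @_;
    $encoding ||= 'UTF-16LE';
    return decode( $encoding, $bytes );
}

# Simulate the raw bytes for the first word of the test string, "мой"
# (U+043C U+043E U+0439). Each character becomes two bytes.
my $raw = encode( 'UTF-16LE', "\x{43C}\x{43E}\x{439}" );

# Dropping the 0x04 high bytes leaves the printable low bytes, which is the
# kind of output shown for unformatted() in the report.
my $low_bytes = join '', grep { $_ ne "\x04" } split //, $raw;

# Decoding the same bytes recovers the original characters.
my $text = decode_raw_utf16($raw);
```

In a real script the bytes would come from $cell->unformatted() for a text cell, and the decoded string would be printed to a handle with an ':encoding(UTF-8)' layer, as in the test program above.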
Thu Dec 06 05:01:36 2012
The RT System itself - Status changed from 'new' to 'open'
Mon Feb 11 06:33:03 2013
EDAVIS [...] cpan.org - Correspondence added
This isn't really expected behaviour from the documentation, which says
(in Spreadsheet::ParseExcel::Cell)
In general Spreadsheet::ParseExcel will return all character strings
in UTF-8 regardless of the encoding used by Excel.
Then the documentation for unformatted() says only that it "returns the
cell value without a numeric format".
If it is really intended that unformatted() should return raw bytes, it
would be better to call it unformatted_bytes() or something like that.
It would also be useful to have an unformatted_chars() method which
does what the documentation currently says: return the value of the cell
without numeric formatting applied, as a character string in UTF-8.
Thu Mar 06 12:10:36 2014
DOUGW [...] cpan.org - Correspondence added
On Mon Feb 11 06:33:03 2013, EDAVIS wrote:
> This isn't really expected behaviour from the documentation, which says
> (in Spreadsheet::ParseExcel::Cell)
>
> In general Spreadsheet::ParseExcel will return all character strings
> in UTF-8 regardless of the encoding used by Excel.
>
> Then the documentation for unformatted() says only that it "returns the
> cell value without a numeric format".
>
> If it is really intended that unformatted() should return raw bytes, it
> would be better to call it unformatted_bytes() or something like that.
> It would also be useful to have an unformatted_chars() method which
> does what the documentation currently says: return the value of the cell
> without numeric formatting applied, as a character string in UTF-8.
My $0.02 is that yes, unformatted() should have been called something else, perhaps unencoded() or raw() (and maybe I'll make an alias to that effect). But the original author probably thought of encoding as part of formatting: after all, the routine that converts raw bytes to encoded characters is called TextFmt, and it didn't even handle Unicode correctly until recently.
I, like many others, use this module just to scrape data, so I understand the hassle of having to go to value() for the text (although I often go to unformatted() for everything since I don't get much Unicode), to unformatted() for the numbers, and to ExcelFmt() for numbers that are dates (or relying on whatever unpredictable format you get from value()). It would be nice to have one method that gives you the encoded text, the unformatted number, and a date in a standard date format (e.g. YYYY-MM-DD HH:MM:SS.FFF, and maybe just YYYY-MM-DD for numbers without a decimal part). Since distinguishing between a number and a date is somewhat of a guess, we can expect to get that wrong in some corner case, but I think it should be okay most of the time. I propose a new cell method, data(), for this, leaving everything else as is but improving the documentation as to what value() and unformatted() mean.
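A minimal sketch of what such a combined accessor might look like, written as plain functions so it runs without the module installed. No data() method exists in the ticket as written; cell_data and excel_serial_to_iso are hypothetical names. The sketch assumes the cell type strings 'Text', 'Numeric' and 'Date' as returned by $cell->type(), the 1900 date system, and serial numbers above 60. It rounds to whole seconds rather than producing the .FFF part.

```perl
use strict;
use warnings;
use POSIX qw(floor strftime);

# Hypothetical helper: convert an Excel serial date to an ISO-style string.
# ASSUMPTIONS: 1900 date system, serial > 60, result expressed in UTC.
sub excel_serial_to_iso {
    my ($serial) = @_;
    my $days     = floor($serial);
    my $seconds  = int( ( $serial - $days ) * 86400 + 0.5 );

    # Serial 25569 corresponds to 1970-01-01, the Unix epoch.
    my $epoch = ( $days - 25569 ) * 86400 + $seconds;

    # Date only when there is no time-of-day part.
    my $format = $seconds ? '%Y-%m-%d %H:%M:%S' : '%Y-%m-%d';
    return strftime( $format, gmtime($epoch) );
}

# Hypothetical stand-in for the proposed data() method.
# $type        : what $cell->type() would return ('Text', 'Numeric', 'Date')
# $value       : what $cell->value() would return
# $unformatted : what $cell->unformatted() would return
sub cell_data {
    my ( $type, $value, $unformatted ) = @_;
    return $value                            if $type eq 'Text';
    return excel_serial_to_iso($unformatted) if $type eq 'Date';
    return $unformatted;    # plain numbers pass through untouched
}
```

With a real cell the call would be along the lines of cell_data( $cell->type(), $cell->value(), $cell->unformatted() ). As the comment above notes, the number-versus-date decision is a guess made by the parser, so the 'Date' branch inherits any misclassification.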