Bug #81737 for Spreadsheet-ParseExcel: $cell->unformatted() does not handle UTF-8 correctly

Thu Dec 06 04:19:36 2012 ovid [...] cpan.org - Ticket created

Subject:

$cell->unformatted() does not handle UTF-8 correctly

Problem: $cell->value() correctly handles UTF-8 data but $cell->unformatted() does not.

Steps to reproduce:

1. Create a spreadsheet and in cell A1 enter the following text: "мой первый медиаплана" (without the quotes). Save it as utf8.xls

2. Read this spreadsheet with the following program:

    use 5.10.0;
    use warnings;
    binmode STDOUT, ':encoding(UTF-8)';    # or use utf8::all
    use Spreadsheet::ParseExcel;

    my $workbook   = Spreadsheet::ParseExcel->new->parse('utf8.xls');
    my @worksheets = $workbook->worksheets;
    my $cell       = $worksheets[0]->get_cell( 0, 0 );
    say "Value = ",       $cell->value();
    say "Unformatted = ", $cell->unformatted();

The output on my machine is as follows:

    Value = мой первый медиаплана
    Unformatted = <>9 ?5@2K9 <5480?;0=0

Extra information: I have a workaround for this, but I've attached a test script and an Excel file which demonstrates the problem. The Excel file was created with LibreOffice Calc, but I've observed this behavior with spreadsheets created with Microsoft Excel.

Also:

    Perl version : 5.012002
    OS name : linux
    Module versions:
        Spreadsheet::ParseExcel  0.59
        Scalar::Util             1.23
        Unicode::Map             0.112
        Spreadsheet::WriteExcel  2.37
        Parse::RecDescent        1.967006
        File::Temp               0.22
        OLE::Storage_Lite        0.19
        IO::Stringy              2.110

Cheers,
Ovid

Subject:

xls.pl

    use 5.10.0;
    use warnings;
    binmode STDOUT, ':encoding(UTF-8)';    # or use utf8::all
    use Spreadsheet::ParseExcel;

    # Read cell A1 of the first worksheet and print both views of it.
    my $workbook   = Spreadsheet::ParseExcel->new->parse('utf8.xls');
    my @worksheets = $workbook->worksheets;
    my $cell       = $worksheets[0]->get_cell( 0, 0 );
    say "Value = ",       $cell->value();
    say "Unformatted = ", $cell->unformatted();

    # Report the environment.
    say "Perl version : $]";
    say "OS name : $^O";
    say "Module versions: (not all are required)\n";

    my @modules = qw(
        Spreadsheet::ParseExcel
        Scalar::Util
        Unicode::Map
        Spreadsheet::WriteExcel
        Parse::RecDescent
        File::Temp
        OLE::Storage_Lite
        IO::Stringy
    );

    for my $module (@modules) {
        my $version;
        eval "require $module";
        if ( not $@ ) {
            $version = $module->VERSION;
            $version = '(unknown)' if not defined $version;
        }
        else {
            $version = '(not installed)';
        }
        printf "%21s%-24s\t%s\n", "", $module, $version;
    }

Subject:

utf8.xls

Download utf8.xls
application/vnd.ms-excel 5.5k

Thu Dec 06 05:01:34 2012 jmcnamara [...] cpan.org - Correspondence added

On Thu Dec 06 04:19:36 2012, OVID wrote:

> $cell->value() correctly handles UTF-8 data but $cell->unformatted()
> does not.

Hi Ovid,

Thanks for the detailed bug report.

This is expected behaviour (although clearly you didn't expect it). The unformatted function returns the raw data stored in Excel. It is used 99% of the time to get unformatted numeric data but for strings it returns the raw byte stream. In your case that is most likely UTF-16LE but there are also some other, rarer, far-east encodings that the original author was interested in.

I should probably update the docs on the unformatted method to explain the behaviour with strings.

If I've missed the issue here or if you have any other issues let me know.

Regards,

John.
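The following is a minimal sketch of why the reported output looks garbled, assuming the raw bytes really are UTF-16LE as suggested above (the byte order is an assumption and may differ in practice). It does not call Spreadsheet::ParseExcel at all; `$raw` is a stand-in built with the core Encode module for what unformatted() might return for the first word of the test string.

```perl
use strict;
use warnings;
use utf8;                        # this source contains UTF-8 string literals
use Encode qw(encode decode);

# Stand-in for the raw bytes of a string cell (assumed UTF-16LE).
my $text = "мой";
my $raw  = encode( 'UTF-16LE', $text );

# Each Cyrillic character is stored as two bytes: a printable low byte
# followed by 0x04, which terminals usually do not display.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $raw;

# Decoding the bytes recovers the character string.
my $decoded = decode( 'UTF-16LE', $raw );

binmode STDOUT, ':encoding(UTF-8)';
print "raw bytes: $hex\n";        # raw bytes: 3c 04 3e 04 39 04
print "decoded  : $decoded\n";    # decoded  : мой
```

The low bytes 0x3c, 0x3e and 0x39 are the ASCII characters "<", ">" and "9", which matches the start of the "Unformatted" line in the report.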

Thu Dec 06 05:01:36 2012 The RT System itself - Status changed from 'new' to 'open'

Mon Feb 11 06:33:03 2013 EDAVIS [...] cpan.org - Correspondence added

This isn't really expected behaviour from the documentation, which says (in Spreadsheet::ParseExcel::Cell)

    In general Spreadsheet::ParseExcel will return all character strings in UTF-8 regardless of the encoding used by Excel.

Then the documentation for unformatted() says only that it "returns the cell value without a numeric format".

If it is really intended that unformatted() should return raw bytes, it would be better to call it unformatted_bytes() or something like that. It would also be useful to have an unformatted_chars() method which does what the documentation currently says: return the value of the cell without numeric formatting applied, as a character string in UTF-8.

Thu Mar 06 12:10:36 2014 DOUGW [...] cpan.org - Correspondence added

On Mon Feb 11 06:33:03 2013, EDAVIS wrote:

> This isn't really expected behaviour from the documentation, which says
> (in Spreadsheet::ParseExcel::Cell)
>
> In general Spreadsheet::ParseExcel will return all character strings
> in UTF-8 regardless of the encoding used by Excel.
>
> Then the documentation for unformatted() says only that it "returns the
> cell value without a numeric format".
>
> If it is really intended that unformatted() should return raw bytes, it
> would be better to call it unformatted_bytes() or something like that.
> It would also be useful to have an unformatted_chars() method which
> does what the documentation currently says: return the value of the cell
> without numeric formatting applied, as a character string in UTF-8.

My $0.02 is that yes, unformatted() should have been called something else, perhaps unencoded() or raw() (and maybe I'll make an alias to that effect), but the original author probably thought of encoding as part of formatting (after all, the routine that does the conversion from raw bytes to encoded characters is called TextFmt, and it didn't even handle Unicode correctly until recently).

I, like many others, use this module just to scrape data, so I understand the hassle of having to go to value() for the text (although I often go to unformatted() for everything since I don't get much Unicode), to unformatted() for the numbers, and to ExcelFmt() for numbers that are dates (or depending on the unpredictable format you get from value()). It would be nice to have one method that gives you the encoded text, the unformatted number, and a date in a standard date format (e.g. YYYY-MM-DD HH:MM:SS.FFF, and maybe just YYYY-MM-DD for numbers w/o a decimal part). Since distinguishing between number and date is somewhat of a guess, we can expect to get that wrong in some corner case, but I think it should be okay most of the time.

I propose a new cell method data() for this, leaving everything else as is, but improving the documentation as to what value() and unformatted() mean.
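As a rough illustration of the data() idea, here is a standalone sketch. Everything in it is hypothetical: `cell_data` and its `$kind` argument are invented for this example and are not part of Spreadsheet::ParseExcel. It assumes text arrives as raw UTF-16LE bytes and that dates are serials in Excel's 1900 date system. It also sidesteps the number-versus-date guess by making the caller say which kind it is, and it omits the fractional-second (.FFF) part.

```perl
use strict;
use warnings;
use Encode qw(decode);
use POSIX  qw(strftime floor);

# Hypothetical helper, not a Spreadsheet::ParseExcel method.
# $kind is 'text', 'number' or 'date'; the caller decides which applies.
sub cell_data {
    my ( $kind, $raw ) = @_;

    # Text: assume raw UTF-16LE bytes and return a character string.
    return decode( 'UTF-16LE', $raw ) if $kind eq 'text';

    # Number: return the stored value untouched.
    return $raw if $kind eq 'number';

    # Date: treat $raw as a serial in Excel's 1900 date system, where
    # serial 25569 is 1970-01-01 (holds for dates from March 1900 on).
    my $seconds = floor( ( $raw - 25569 ) * 86400 + 0.5 );
    my $format  = $raw == int($raw) ? '%Y-%m-%d' : '%Y-%m-%d %H:%M:%S';
    return strftime( $format, gmtime($seconds) );
}

my $date     = cell_data( date   => 41249 );      # 2012-12-06
my $datetime = cell_data( date   => 41249.5 );    # 2012-12-06 12:00:00
my $number   = cell_data( number => 3.14 );       # 3.14
print "$date\n$datetime\n$number\n";
```

A real implementation would have to infer the kind from the cell's type and format, which is where the corner cases mentioned above would appear.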
