Skip Menu |

This queue is for tickets about the Spreadsheet-ParseExcel-Stream CPAN distribution.

Report information
The Basics
Id: 71138
Status: resolved
Priority: 0/
Queue: Spreadsheet-ParseExcel-Stream

People
Owner: Nobody in particular
Requestors: SHYOKOU [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.01
Fixed in: 0.02



Subject: Cells alignment in row array may break if null presents
In order to reduce memory usage, the upstream Spreadsheet::ParseExcel parser calls the cell handler call-back only when the cell has something. Thus both leading and in-between nulls may interfer the alignment of cells in the outcome row array, if the positions of cells are ignored. An example following:   original sheet w/ null properly aligned array array w/o alignment   |A1|B1|C1|D1|E1|F1|G1|H1| {A1|B1|C1|D1|E1|F1|G1|H1} {A1|B1|C1|D1|E1|F1|G1|H1} |  |  |C2|D2|E2|F2|G2|H2| {||C2|D2|E2|F2|G2|H2} {C2|D2|E2|F2|G2|H2} |  |  |  |  |E3|F3|G3|H3| {||||E3|F3|G3|H3} {E3|F3|G3|H3} |  |  |  |  |  |  |G4|H4| {||||||G4|H4} {G4|H4}   |  |B1|C1|D1|E1|F1|G1|H1| {B1|C1|D1|E1|F1|G1|H1} {B1|C1|D1|E1|F1|G1|H1} |  |  |  |D2|E2|F2|G2|H2| {||D2|E2|F2|G2|H2} {D2|E2|F2|G2|H2} |  |  |  |  |  |F3|G3|H3| {||||F3|G3|H3} {F3|G3|H3} |  |  |  |  |  |  |  |H4| {||||||H4} {H4}   |  |B1|C1|D1|E1|F1|G1|H1| {|B1|C1|D1|E1|F1|G1|H1} {B1|C1|D1|E1|F1|G1|H1} |  |  |  |D2|E2|F2|G2|H2| {|||D2|E2|F2|G2|H2} {D2|E2|F2|G2|H2} |A3|B3|  |  |  |F3|G3|H3| {A3|B3||||F3|G3|H3} {A3|B3|F3|G3|H3} |A4|B4|C4|D4|  |  |  |H4| {A4|B4|C4|D4||||H4} {A4|B4|C4|D4|H4}   The attached diff file tries to reuse the position of cell to build the row array.
Subject: Stream.pm.diff
--- Stream.pm 2011-07-30 03:06:14.000000000 +0800 +++ Stream.pm 2011-07-30 03:06:14.000000000 +0800 @@ -92,13 +92,22 @@ my $f = $self->{SUB}; # Initialize row with first cell +=begin comment may_misplace my @row = ($curr_cell); +=end comment may_misplace +=cut + my @row = (); + $row[ $curr_cell->[3] - ( $curr_cell->[0] -> worksheet( $curr_cell->[1] ) -> col_range )[0] ] = $curr_cell; my $nxt_cell = $f->(); # Collect current row on current worksheet while ( $nxt_cell && $nxt_cell->[1] == $curr_cell->[1] && $nxt_cell->[2] == $curr_cell->[2] ) { $curr_cell = $nxt_cell; +=begin comment may_misplace push @row, $curr_cell; +=end comment may_misplace +=cut + $row[ $curr_cell->[3] - ( $curr_cell->[0] -> worksheet( $curr_cell->[1] ) -> col_range )[0] ] = $curr_cell; $nxt_cell = $f->(); } $self->{NEXT_CELL} = $nxt_cell; @@ -112,7 +121,11 @@ my $row = $self->next_row(); return unless $row; } +=begin comment may_mistreat return [ map { $_->[4]->value() } @{$self->{CURR_ROW}} ]; +=end comment may_mistreat +=cut + return [ map { defined $_ ? $_->[4]->value() : $_ } @{$self->{CURR_ROW}} ]; } sub row_unformatted { @@ -121,7 +134,11 @@ my $row = $self->next_row(); return unless $row; } +=begin comment may_mistreat return [ map { $_->[4]->unformatted() } @{$self->{CURR_ROW}} ]; +=end comment may_mistreat +=cut + return [ map { defined $_ ? $_->[4]->unformatted() : $_ } @{$self->{CURR_ROW}} ]; } 1;
On Thu Sep 22 05:24:22 2011, SHYOKOU wrote: Show quoted text
> The attached diff file tries to reuse the position of cell to build the > row array.
Tried your patch, but still get same results. Attached are scripts that create and then read a workbook. If you can supply a patch (and preferably a test similar to t/00-base.t in the distribution) that works, I'll apply it. I don't know how much time I'll have to work on this in the near future.
Subject: tst_nulls
Download tst_nulls
application/octet-stream 301b

Message body not shown because it is not plain text.

Subject: create_null_cells
Download create_null_cells
application/octet-stream 498b

Message body not shown because it is not plain text.

On Thu Sep 22 18:21:25 2011, DOUGW wrote: Show quoted text
> On Thu Sep 22 05:24:22 2011, SHYOKOU wrote: > Tried your patch, but still get same results. Attached are scripts that > create and then read a workbook.
My mistake...your patch seems to almost work, but output is: New sheet [][C][D] New sheet [A][][C][D] That first row is missing the "B".
On Thu Sep 22 18:26:40 2011, DOUGW wrote: Show quoted text
> My mistake...your patch seems to almost work, but output is:
... Show quoted text
> That first row is missing the "B".
The problem is that col_range() doesn't work, it always returns (0,-1) probably because, well, the worksheet is not yet parsed, so we don't yet know what the min/max column range is for the worksheet. So the patch can probably be reworked to not worry about what the minimum column is, and then we maybe have some leading null columns in every row.
Fixed. new version 0.02 on the way.
Excuse me for my against to "have some leading null columns in every row" ;-) I spent a while just to look into it, why I didn't but you saw "col_range() doesn't work", found only because it was native Excel versus Spreadsheet::WriteExcel, where whether Spreadsheet::ParseExcel::Worksheet->{MinCol,MaxCol} defined or not even before "the worksheet is not yet parsed" ... I would recommend a "dirty" and "delayed" version to "Initialize row with first cell" right after "my $nxt_cell = $f->();", where col_range () may work after parsing the worksheet; only when both $curr_cell and $nxt_cell are on different sheets, the "dirty" breaks out ;-) Anyway, I don't think it is a good idea to "have some leading null columns in every row" ... On Thu Sep 22 18:54:02 2011, DOUGW wrote: Show quoted text
> On Thu Sep 22 18:26:40 2011, DOUGW wrote:
> > My mistake...your patch seems to almost work, but output is:
> ...
> > That first row is missing the "B".
> > The problem is that col_range() doesn't work, it always returns (0,-1) > probably because, well, the worksheet is not yet parsed, so we don't
yet Show quoted text
> know what the min/max column range is for the worksheet. > > So the patch can probably be reworked to not worry about what the > minimum column is, and then we maybe have some leading null columns in > every row.
Here comes the diff to the v0.03 Stream.pm, w/ delayed row init right after "my $nxt_cell = $f->();", such that no more leading nulls in every row ... Another issue (dunno if a neo ticket 'cause as if sth in upstream Spreadsheet::ParseExcel->new matters) occurred in "my $tmp = my $handler = sub {" while compared side-by-side with the old "bycicle" Spreadsheet::DataFromExcel, where the latter passed no "CellHandler" but Spreadsheet::ParseExcel called previous "$handler" ... On Thu Sep 22 18:54:02 2011, DOUGW wrote: Show quoted text
> On Thu Sep 22 18:26:40 2011, DOUGW wrote:
> > My mistake...your patch seems to almost work, but output is:
> ...
> > That first row is missing the "B".
> > The problem is that col_range() doesn't work, it always returns (0,-1) > probably because, well, the worksheet is not yet parsed, so we don't
yet Show quoted text
> know what the min/max column range is for the worksheet. > > So the patch can probably be reworked to not worry about what the > minimum column is, and then we maybe have some leading null columns in > every row.
Subject: Stream.pm-0.03.diff
--- Stream.pm 2011-09-23 07:44:13.000000000 +0800 +++ Stream.pm 2011-09-23 07:44:13.000000000 +0800 @@ -18,6 +18,7 @@ my ($wb, $idx, $row, $col, $cell); my $tmp = my $handler = sub { ($wb, $idx, $row, $col, $cell) = @_; + defined $parser and ## avoid being trapped by other Spreadsheet::ParseExcel->new instance w/o CellHandler, per "$_CellHandler = $hParam{CellHandler} if ( $hParam{CellHandler} );" $parser->transfer($main); }; @@ -93,13 +94,21 @@ # Initialize row with first cell my @row = (); +=begin comment may_null_prefix $row[ $curr_cell->[3] ] = $curr_cell; +=end comment may_null_prefix +=cut my $nxt_cell = $f->(); + $row[ $curr_cell->[3] - ( $curr_cell->[0] -> worksheet( $curr_cell->[1] ) -> col_range )[0] ] = $curr_cell; # Collect current row on current worksheet while ( $nxt_cell && $nxt_cell->[1] == $curr_cell->[1] && $nxt_cell->[2] == $curr_cell->[2] ) { $curr_cell = $nxt_cell; +=begin comment may_null_prefix $row[ $curr_cell->[3] ] = $curr_cell; +=end comment may_null_prefix +=cut + $row[ $curr_cell->[3] - ( $curr_cell->[0] -> worksheet( $curr_cell->[1] ) -> col_range )[0] ] = $curr_cell; $nxt_cell = $f->(); } $self->{NEXT_CELL} = $nxt_cell;
On Fri Sep 23 03:30:51 2011, SHYOKOU wrote: Show quoted text
> Here comes the diff to the v0.03 Stream.pm, w/ delayed row init right > after "my $nxt_cell = $f->();", such that no more leading nulls in > every row ...
Actually, I would rather have leading nulls in every row than have it "sometimes" not align (when spreadsheet is created w/Spreadsheet::WriteExcel). It would also behave more like Spreadsheet::ParseExcel::Simple which makes no attempt to remove leading empty cells, and I'd like this module to be as much as possible identical to that module. I could, however, see the below behaviour implemented in an option, maybe adding a hashref to the end of the argument list something like ->new('filename.xls', { StripEmptyCells => 1 } ) or something like that. Show quoted text
> Another issue (dunno if a neo ticket 'cause as if sth in upstream > Spreadsheet::ParseExcel->new matters) occurred in "my $tmp = my > $handler = sub {" while compared side-by-side with the old "bycicle" > Spreadsheet::DataFromExcel, where the latter passed no "CellHandler" > but Spreadsheet::ParseExcel called previous "$handler" ...
I think that is a bug in Spreadsheet::ParseExcel. The CellHandler is a package global variable that does not get reset on new S::PE objects that don't have their own CellHandler (and you can't have two spreadsheet objects with their own cellhandlers either). It should be an object/instance attribute, not a global variable. I'd report that to S:PE's maintainer.
Added TrimEmpty option in 0.04