Subject: Windows Bug When Using PerlIO::gzip To Read "Large" Text File
Date: Sun, 6 Aug 2017 06:55:04 +0000 (UTC)
To: "bug-PerlIO-gzip [...] rt.cpan.org" <bug-PerlIO-gzip [...] rt.cpan.org>
From: Owen Leibman <eclipsechasers2 [...] yahoo.com>

On Windows, the gzip layer appears to break down
when reading "large" gzipped text files,
where "large" is not particularly large: the test file below is about 600 KB uncompressed, nowhere near 2GB.
The problem being reported does not occur on Unix-like systems, including Cygwin.
My system is Windows 7 Professional 64-bit.
My Perl distribution is Strawberry Perl, Perl version 5.20.2.
Gzip from GnuWin32 is in my Windows path.
PerlIO::gzip is version 0.20.
Here is a program that creates a text file which demonstrates the problem:
use strict;
use warnings;
use Carp;
use English qw(-no_match_vars);
use Readonly;

Readonly::Scalar my $ROWS => 1_000;
Readonly::Scalar my $COLS => 50;
Readonly::Scalar my $RANDMAX => 1_500_000_000;
Readonly::Scalar my $RANDSUB => 500_000_000;
Readonly::Scalar my $OUTFILE => 'perlio.csv';

# Write in raw mode so the line endings are exactly what is printed.
open(my $filex, q{>:raw}, $OUTFILE) or croak "$ERRNO";
foreach my $row (1 .. $ROWS) {
    foreach my $col (1 .. $COLS) {
        # Each field is an 11-character number plus a trailing comma.
        print {$filex} sprintf(q{%11d,}, rand($RANDMAX) - $RANDSUB) or croak "$ERRNO";
    }
    # Explicit CRLF terminator; print, not printf, since there is no format.
    print {$filex} "\r\n" or croak "$ERRNO";
}
close $filex or croak "$ERRNO";
print "Created $OUTFILE\n" or croak "$ERRNO";
I included "raw" and "\r" just to rule out a line-ending problem.
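For anyone reproducing this without a gzip binary on the path, the compression of perlio.csv can probably be done from Perl instead. This is only a sketch: it assumes the IO::Compress::Gzip module is available (it ships with recent Perls), and the helper name gzip_file is made up for illustration. It is not the command used for this report, which was the GnuWin32 gzip.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical helper: compress $in to $out and return the output path.
sub gzip_file {
    my ($in, $out) = @_;
    gzip($in => $out) or die "gzip failed: $GzipError\n";
    return $out;
}

# Only act when a file name is given on the command line,
# e.g. "perl gzipit.pl perlio.csv" writes perlio.csv.gz.
gzip_file($ARGV[0], "$ARGV[0].gz") if @ARGV;
```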
Once the file is created, I gzip it, and then run the following:
use strict;
use warnings;
use Carp;
use English qw(-no_match_vars);
use Readonly;
use PerlIO::gzip;

Readonly::Scalar my $INFILE => 'perlio.csv.gz';

sub processfile {
    my ($infile) = @_;
    my $recsread = 0;
    open(my $filex, q{<:gzip}, $infile) or croak "$ERRNO";
    while (my $rec = <$filex>) {
        $recsread += 1;
        print "rec $recsread size=", length($rec), "\n" or croak "$ERRNO";
    }
    print 'Records read ', $recsread, "\n" or croak "$ERRNO";
    close $filex or croak "$ERRNO";
    return;
}

processfile($INFILE);
On Cygwin and Linux, this correctly shows 1,000 lines, each 602 bytes long (50 fields of 12 characters plus the 2-byte "\r\n").
On Windows, this shows 808 lines of various lengths (and fails on the close).
The first 79 lines show the expected length; line 80 shows 1216,
line 81 shows 3479, and the rest show values both higher and lower than expected.
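
To help narrow down whether the fault lies in the :gzip layer or in the compressed file itself, a cross-check along the following lines may be useful. It reads the same file without PerlIO::gzip and tallies the line lengths. This is only a sketch: it assumes IO::Uncompress::Gunzip is installed, and the helper name gunzip_line_stats is made up for illustration.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: read a gzipped text file line by line WITHOUT the
# :gzip layer. Returns (number of lines, hashref of line length => count).
sub gunzip_line_stats {
    my ($path) = @_;
    my $z = IO::Uncompress::Gunzip->new($path)
        or die "cannot open $path: $GunzipError\n";
    my $count = 0;
    my %lengths;
    while (defined(my $rec = $z->getline())) {
        $count++;
        $lengths{ length $rec }++;
    }
    $z->close();
    return ($count, \%lengths);
}

# Only run when a file name is given, e.g. "perl check.pl perlio.csv.gz".
if (@ARGV) {
    my ($count, $lengths) = gunzip_line_stats($ARGV[0]);
    print "Records read $count\n";
    print "  length $_: $lengths->{$_} line(s)\n"
        for sort { $a <=> $b } keys %{$lengths};
}
```

If this reports 1,000 lines of 602 bytes on the same Windows machine, the compressed file is presumably intact and the discrepancy is specific to reading through the layer.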