Subject: | truncated lines in bzreadline? |
Date: | Tue, 14 Aug 2018 17:08:24 +0900 |
To: | bug-Compress-Bzip2 [...] rt.cpan.org |
From: | Daron Standley <standley [...] biken.osaka-u.ac.jp> |
Hi, I have been playing around with Perl for a few hours and I am very
impressed with the speed of reading a huge bz2-compressed file. Just to give
some numbers:
Time required to read a space-delimited bz2 file with 1000 lines of
length 557780 characters (278890 integers (0-9) separated by white space):
python pd.read_csv(file, compression='bz2', header=0): 14 min
python subprocess('bunzip2 -c ' + file): 7 min
perl open('bunzip2 -c $file |'): 66 sec!!
Next, I tried the Compress::Bzip2 module. However, I noticed that
the bzreadline function returns only 4096 characters per line for these files.
For example, I get the following when using bunzip2:
my $cmd = "bunzip2 -c $fbz2 |";
open(FBZ, $cmd) or die "Cannot run '$cmd': $!\n";
while (<FBZ>) {
    my @line = split(/\s+/);
    printf("len %d\n", scalar(@line));
}
close(FBZ);
len 278890
len 278890
.
.
.
But when I use bzreadline as follows:
my $bz = bzopen($fbz, "rb")
    or die "Cannot open $fbz: $bzerrno\n";
while ($bz->bzreadline($_) > 0) {
    my @line = split(/\s+/);
    printf("len %d\n", scalar(@line));
}
$bz->bzclose();
I get
len 2048
len 2048
.
.
I am guessing there is a buffer size I can set somewhere, but I couldn't figure
this out by myself. If you have any clues, I would be grateful.
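In case it is useful for comparison, here is a sketch of reading whole lines without the external bunzip2 process. It assumes the core IO::Uncompress::Bunzip2 module can be used in place of Compress::Bzip2, so it neither explains nor fixes the 4096-character limit in bzreadline. The helper name count_fields_per_line and the small generated test file are made up for illustration.

```perl
use strict;
use warnings;
use IO::Compress::Bzip2     qw(bzip2 $Bzip2Error);   # only used to build test data
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
use File::Temp              qw(tempfile);

# Hypothetical helper: return the number of whitespace-separated
# fields found on each line of a bz2-compressed file.
sub count_fields_per_line {
    my ($file) = @_;
    my $z = IO::Uncompress::Bunzip2->new($file)
        or die "Cannot open $file: $Bunzip2Error\n";
    my @counts;
    # getline() returns one full line at a time, or undef at end of file.
    while (defined(my $line = $z->getline())) {
        my @fields = split ' ', $line;
        push @counts, scalar @fields;
    }
    $z->close();
    return @counts;
}

# Made-up test file: 3 lines of 5000 single-digit integers each, so every
# line is about 10000 characters long, well over 4096 characters.
my $text = join '', map { join(' ', map { $_ % 10 } 1 .. 5000) . "\n" } 1 .. 3;
my ($fh, $fbz) = tempfile(SUFFIX => '.bz2', UNLINK => 1);
close $fh;
bzip2(\$text => $fbz) or die "bzip2 failed: $Bzip2Error\n";

printf("len %d\n", $_) for count_fields_per_line($fbz);
```

If every line comes back with the full 5000 fields here, the truncation would seem to be specific to bzreadline rather than to the file itself.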
Thanks a lot
DMS