Subject: PerlIO::gzip and Parallel::ForkManager do not play nice together
Please note that all file reads below are done in the parent process. Although child processes are forked (Parallel::ForkManager uses fork, not threads), nothing is actually done in them.
I've tried the code below on a couple of different Linux boxes, and they all show data corruption while reading the index. The code works fine if you either uncomment the 'next' (disabling Parallel::ForkManager) or gunzip the index beforehand and drop the ':gzip' layer (disabling PerlIO::gzip). The number of concurrent children does not appear to matter. The corruption always starts at about the same line number for a given index, but at different line numbers for different indexes. Running the same test repeatedly sometimes yields byte-for-byte identical corruption and sometimes not. Two indexes (about 100K lines each) you can test with: <a href="https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-27/wat.paths.gz">2015-27</a> <a href="https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-48/wat.paths.gz">2015-48</a>
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
use PerlIO::gzip;

my $pm = Parallel::ForkManager->new(12);   # max concurrent children
open(my $in, '<:gzip', 'wat.paths.gz') or die "can't open index: $!";
while (my $file = <$in>) {
    # no legitimate line in these indexes is this long, so any hit is corruption
    print length($file) . ":$file\n" if length($file) > 142;
    #next;               # uncomment to disable forking
    next if $pm->start;  # parent keeps reading; child falls through
    $pm->finish;         # child exits immediately without doing any work
}
close $in;
$pm->wait_all_children;
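A possible mitigation, sketched below under an unconfirmed assumption: after fork, parent and child share the underlying file descriptor (and its offset), and when the child exits through Perl's normal cleanup it closes its copy of the :gzip handle, which may disturb that shared offset. Having the child leave via POSIX::_exit skips Perl's global destruction entirely, so the child never touches the handle. Note this bypasses $pm->finish, so run_on_finish callbacks and data returned from children will not work with this variant.

```perl
#!/usr/bin/perl
# Sketch of a possible workaround, NOT a confirmed fix: children exit via
# POSIX::_exit so they never run cleanup on the :gzip handle they inherited.
use strict;
use warnings;
use POSIX ();
use Parallel::ForkManager;
use PerlIO::gzip;

my $pm = Parallel::ForkManager->new(12);
open(my $in, '<:gzip', 'wat.paths.gz') or die "can't open index: $!";
while (my $file = <$in>) {
    next if $pm->start;  # parent continues reading
    # ... child work would go here ...
    POSIX::_exit(0);     # leave immediately; skip PerlIO cleanup in the child
}
close $in;
$pm->wait_all_children;  # children are still reaped via waitpid
```

If this makes the corruption go away, that would point at the child's handle teardown rather than the reads themselves.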