
This queue is for tickets about the PerlIO-gzip CPAN distribution.

Report information
The Basics
Id: 114557
Status: open
Priority: 0/
Queue: PerlIO-gzip

People
Owner: Nobody in particular
Requestors: Pascal [...] rkfd.com
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.19
Fixed in: (no value)



Subject: PerlIO::gzip and Parallel::ForkManager do not play nice together
Please note that file reads are always done in the parent process below; although child processes are created, nothing is actually done in them. I've tried the code below on a couple of different Linux boxes, and they all get data corruption while reading the index. The code works fine if you uncomment the 'next' (disabling Parallel::ForkManager) or gunzip the index beforehand and remove the ':gzip' layer (disabling PerlIO::gzip). The number of concurrent processes does not appear to matter. The corruption always starts at about the same line number for a given index, but at different line numbers for different indexes. Running the same thing multiple times will sometimes yield the exact same corruption and sometimes not. A couple of indexes (100K each) you can test with:

https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-27/wat.paths.gz
https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-48/wat.paths.gz

#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;
use PerlIO::gzip;

my $pm = Parallel::ForkManager->new(12);

open(IN, '<:gzip', 'wat.paths.gz') or die "can't open index";
while (my $file = <IN>) {
    print length($file) . ":$file\n" if length($file) > 142;
    #next;
    next if $pm->start;
    $pm->finish;
}
close IN;
$pm->wait_all_children;
From: Pascal [...] rkfd.com
On 2016-05-21 16:48:10, Pascal@rkfd.com wrote:
> Matching Parallel::ForkManager bug: https://github.com/dluxhu/perl-parallel-forkmanager/issues/11
A minimal test setup without Parallel::ForkManager. First create a gzipped file with sequential numbers:

perl -e 'for(1..100000){print "$_\n"}' | gzip > /tmp/test.gz

Then run the following modified test script, which now does only a single fork:

#!/usr/bin/perl

use strict;
use warnings;

use PerlIO::gzip;

my $forked;
open(IN, '<:gzip', '/tmp/test.gz') or die "can't open index";
while (my $line = <IN>) {
    print STDERR $line;
    if (!$forked) {
        if (fork == 0) {
            exit;
            #require POSIX; POSIX::_exit(0);
        }
        $forked = 1;
    }
}
close IN;

__END__

On my system, the corruption starts at line 3493. Looking at an strace log, this is the point where the second read on the gzipped stream happens. There's no problem if POSIX::_exit() is used instead of perl's exit(), so it could be that the global object destruction (which is skipped when POSIX::_exit is used) is somehow causing the problem.

Regards,
Slaven
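The exit()-vs-POSIX::_exit() distinction above can be sketched in isolation (a minimal, hedged example using only core modules; it shows the mechanics of the workaround, not the corruption itself):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();

# In a forked child, perl's exit() runs END blocks and global
# destruction, which tears down PerlIO layers on handles the child
# shares with the parent. POSIX::_exit() terminates via the raw
# _exit(2) syscall and skips all of that cleanup.
my $pid = fork;
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    POSIX::_exit(0);    # child: bypass global destruction entirely
}
waitpid($pid, 0);       # parent: reap the child
print "parent done\n";
```

This is why Slaven's test script behaves differently depending on which of the two exits the child uses: only exit() gives the interpreter a chance to touch the inherited handles.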
On 2016-05-21 16:44:39, Pascal@rkfd.com wrote:
Can you check if the problem goes away if you add the "unix" layer?

open(IN, '<:unix:gzip', ...
From: Pascal [...] rkfd.com
On Thu Jun 09 16:49:02 2016, SREZIC wrote:
> Can you check if the problem goes away if you add the "unix" layer?
>
> open(IN, '<:unix:gzip', ...
Yes, that does appear to fix the problem I was having. Thank you for looking into it.
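For reference, the confirmed fix can be shown as a self-contained script (a hedged sketch: the temp-file path and line count are arbitrary, and the core module IO::Compress::Gzip is used only to generate test data):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use PerlIO::gzip;
use IO::Compress::Gzip qw(gzip);    # core module, used only to create test data

# Generate a small gzipped file of sequential numbers (path is arbitrary).
my $tmp = "/tmp/test-unix-gzip.gz";
gzip \join("", map { "$_\n" } 1 .. 1000) => $tmp
    or die "gzip to $tmp failed";

# Pushing :unix below :gzip makes the underlying reads unbuffered, so a
# forked child has no shared :perlio read buffer to disturb when it exits.
my $forked;
open(my $in, '<:unix:gzip', $tmp) or die "can't open $tmp: $!";
my $count = 0;
while (my $line = <$in>) {
    $count++;
    if (!$forked) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            exit;    # plain exit() in the child is safe with the :unix layer
        }
        $forked = 1;
    }
}
close $in;
wait;                # reap the child
print "read $count lines\n";
```

With the plain '<:gzip' open, the same script exhibits the corruption described earlier in the thread; with '<:unix:gzip', the parent reads every line.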