Subject: Crash when restarting workers with higher # workers
Date: Fri, 18 Apr 2014 21:11:23 -0700
To: bug-MCE [...] rt.cpan.org
From: Shawn Halpenny <paxunix [...] gmail.com>
I have a long-running process where each worker must occasionally process a
large amount of data and ends up consuming a lot of memory. To prevent the
workers from starving the system, I exit each one after it has processed
its data, and use on_post_exit to restart it (similar to what's shown in
https://metacpan.org/pod/MCE::Core#restart_worker-wid).
However, I've noticed that with a larger number of workers, the IPC data
they send to the manager appears to get corrupted, killing the entire job.
This code should reproduce the error:
---------->8-----------
use MCE::Loop;

my $count = shift || 10;

sub iter
{
    return if ($count-- <= 0);
    return rand();
}

MCE::Loop::init {
    max_workers => $ENV{NUMWORKERS} || 1,
    gather => sub { return },
    on_post_exit => sub {
        my ($self, $e) = @_;
        MCE->restart_worker($e->{wid});
    },
};

mce_loop {
    my ($mce, $chunkRef, $chunkId) = @_;
    $mce->gather($chunkId, $chunkRef->[0]);
    exit(0);
} \&iter;
---------->8-----------
Run like so:
NUMWORKERS=40 script.pl 10000
You may have to vary NUMWORKERS and the number of input elements, but on
the two systems I've tried this on (RHEL5_64 perl5.8.8, and Ubuntu 12.04
LTS perl 5.14.2), it always crashes like this:
Storable binary image v40.73 more recent than I am (v2.8) at
/usr/lib/perl/5.14/Storable.pm line 416, <$__ANONIO__> line 1644, at
/home/halpenny/local/perl5/lib/perl5/MCE/Core/Manager.pm line 448
## mcelooptest2.pl: caught signal '__DIE__', exiting
or like this:
Argument "^D4497" isn't numeric in abs at
/local/perl/lib/perl5.8/MCE/Core/Manager.pm line 157, <$__ANONIO__> line
5114.
Argument "" isn't numeric in array element at
/local/perl/lib/perl5.8/MCE/Core/Manager.pm line 447, <$__ANONIO__> line
5116.
Magic number checking on storable string failed at
/local/perl/lib/perl5.8/Linux-2.6c2.5-x86_64-64int/Storable.pm line 417,
<$__ANONIO__> line 5116, at /local/perl/lib/perl5.8/MCE/Core/Manager.pm
line 448.
## script.pl: caught signal '__DIE__', exiting
If I don't exit+restart the worker, this script runs with no problems. It
also has no problems if the number of workers is smaller (like 2). I get
the same results on 2-CPU and 16-CPU machines.
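
For what it's worth, both Storable errors above look like what thaw()
reports when it is handed bytes that are not a frozen image, which is why I
suspect the data is damaged before it reaches the manager. Here is a minimal
sketch that involves no MCE at all. The two payloads are made up for
illustration (they were not captured from the failing run); as far as I can
tell, Storable reads the leading bytes of a string as its version header, so
I'd expect errors of the same two kinds as above, though the exact wording
may vary by Storable version:

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# A well-formed frozen image round-trips without complaint.
my $good = eval { thaw(freeze([ 1, 0.5 ])) };

# Hypothetical corrupted payloads (assumptions, not taken from MCE):
# text that was never frozen, and an empty read.
my %error_for;
for my $payload ("PI is not a frozen image", "") {
    eval { thaw($payload) };
    $error_for{$payload} = $@;
    print "thaw() died: $@" if $@;
}
```

If that holds, the manager is probably reading a truncated or interleaved
message rather than hitting a real Storable version mismatch.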