Bug #94869 for MCE: Crash when restarting workers with higher # workers

Sat Apr 19 00:12:14 2014 paxunix [...] gmail.com - Ticket created

Subject:	Crash when restarting workers with higher # workers
Date:	Fri, 18 Apr 2014 21:11:23 -0700
To:	bug-MCE [...] rt.cpan.org
From:	Shawn Halpenny <paxunix [...] gmail.com>

I have a long-running process where each worker must occasionally process a large amount of data and ends up consuming a lot of memory. To prevent the workers from starving the system, I exit each one after it has processed its data, and use on_post_exit to restart it (similar to what's shown in https://metacpan.org/pod/MCE::Core#restart_worker-wid).

However, I've noticed that with a larger number of workers, it appears that their IPC data communicated to the manager seems to get corrupted, killing the entire job.

This code should reproduce the error:

---------->8-----------

use MCE::Loop;

my $count = shift || 10;

sub iter
{
    return if ($count-- <= 0);
    return rand();
}

MCE::Loop::init {
    max_workers => $ENV{NUMWORKERS} || 1,
    gather => sub { return },
    on_post_exit => sub {
        my ($self, $e) = @_;
        MCE->restart_worker($e->{wid});
    },
};

mce_loop {
    my ($mce, $chunkRef, $chunkId) = @_;

    $mce->gather($chunkId, $chunkRef->[0]);
    exit(0);
} \&iter;

---------->8-----------

Run like so:
NUMWORKERS=40 script.pl 10000

You may have to vary NUMWORKERS and the number of input elements, but on the two systems I've tried this on (RHEL5_64 perl5.8.8, and Ubuntu 12.04 LTS perl 5.14.2), it always crashes like this:

Storable binary image v40.73 more recent than I am (v2.8) at /usr/lib/perl/5.14/Storable.pm line 416, <$__ANONIO__> line 1644, at /home/halpenny/local/perl5/lib/perl5/MCE/Core/Manager.pm line 448

## mcelooptest2.pl: caught signal '__DIE__', exiting

or like this:

Argument "^D4497" isn't numeric in abs at /local/perl/lib/perl5.8/MCE/Core/Manager.pm line 157, <$__ANONIO__> line 5114.
Argument "" isn't numeric in array element at /local/perl/lib/perl5.8/MCE/Core/Manager.pm line 447, <$__ANONIO__> line 5116.
Magic number checking on storable string failed at /local/perl/lib/perl5.8/Linux-2.6c2.5-x86_64-64int/Storable.pm line 417, <$__ANONIO__> line 5116, at /local/perl/lib/perl5.8/MCE/Core/Manager.pm line 448.

## script.pl: caught signal '__DIE__', exiting

If I don't exit+restart the worker, this script runs with no problems. It also has no problems if the number of workers is smaller (like 2). I've tried it on 2-cpu machines, and 16-cpu machines with the same results.
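
The two failure modes above (a non-numeric length argument and a Storable image that fails its header check) are what one would expect if the manager started reading a message at the wrong byte offset. The following is only a minimal sketch of that effect using core Perl and Storable. The framing used here (a decimal length line followed by a frozen payload) and the helper names encode_frame and decode_frame are assumptions made for illustration; the ticket does not show MCE's actual wire format.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Hypothetical framing: "<length>\n<frozen payload>". Not MCE's real protocol.
sub encode_frame {
    my ($data) = @_;
    my $payload = freeze($data);
    return length($payload) . "\n" . $payload;
}

# Decode one frame. Returns ($data, undef) on success or (undef, $error).
sub decode_frame {
    my ($buf) = @_;
    my ($len, $rest) = split /\n/, $buf, 2;
    return (undef, "bad length field")
        unless defined $len && defined $rest && $len =~ /\A\d+\z/;
    return (undef, "short payload") if length($rest) < $len;

    # thaw() croaks on a damaged header, so trap the exception.
    my $data = eval { thaw(substr($rest, 0, $len)) };
    return (undef, "thaw failed: " . ($@ || "no data")) unless defined $data;
    return ($data, undef);
}

my $frame = encode_frame({ id => 3, value => 0.5 });

# Aligned read: the reader starts at the first byte of the frame.
my ($good, $good_err) = decode_frame($frame);
print "aligned read:    ", (defined $good ? "ok" : $good_err), "\n";

# Misaligned read: the reader starts one byte late, as could happen if
# two writers interleave their output on a shared channel.
my ($bad, $bad_err) = decode_frame(substr($frame, 1));
print "misaligned read: ", (defined $bad ? "ok" : $bad_err), "\n";
```

A one-byte shift is enough to make the length field wrong, after which the payload handed to thaw() no longer starts or ends where the writer intended.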

Sat Apr 19 02:02:40 2014 marioeroy [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #94869] Crash when restarting workers with higher # workers
Date:	Sat, 19 Apr 2014 02:02:27 -0400
To:	bug-MCE [...] rt.cpan.org
From:	Mario Roy <marioeroy [...] gmail.com>

Thank you for the report. You can try the following in the meantime. Open up lib/MCE.pm and search for the DATA_CHANNELS constant. Change the value from 8 to match the value for NUMWORKERS (line 2002 in MCE 1.512).

    DATA_CHANNELS => 80, ## Maximum IPC "DATA" channels

    $ NUMWORKERS=80 ./restart.pl 10000
    $ NUMWORKERS=80 ./restart.pl 100000
    $ NUMWORKERS=200 ./restart.pl 100000 (I changed DATA_CHANNELS => 200)

The above worked flawlessly.

The solution may be to allow one to specify the number of DATA_CHANNELS used internally by MCE. Currently, workers share data channels. Restarting a worker may cause issues at the socket level due to socket initialization after the worker has started (not sure at this time, may possibly be at the Perl/OS level). For apps that restart workers often, the solution may very well be for each worker to be given its own unique data channel.

    data_channels => 'auto', <--- will set this to the same value as max_workers
    data_channels => NUM, <--- default is 8, one can override by specifying a value 1 or higher

I will reduce the time delay behind the scenes, which is needed by Cygwin/Windows and not UNIX. I commented out the 0.002 delay at the end of the restart_worker function inside lib/MCE.pm. I will make another subtle change to the exit function as well (not necessary to get your app to run now).

Increasing DATA_CHANNELS increases the number of file handles. However, it does eliminate the need for file locking during IPC because each worker is assigned a unique data channel.

Regards,
Mario

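
To make the difference between shared and unique data channels concrete, here is a small sketch of how worker IDs might be spread over a fixed number of channels. The round-robin mapping and the names channel_for and workers_per_channel are hypothetical; MCE's internal assignment is not shown in this ticket and may differ.

```perl
use strict;
use warnings;

# Hypothetical round-robin mapping of worker IDs (1..N) onto data
# channels (1..C). This only illustrates sharing; it is not MCE's code.
sub channel_for {
    my ($wid, $num_channels) = @_;
    return (($wid - 1) % $num_channels) + 1;
}

# Count how many workers end up on each channel.
sub workers_per_channel {
    my ($max_workers, $num_channels) = @_;
    my %count;
    $count{ channel_for($_, $num_channels) }++ for 1 .. $max_workers;
    return \%count;
}

my $shared = workers_per_channel(40, 8);     # default of 8 channels
my $unique = workers_per_channel(40, 40);    # one channel per worker

printf "8 channels:  channel 1 serves %d workers\n", $shared->{1};
printf "40 channels: channel 1 serves %d worker(s)\n", $unique->{1};
```

Under this assumed mapping, 40 workers on 8 channels means every channel is shared by five workers, so writes must be serialized with a lock. With as many channels as workers, no channel has more than one writer, at the cost of more open file handles.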

Sat Apr 19 02:02:40 2014 The RT System itself - Status changed from 'new' to 'open'

Sat Apr 19 02:39:58 2014 marioeroy [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #94869] Crash when restarting workers with higher # workers
Date:	Sat, 19 Apr 2014 02:39:49 -0400
To:	bug-MCE [...] rt.cpan.org
From:	Mario Roy <marioeroy [...] gmail.com>

On a good note, there is "no" need to increase the constant value for DATA_CHANNELS. The fix has been committed for RT#94869. I will test with the various environments next week. The fix solves the issue for the Linux environment and perhaps other environments too.

https://code.google.com/p/many-core-engine-perl/source/detail?r=530

Regards,
Mario

Sat Apr 19 18:48:16 2014 marioeroy [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #94869] Crash when restarting workers with higher # workers
Date:	Sat, 19 Apr 2014 18:48:07 -0400
To:	bug-MCE [...] rt.cpan.org
From:	Mario Roy <marioeroy [...] gmail.com>

Hi Shawn,

I am happy to report that this has been resolved with SVN commit r531 and will be included in the next release, MCE 1.513.

https://code.google.com/p/many-core-engine-perl/source/detail?r=531

It is recommended to use MCE->exit() for better handling, especially under the Windows environment. The $e->{wid} argument is no longer necessary starting with the 1.5 release. The perldoc MCE::Core.pod was updated to reflect this.

    use MCE::Loop;

    my $count = shift || 10;
    my $n = 0;

    sub iter {
        return if ($count-- <= 0);
        return rand();
    }

    MCE::Loop::init {
        max_workers => $ENV{NUMWORKERS} || 1,
        gather => sub { return },
        on_post_exit => sub {
            my ($mce, $e) = @_;
            print ++$n, ": $e->{wid}: $e->{pid}\n";
            MCE->restart_worker();
        }
    };

    mce_loop {
        my ($mce, $chunkRef, $chunkId) = @_;

        MCE->gather($chunkId, $chunkRef->[0]);
        MCE->exit(0);
    } \&iter;

Regards,
Mario

Sat Apr 19 19:22:16 2014 MARIOROY [...] cpan.org - Status changed from 'open' to 'resolved'

Sun Apr 20 07:38:43 2014 MARIOROY [...] cpan.org - Fixed in 1.513 added
