Bug #38724 for Parallel-ForkManager: Parallel::ForkManager loops in wait_all_children

Tue Aug 26 05:05:51 2008 frederik [...] remote.org - Ticket created

Subject:	Parallel::ForkManager loops in wait_all_children
Date:	Tue, 26 Aug 2008 11:03:38 +0200
To:	bug-Parallel-ForkManager [...] rt.cpan.org
From:	Frederik Ramm <frederik [...] remote.org>

Hi,

I have a problem under Linux where a rather complex script of mine sometimes hangs (in a tight loop) when it runs wait_all_children. I cannot reproduce it with a test script; it only happens in production, and only sometimes!

I'm not doing anything strange, just instantiating a ForkManager, then every now and then doing a "start" and "finish". No callbacks, nothing.

strace()ing a hanging process reveals that it continuously calls "wait4", which returns an ECHILD error (no children to wait for). I inspected the source and I believe it must somehow have missed a SIGCHLD, so that it thinks there are still child processes while in fact there aren't.

I will now try to fix it by changing wait_all_children thus:

    sub wait_all_children { my ($s)=@_;
      while (keys %{ $s->{processes} }) {
        $s->on_wait;
        $s->wait_one_child(defined $s->{on_wait_period} ? &WNOHANG : undef);
        if ($! == ECHILD) {
          delete $s->{processes};
          last;
        }
      };
    }

Of course this is a very brutal way to do it. It would be better not to miss the SIGCHLD in the first place, but at least I hope my program can continue this way.

Bye
Frederik
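The symptom described in this report can be reproduced in isolation. The following is a minimal sketch, not the module's code: wait_all and %processes are hypothetical stand-ins for wait_all_children and the module's tracking hash, and the stale entry is created artificially by reaping the child behind the tracker's back. It shows waitpid returning -1 with ECHILD while the hash is still non-empty, and a loop that treats that condition as "everything left is stale" instead of spinning.

```perl
use strict;
use warnings;
use Errno qw(ECHILD);

# Hypothetical stand-alone model of a wait loop over a tracking hash.
# Returns the PIDs that were still tracked although no children existed.
sub wait_all {
    my ($processes) = @_;                 # hash ref: pid => identifier
    my @forgotten;
    while (keys %$processes) {
        my $kid = waitpid(-1, 0);         # block until some child exits
        my $err = $! + 0;                 # save errno before anything resets it
        if ($kid == -1) {
            next unless $err == ECHILD;   # e.g. interrupted by a signal: retry
            # No children exist at all, so every remaining entry is stale.
            @forgotten = sort { $a <=> $b } keys %$processes;
            %$processes = ();
            last;
        }
        delete $processes->{$kid};
    }
    return @forgotten;
}

my $pid = fork();
die "fork failed: $!" unless defined $pid;
exit 0 if $pid == 0;                      # child: exit immediately

my %processes = ($pid => 'job-1');        # the tracker believes the child is alive
waitpid($pid, 0);                         # but something else reaps it first

my @forgotten = wait_all(\%processes);
print "stale entries dropped: @forgotten\n";
```

Without the ECHILD branch, the while loop above would never terminate, which matches the tight wait4 loop seen in strace.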

Sun Aug 31 07:17:47 2008 dlux [...] dlux.hu - Correspondence added

Hi,

Thanks for letting me know. I currently don't really have time for that, but as soon as I do, I'll check this...

Cheers,
Balázs

Sun Aug 31 07:18:09 2008 The RT System itself - Status changed from 'new' to 'open'

Sun Aug 31 07:18:30 2008 dlux [...] dlux.hu - Given to DLUX

Sat Nov 22 18:46:57 2008 dlux [...] dlux.hu - Correspondence added

Hi,

I'm trying to find a solution which catches all signals, but the documentation has not made me any wiser.

Can you help me with this? I wonder whether the logic in wait_one_child is not perfect. I also wonder whether the NT waitpid implementation would work better on Linux, too.

Do you have time to test it?

Balázs

Sat Nov 22 18:54:48 2008 dlux [...] dlux.hu - Correspondence added


Sat Nov 22 18:54:50 2008 dlux [...] dlux.hu - Status changed from 'open' to 'stalled'

Sat Nov 22 19:10:20 2008 dlux [...] dlux.hu - Correspondence added

Do you use the on_wait callback? It temporarily switches off the CHLD signal handling, so maybe it causes the problem. Could you test it?

Unfortunately I am not using this module any more, so I cannot really do that myself...

Sat Nov 22 19:10:22 2008 The RT System itself - Status changed from 'stalled' to 'open'

Sat Nov 22 19:10:58 2008 dlux [...] dlux.hu - Status changed from 'open' to 'stalled'

Wed Jun 17 10:47:56 2009 peter [...] makholm.net - Correspondence added

On Tue Aug 26 05:05:51 2008, frederik@remote.org wrote:

> I have a problem under Linux where a rather complex script I did
> sometimes hangs (in a tight loop) when it runs wait_all_children.

I've just experienced the same, and after a bit of research I found another piece of code doing waitpid(2) calls, probably stealing some PIDs from Parallel::ForkManager.

I know that this configuration isn't supported by Parallel::ForkManager, but it would be nice if Parallel::ForkManager were more robust when this happens.

Frederik's solution would be one step. Wrapping _waitpid to scan for "missed" processes would be another step.

Either way, if you don't use the module anymore and don't have time to maintain it, I could offer to take over maintenance of it.
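The idea of scanning for "missed" processes could look roughly like the sketch below. It is an assumption about one possible approach, not the module's implementation: reap_tracked is a hypothetical helper that polls each tracked PID individually rather than calling waitpid(-1), so it neither reaps other code's children nor keeps waiting for children that other code has already reaped. The "foreign" waitpid call in the demo stands in for the other piece of code mentioned above.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Hypothetical helper: poll every tracked PID once, without blocking.
# Returns the statuses collected and the PIDs that somebody else reaped.
sub reap_tracked {
    my ($processes) = @_;                  # hash ref: pid => identifier
    my (%finished, @missed);
    for my $pid (sort { $a <=> $b } keys %$processes) {
        my $kid = waitpid($pid, WNOHANG);
        if ($kid == $pid) {                # exited; its status is in $?
            $finished{$pid} = $?;
            delete $processes->{$pid};
        }
        elsif ($kid == -1) {               # no longer a child of ours
            push @missed, $pid;
            delete $processes->{$pid};
        }
        # $kid == 0: still running, keep tracking it
    }
    return (\%finished, \@missed);
}

my (%processes, @pids);
for my $code (0, 7) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    exit $code if $pid == 0;               # child: exit with the given code
    $processes{$pid} = "job-$code";
    push @pids, $pid;
}

waitpid($pids[0], 0);                      # "foreign" code steals the first child

my (%finished, @missed);
while (keys %processes) {
    my ($f, $m) = reap_tracked(\%processes);
    %finished = (%finished, %$f);
    push @missed, @$m;
    select(undef, undef, undef, 0.05) if keys %processes;   # brief pause
}
print "missed: @missed\n";
```

The trade-off is that polling per PID costs one system call per tracked child on every pass, whereas waitpid(-1) needs only one call per exited child.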

Wed Jun 17 10:47:57 2009 The RT System itself - Status changed from 'stalled' to 'open'

Thu Jun 18 16:43:58 2009 dlux [...] dlux.hu - Correspondence added

Subject:	Re: [rt.cpan.org #38724] Parallel::ForkManager loops in wait_all_children
Date:	Thu, 18 Jun 2009 22:43:20 +0200
To:	bug-Parallel-ForkManager [...] rt.cpan.org
From:	Balázs Szabó <dlux [...] dlux.hu>

Hi Peter,

Good ideas! I'm glad to hear that you are volunteering to maintain the module. Please drop me a private email so that we can discuss the details!

Balázs

-- Balázs Szabó (dLux) www.dlux.hu 你很好奇

Mon Feb 15 16:43:56 2010 fbicknel [...] nc.rr.com - Correspondence added

From:	fbicknel [...] nc.rr.com

I haven't seen much activity here of late, but I think I've stumbled into the same situation: I can't figure out why, but sometimes pm will get in a situation where the child processes it should be tracking are gone, but it continues to think they are still there.

I fixed this in my own brute-force way by adding this to wait_one_child. I chose to put it here, as that seems to be the go-to method for waiting.

Anyway, my addition appears below (line 342 in the sample of code below). If I can find out what is causing the 'dropped' deletes, maybe I could attack the source of the problem rather than just fix it in this brute force way. I'll let you know if I can.

I also realize this may not work on other platforms; sorry I can't test it anywhere but Unix.

    332 sub wait_one_child { my ($s,$par)=@_;
    333   my $kid;
    334   while (1) {
    335     $kid = $s->_waitpid(-1,$par||=0);
    336     last if $kid == 0 || $kid == -1; # AS 5.6/Win32 returns negative PIDs
    337     redo if !exists $s->{processes}->{$kid};
    338     my $id = delete $s->{processes}->{$kid};
    339     $s->on_finish( $kid, $? >> 8 , $id, $? & 0x7f, $? & 0x80 ? 1 : 0);
    340     last;
    341   }
    342   # Make sure there are not 'package zombies', i.e. processes
    343   # that have exited, but are somehow still in the tracking hash
    344   for my $kid (keys %{$s->{'processes'}}) {
    345     unless (kill (0, $kid)) {
    346       delete $s->{'processes'}{$kid};
    347     }
    348   }
    349   $kid;
    350 };
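The kill(0, $kid) probe used in lines 344 to 348 behaves in a way that is worth seeing on its own. The sketch below is illustrative only and uses nothing from the module: it shows that signal 0 still succeeds for a child that has exited but has not yet been reaped, and fails with ESRCH only once the PID has been reaped. The probe therefore detects entries whose processes were reaped elsewhere, which is the case discussed in this ticket. Two caveats apply: it can give a false "alive" answer if the kernel has reused the PID for an unrelated process, and kill can also fail with EPERM for a process owned by another user.

```perl
use strict;
use warnings;
use Errno qw(ESRCH);

my $pid = fork();
die "fork failed: $!" unless defined $pid;
exit 0 if $pid == 0;                      # child: exit straight away

# Until the parent reaps it, the child still occupies its PID (running or
# as a zombie), so the probe succeeds even if the child has already exited.
my $alive_before = kill(0, $pid) ? 1 : 0;

waitpid($pid, 0);                         # reap the child

# Only now is the PID gone: kill fails and sets $! to ESRCH.
my $alive_after = kill(0, $pid) ? 1 : 0;
my $gone        = ($! == ESRCH) ? 1 : 0;

print "before reaping: $alive_before, after reaping: $alive_after\n";
```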

Mon Feb 15 21:07:33 2010 dlux [...] dlux.hu - Correspondence added

Subject:	Re: [rt.cpan.org #38724] Parallel::ForkManager loops in wait_all_children
Date:	Tue, 16 Feb 2010 02:06:11 +0000
To:	bug-Parallel-ForkManager [...] rt.cpan.org
From:	Balázs Szabó <dlux [...] dlux.hu>

Hi Frank,

Thanks for the investigation!

I accept patches if you have a good solution!

Balázs

-- Balázs Szabó (dLux) www.dlux.hu 你很好奇

Mon Oct 18 11:33:05 2010 dlux [...] dlux.hu - Correspondence added

Hi all,

Is it happening to you? I wonder what could cause this.

Frank's solution should work. I have only one thing to worry about: the return value of the child process. We have to call the on_finish callback with some return value.

Balázs
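The concern about the return value can be made concrete with a small sketch. decode_status is a hypothetical helper, not part of the module; it applies the same three expressions that the code quoted earlier in this ticket passes to on_finish ($? >> 8, $? & 0x7f, $? & 0x80) to a wait status. For a child that was reaped elsewhere the parent never receives a wait status, so the sketch assumes the caller passes undef and gets undefined values back instead of a made-up exit code.

```perl
use strict;
use warnings;

# Hypothetical helper: split a wait status into exit code, signal number
# and core-dump flag. An undefined status means the status was lost.
sub decode_status {
    my ($status) = @_;
    return (undef, undef, undef) unless defined $status;
    return ($status >> 8, $status & 0x7f, ($status & 0x80) ? 1 : 0);
}

my $pid = fork();
die "fork failed: $!" unless defined $pid;
exit 3 if $pid == 0;                      # child: exit with code 3

waitpid($pid, 0);                         # normal case: status arrives in $?
my ($exit_code, $signal, $core_dump) = decode_status($?);

my @lost = decode_status(undef);          # child reaped elsewhere: nothing to decode

print "exit code $exit_code, signal $signal, core dump $core_dump\n";
```

Whether a callback should receive undef, or some sentinel value, for a lost child is a design decision that this ticket leaves open.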

Mon Oct 18 11:33:05 2010 dlux [...] dlux.hu - Status changed from 'open' to 'stalled'

Wed Jul 03 11:04:51 2013 BIAFRA [...] cpan.org - Correspondence added

This happens to me under a Starman/Plack/Dancer system on Linux. I couldn't figure out why, but after some time (or accesses) waitpid() starts returning -1 for all the children. The children themselves run fine, as I can check the output that each child stores under /tmp for retrieval.

Attached is a patch, based on Frank Bicknell's suggestion, that calls "on_finish" so that you can retrieve the data produced by each "lost" child.

Subject:	Parallel-ForkManager.diff

--- ForkManager.pm.orig	2013-07-03 15:47:46.870631541 +0100
+++ ForkManager.pm	2013-07-03 15:47:14.138469231 +0100
@@ -557,6 +557,38 @@
     $s->on_finish( $kid, $? >> 8 , $id, $? & 0x7f, $? & 0x80 ? 1 : 0, $retrieved);
     last;
   }
+
+  # https://rt.cpan.org/Public/Bug/Display.html?id=38724
+  if ( $kid == -1 ) {
+
+    # Make sure there are not 'package zombies', i.e. processes
+    # that have exited, but are somehow still in the tracking hash
+
+    for my $kid (keys %{$s->{'processes'}}) {
+      unless (kill (0, $kid)) {
+
+        # retrieve child data structure, if any
+        my $retrieved = undef;
+        my $storable_tempfile = File::Spec->catfile($s->{tempdir}, 'Parallel-ForkManager-' . $$ . '-' . $kid . '.txt');
+        if (-e $storable_tempfile) { # child has option of not storing anything, so we need to see if it did or not
+          $retrieved = eval { return &retrieve($storable_tempfile); };
+
+          # handle Storables errors
+          if (not $retrieved or $@) {
+            warn(qq|The storable module was unable to retrieve the child's data structure from the temporary file "$storable_tempfile": | . join(', ', $@));
+          }
+
+          # clean up after ourselves
+          unlink $storable_tempfile;
+        }
+
+        my $id = delete $s->{processes}->{$kid};
+
+        $s->on_finish( $kid, $? >> 8 , $id, $? & 0x7f, $? & 0x80 ? 1 : 0, $retrieved);
+      }
+    }
+  }
+
   $kid;
 };

Wed Jul 03 11:04:51 2013 The RT System itself - Status changed from 'stalled' to 'open'
