Bug #131058 for Net-WebSocket-Server: on_tick stops being called after 7-8 hours of constant operation

Sat Nov 23 06:28:30 2019 pliablepixels [...] gmail.com - Ticket created

Subject:	on_tick stops being called after 7-8 hours of constant operation
Date:	Sat, 23 Nov 2019 06:28:02 -0500
To:	bug-Net-WebSocket-Server [...] rt.cpan.org
From:	Pliable Pixels <pliablepixels [...] gmail.com>

Hi, thanks for the great library. I am using your library to implement an event server of sorts that receives alarm notifications from an open source home security server (ZoneMinder) and does machine learning to detect objects in the alarm feed. The core "loop" of my server relies on your on_tick callback here https://github.com/pliablepixels/zmeventnotification/blob/master/zmeventnotification.pl#L2320-L2351 What I've been observing is that after around 7-8 hours of smooth operation, the 'on_tick' handler is just not called. I inserted debugging statements inside /usr/local/share/perl/5.26.1/Net/WebSocket/Server.pm to log messages when it invokes the on_tick and have confirmed it does not. Nor does it exit the start() sub. I've now set it up to log more debug statements after each step (select/etc) inside server.pm but wanted to reach out to you to ask if you've experienced this in the past and know what this may be? At the time it "locks up" my server is still running (hasn't crashed) but obviously I can't connect to it, because its locked somewhere. I'm still debugging, but thought I'd ask :-) thx

Sat Nov 23 20:20:18 2019 TOPAZ [...] cpan.org - Correspondence added

On Sat Nov 23 06:28:30 2019, pliablepixels@gmail.com wrote: Show quoted text

> Hi, thanks for the great library. I am using your library to implement > an > event server of sorts that receives alarm notifications from an open > source > home security server (ZoneMinder) and does machine learning to detect > objects in the alarm feed. > > The core "loop" of my server relies on your on_tick callback here > https://github.com/pliablepixels/zmeventnotification/blob/master/zmeventnotification.pl#L2320- > L2351 > > What I've been observing is that after around 7-8 hours of smooth > operation, the 'on_tick' handler is just not called. I inserted > debugging > statements inside /usr/local/share/perl/5.26.1/Net/WebSocket/Server.pm > to > log messages when it invokes the on_tick and have confirmed it does > not. > Nor does it exit the start() sub. > > I've now set it up to log more debug statements after each step > (select/etc) inside server.pm but wanted to reach out to you to ask if > you've experienced this in the past and know what this may be? > > At the time it "locks up" my server is still running (hasn't crashed) > but > obviously I can't connect to it, because its locked somewhere. > > I'm still debugging, but thought I'd ask :-) > > thx

I don't know of a bug like this, but that doesn't mean there isn't one. Using fork() inside any handler for a server is tricky business; that might be the cause. fork() can do weird things with cloning filehandles or sharing sockets. I'm not exactly sure how Net::WebSocket::Server behaves when forked, but it might help to directly close your open file descriptors before continuing in the child process. (Which descriptors you close depends on your application.) There are some modules that make forking a child process safer, like Proc::Daemon, if that suits your needs. If you want to just try closing the file descriptors directly without everything else Proc::Daemon does, it closes all of your open file descriptors by: 1. Getting the highest descriptor id: https://metacpan.org/source/AKREAL/Proc-Daemon-0.23/lib/Proc/Daemon.pm#L461 2. Looping over them and calling POSIX::close() conditionally https://metacpan.org/source/AKREAL/Proc-Daemon-0.23/lib/Proc/Daemon.pm#L224 If that's not the issue, it might help in your debugging to log the value of $timeout just before the call to select() - https://metacpan.org/source/TOPAZ/Net-WebSocket-Server-0.003004/lib/Net/WebSocket/Server.pm#L157 - that would tell you 1. whether the event loop is re-entering the select() consistently and 2. whether the amount of time it intends to wait looks sane. Because it's single-threaded, also look out for bugs where one of your handlers (or the server code itself, somehow) is blocking on some kind of operation. I'd also check for memory leaks; see if your server process is using increasingly larger amounts of memory as it runs over several hours. If all else fails, you should at least be able to narrow down which section of the main loop in start() gets stuck with some carefully-placed log lines. Please let me know what you find; I'd love to fix a bug if I have one (or maybe add better documentation about how to avoid issues in callbacks if there's a subtle bug in your code somewhere).

Sat Nov 23 20:20:18 2019 The RT System itself - Status changed from 'new' to 'open'

Sun Nov 24 21:55:43 2019 pliablepixels [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #131058] on_tick stops being called after 7-8 hours of constant operation
Date:	Sun, 24 Nov 2019 21:55:17 -0500
To:	bug-Net-WebSocket-Server [...] rt.cpan.org
From:	Pliable Pixels <pliablepixels [...] gmail.com>

Hi Eric, thank you for the detailed explanation. Unfortunately, now that I've added debug prints everywhere the crash stopped occurring. I've decided to write a small program that mimics my operation with your code without the complexities of my in fork code. I added in the code to close the fds in the child per your suggestion. The gist is here https://gist.github.com/pliablepixels/440e14532beac01af0453f9f7e322519 When the child exits, I see Warning: unable to close filehandle properly: Bad file descriptor during global destruction. Would you know if I'm doing something wrong? It seems the code is closing an fd is should not. I'm skipping 0-2 (in/out/err) and the code reports the first fd it closes is 4. Thanks On Sat, Nov 23, 2019 at 8:20 PM Eric Wastl via RT < bug-Net-WebSocket-Server@rt.cpan.org> wrote: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=131058 > > > On Sat Nov 23 06:28:30 2019, pliablepixels@gmail.com wrote:

> > Hi, thanks for the great library. I am using your library to implement > > an > > event server of sorts that receives alarm notifications from an open > > source > > home security server (ZoneMinder) and does machine learning to detect > > objects in the alarm feed. > > > > The core "loop" of my server relies on your on_tick callback here > >

> https://github.com/pliablepixels/zmeventnotification/blob/master/zmeventnotification.pl#L2320-

> > L2351 > > > > What I've been observing is that after around 7-8 hours of smooth > > operation, the 'on_tick' handler is just not called. I inserted > > debugging > > statements inside /usr/local/share/perl/5.26.1/Net/WebSocket/Server.pm > > to > > log messages when it invokes the on_tick and have confirmed it does > > not. > > Nor does it exit the start() sub. > > > > I've now set it up to log more debug statements after each step > > (select/etc) inside server.pm but wanted to reach out to you to ask if > > you've experienced this in the past and know what this may be? > > > > At the time it "locks up" my server is still running (hasn't crashed) > > but > > obviously I can't connect to it, because its locked somewhere. > > > > I'm still debugging, but thought I'd ask :-) > > > > thx

> > > I don't know of a bug like this, but that doesn't mean there isn't one. > > Using fork() inside any handler for a server is tricky business; that > might be the cause. fork() can do weird things with cloning filehandles or > sharing sockets. I'm not exactly sure how Net::WebSocket::Server behaves > when forked, but it might help to directly close your open file descriptors > before continuing in the child process. (Which descriptors you close > depends on your application.) There are some modules that make forking a > child process safer, like Proc::Daemon, if that suits your needs. If you > want to just try closing the file descriptors directly without everything > else Proc::Daemon does, it closes all of your open file descriptors by: > 1. Getting the highest descriptor id: > https://metacpan.org/source/AKREAL/Proc-Daemon-0.23/lib/Proc/Daemon.pm#L461 > 2. Looping over them and calling POSIX::close() conditionally > https://metacpan.org/source/AKREAL/Proc-Daemon-0.23/lib/Proc/Daemon.pm#L224 > > If that's not the issue, it might help in your debugging to log the value > of $timeout just before the call to select() - > https://metacpan.org/source/TOPAZ/Net-WebSocket-Server-0.003004/lib/Net/WebSocket/Server.pm#L157 > - that would tell you 1. whether the event loop is re-entering the select() > consistently and 2. whether the amount of time it intends to wait looks > sane. > > Because it's single-threaded, also look out for bugs where one of your > handlers (or the server code itself, somehow) is blocking on some kind of > operation. > > I'd also check for memory leaks; see if your server process is using > increasingly larger amounts of memory as it runs over several hours. > > If all else fails, you should at least be able to narrow down which > section of the main loop in start() gets stuck with some carefully-placed > log lines. > > Please let me know what you find; I'd love to fix a bug if I have one (or > maybe add better documentation about how to avoid issues in callbacks if > there's a subtle bug in your code somewhere). >

Sat Nov 30 09:13:32 2019 pliablepixels [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #131058] on_tick stops being called after 7-8 hours of constant operation
Date:	Sat, 30 Nov 2019 09:13:10 -0500
To:	bug-Net-WebSocket-Server [...] rt.cpan.org
From:	Pliable Pixels <pliablepixels [...] gmail.com>

Okay, I really do think this has something to do with the code inside server.pm and not forking. What exactly, I haven't found yet. First, here are some logs from Server.pm (I added prints): Fri Nov 29 10:28:48 2019************ core: before select with timeout:4.99757385253906 Fri Nov 29 10:28:53 2019************ core: after select Fri Nov 29 10:28:53 2019************ core: before foreach ready_write Fri Nov 29 10:28:53 2019************ core: after foreach ready_write Fri Nov 29 10:28:53 2019************ core: tick period:5 tick_next:1575041333.49427 time: 1575041333.49959 Fri Nov 29 10:28:53 2019************* calling on_tick Fri Nov 29 10:28:53 2019************ core: inside while Fri Nov 29 10:28:53 2019************ core: before select with timeout:4.09693813323975 Fri Nov 29 10:28:57 2019************ core: after select Fri Nov 29 10:28:57 2019************ core: before foreach ready_write Fri Nov 29 10:28:57 2019************ core: after foreach ready_write Fri Nov 29 10:28:57 2019************ core: tick period:5 tick_next:1575041338.49427 time: 1575041337.60232 Fri Nov 29 10:28:57 2019************ core: inside while Fri Nov 29 10:28:57 2019************ core: before select with timeout:0.891774892807007 Fri Nov 29 10:28:58 2019************ core: after select Fri Nov 29 10:28:58 2019************ core: before foreach ready_write Fri Nov 29 10:28:58 2019************ core: after foreach ready_write Fri Nov 29 10:28:58 2019************ core: tick period:5 tick_next:1575041338.49427 time: 1575041338.49548 Fri Nov 29 10:28:58 2019************* calling on_tick Fri Nov 29 10:28:58 2019************ core: inside while Fri Nov 29 10:28:58 2019************ core: before select with timeout:4.99761295318604 Fri Nov 29 10:29:03 2019************ core: after select Fri Nov 29 10:29:03 2019************ core: before foreach ready_write Fri Nov 29 10:29:03 2019************ core: after foreach ready_write Fri Nov 29 10:29:03 2019************ core: tick period:5 tick_next:1575041343.49427 time: 1575041343.49968 Fri Nov 29 10:29:03 2019************* calling on_tick Fri Nov 29 10:29:03 2019************ core: inside while Fri Nov 29 10:29:03 2019************ core: before select with timeout:4.10133290290833 Fri Nov 29 10:29:07 2019************ core: after select - - - - - stuck at this point - - - - - Now here is the modified server.pm file https://gist.github.com/pliablepixels/348281b4445ac044552917970bf4b47b So basically, it gets stuck in this loop https://gist.github.com/pliablepixels/348281b4445ac044552917970bf4b47b#file-server-pm-L169-L187 I can then send the process a HUP (I have a HUP handler) that then gets it out of the stuck loop but it gets stuck again. I can repeat this every time I send a HUP So I just sent a HUP to that process, and it gets out of that loop, but gets stuck again in the next iteration. Sat Nov 30 09:07:14 2019************ core: before foreach ready_write Sat Nov 30 09:07:14 2019************ core: after foreach ready_write Sat Nov 30 09:07:14 2019************ core: tick period:5 tick_next:1575041353.49427 time: 1575122834.35031 Sat Nov 30 09:07:14 2019************* calling on_tick Sat Nov 30 09:07:14 2019************ core: inside while Sat Nov 30 09:07:14 2019************ core: before select with timeout:-81475.8615190983 Sat Nov 30 09:07:14 2019************ core: after select On Sun, Nov 24, 2019 at 9:55 PM Pliable Pixels <pliablepixels@gmail.com> wrote: Show quoted text

> Hi Eric, thank you for the detailed explanation. > Unfortunately, now that I've added debug prints everywhere the crash > stopped occurring. > > I've decided to write a small program that mimics my operation with your > code without the complexities of my in fork code. I added in the code to > close the fds in the child per your suggestion. > > The gist is here > https://gist.github.com/pliablepixels/440e14532beac01af0453f9f7e322519 > > When the child exits, I see > Warning: unable to close filehandle properly: Bad file descriptor during > global destruction. > > Would you know if I'm doing something wrong? It seems the code is closing > an fd is should not. I'm skipping 0-2 (in/out/err) and the code reports the > first fd it closes is 4. > > Thanks > > > On Sat, Nov 23, 2019 at 8:20 PM Eric Wastl via RT < > bug-Net-WebSocket-Server@rt.cpan.org> wrote: >

>> <URL: https://rt.cpan.org/Ticket/Display.html?id=131058 > >> >> On Sat Nov 23 06:28:30 2019, pliablepixels@gmail.com wrote:

>> > Hi, thanks for the great library. I am using your library to implement >> > an >> > event server of sorts that receives alarm notifications from an open >> > source >> > home security server (ZoneMinder) and does machine learning to detect >> > objects in the alarm feed. >> > >> > The core "loop" of my server relies on your on_tick callback here >> >

>> https://github.com/pliablepixels/zmeventnotification/blob/master/zmeventnotification.pl#L2320-

>> > L2351 >> > >> > What I've been observing is that after around 7-8 hours of smooth >> > operation, the 'on_tick' handler is just not called. I inserted >> > debugging >> > statements inside /usr/local/share/perl/5.26.1/Net/WebSocket/Server.pm >> > to >> > log messages when it invokes the on_tick and have confirmed it does >> > not. >> > Nor does it exit the start() sub. >> > >> > I've now set it up to log more debug statements after each step >> > (select/etc) inside server.pm but wanted to reach out to you to ask if >> > you've experienced this in the past and know what this may be? >> > >> > At the time it "locks up" my server is still running (hasn't crashed) >> > but >> > obviously I can't connect to it, because its locked somewhere. >> > >> > I'm still debugging, but thought I'd ask :-) >> > >> > thx

>> >> >> I don't know of a bug like this, but that doesn't mean there isn't one. >> >> Using fork() inside any handler for a server is tricky business; that >> might be the cause. fork() can do weird things with cloning filehandles or >> sharing sockets. I'm not exactly sure how Net::WebSocket::Server behaves >> when forked, but it might help to directly close your open file descriptors >> before continuing in the child process. (Which descriptors you close >> depends on your application.) There are some modules that make forking a >> child process safer, like Proc::Daemon, if that suits your needs. If you >> want to just try closing the file descriptors directly without everything >> else Proc::Daemon does, it closes all of your open file descriptors by: >> 1. Getting the highest descriptor id: >> https://metacpan.org/source/AKREAL/Proc-Daemon-0.23/lib/Proc/Daemon.pm#L461 >> 2. Looping over them and calling POSIX::close() conditionally >> https://metacpan.org/source/AKREAL/Proc-Daemon-0.23/lib/Proc/Daemon.pm#L224 >> >> If that's not the issue, it might help in your debugging to log the value >> of $timeout just before the call to select() - >> https://metacpan.org/source/TOPAZ/Net-WebSocket-Server-0.003004/lib/Net/WebSocket/Server.pm#L157 >> - that would tell you 1. whether the event loop is re-entering the select() >> consistently and 2. whether the amount of time it intends to wait looks >> sane. >> >> Because it's single-threaded, also look out for bugs where one of your >> handlers (or the server code itself, somehow) is blocking on some kind of >> operation. >> >> I'd also check for memory leaks; see if your server process is using >> increasingly larger amounts of memory as it runs over several hours. >> >> If all else fails, you should at least be able to narrow down which >> section of the main loop in start() gets stuck with some carefully-placed >> log lines. >> >> Please let me know what you find; I'd love to fix a bug if I have one (or >> maybe add better documentation about how to avoid issues in callbacks if >> there's a subtle bug in your code somewhere). >>

Sat Nov 30 10:49:27 2019 pliablepixels [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #131058] on_tick stops being called after 7-8 hours of constant operation
Date:	Sat, 30 Nov 2019 10:49:08 -0500
To:	bug-Net-WebSocket-Server [...] rt.cpan.org
From:	Pliable Pixels <pliablepixels [...] gmail.com>

Forgot to mention: First thing I do is terminate the $wss object in my child as well, after your first response. https://github.com/pliablepixels/zmeventnotification/blob/master/zmeventnotification.pl#L2331-L2336

Sun Dec 01 07:47:11 2019 pliablepixels [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #131058] on_tick stops being called after 7-8 hours of constant operation
Date:	Sun, 1 Dec 2019 07:46:48 -0500
To:	bug-Net-WebSocket-Server [...] rt.cpan.org
From:	Pliable Pixels <pliablepixels [...] gmail.com>

My last update: Updated Server.pm with more debug logs: https://gist.github.com/pliablepixels/348281b4445ac044552917970bf4b47b Notes: a) There was no fork launched or active at the time it got stuck. Last fork was over 30 mins ago from the time it got stuck, which is why I feel there is something else going on. b) As I mentioned earlier, in my updated server code, I shut down the websocket server in the client as the first thing https://github.com/pliablepixels/zmeventnotification/blob/master/zmeventnotification.pl#L2336 Final logs (more debug logs) before it got stuck - shows it gets stuck right after select. Doesn't get into the for loop. Sun Dec 1 03:31:26 2019************ core: after foreach ready_write Sun Dec 1 03:31:26 2019************ core: tick period:5 tick_next:1575189086.8804 time: 1575189086.88234 Sun Dec 1 03:31:26 2019************* calling on_tick Sun Dec 1 03:31:26 2019************ core: inside while Sun Dec 1 03:31:26 2019************ core: before select with timeout:-0.00114798545837402 Sun Dec 1 03:31:26 2019************ core: after select Sun Dec 1 03:31:26 2019************ core: before foreach ready_write Sun Dec 1 03:31:26 2019************ core: after foreach ready_write Sun Dec 1 03:31:26 2019************ core: tick period:5 tick_next:1575189091.8804 time: 1575189086.8837 Sun Dec 1 03:31:26 2019************ core: inside while Sun Dec 1 03:31:26 2019************ core: before select with timeout:4.99665689468384 Sun Dec 1 03:31:31 2019************ core: after select Sun Dec 1 03:31:31 2019************ core: before foreach ready_write Sun Dec 1 03:31:31 2019************ core: after foreach ready_write Sun Dec 1 03:31:31 2019************ core: tick period:5 tick_next:1575189091.8804 time: 1575189091.88569 Sun Dec 1 03:31:31 2019************* calling on_tick Sun Dec 1 03:31:31 2019************ core: inside while Sun Dec 1 03:31:31 2019************ core: before select with timeout:4.99346995353699 Sun Dec 1 03:31:36 2019************ core: after select Sun Dec 1 03:31:36 2019************ core: before foreach ready_write Sun Dec 1 03:31:36 2019************ core: after foreach ready_write Sun Dec 1 03:31:36 2019************ core: tick period:5 tick_next:1575189096.8804 time: 1575189096.88233 Sun Dec 1 03:31:36 2019************* calling on_tick Sun Dec 1 03:31:36 2019************ core: inside while Sun Dec 1 03:31:36 2019************ core: before select with timeout:0.000120162963867188 Sun Dec 1 03:31:36 2019************ core: after select Sun Dec 1 03:31:36 2019************ core: before foreach ready_write Sun Dec 1 03:31:36 2019************ core: after foreach ready_write Sun Dec 1 03:31:36 2019************ core: tick period:5 tick_next:1575189101.8804 time: 1575189096.88386 Sun Dec 1 03:31:36 2019************ core: inside while Sun Dec 1 03:31:36 2019************ core: before select with timeout:4.99649381637573 Sun Dec 1 03:31:37 2019************ core: after select - - - stuck - - -

Mon Dec 23 07:25:57 2019 pliablepixels [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #131058] on_tick stops being called after 7-8 hours of constant operation
Date:	Mon, 23 Dec 2019 07:25:36 -0500
To:	bug-Net-WebSocket-Server [...] rt.cpan.org
From:	Pliable Pixels <pliablepixels [...] gmail.com>

Hi we can close this. I completely reworked how I was handling forks and the problem has gone way. My guess is the problem was one of two things: a) I was handling signals incorrectly in the child process b) There were certain cases where a DB handle was not properly cleaned up. Either way, it does not look like there was any problem with the core WSS code. Thank you. Show quoted text

> >

Thu Dec 26 15:39:32 2019 TOPAZ [...] cpan.org - Correspondence added

I've closed this ticket. Sorry for suddenly going dark; my Decembers are extremely busy. I'm glad you were able to figure it out, though. Please let me know if there could have been any useful warnings in the documentation that might help other users in similar situations.

Thu Dec 26 15:39:32 2019 TOPAZ [...] cpan.org - Status changed from 'open' to 'resolved'