Hi Eric, thank you for the detailed explanation.
Unfortunately, now that I've added debug prints everywhere, the crash has
stopped occurring.
I've decided to write a small program that mimics how I use your code,
without the complexities of my in-fork code. I added code to close the
fds in the child, per your suggestion.
The gist is here:
https://gist.github.com/pliablepixels/440e14532beac01af0453f9f7e322519
When the child exits, I see:
Warning: unable to close filehandle properly: Bad file descriptor during
global destruction.
Would you know if I'm doing something wrong? It seems the code is closing
an fd it should not. I'm skipping 0-2 (in/out/err), and the code reports
that the first fd it closes is 4.
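
For reference, here is a minimal sketch of the close loop described above.
It is not the code from the gist: close_inherited_fds, its keep option,
and the 1024 cap are hypothetical names and values chosen for
illustration. One possible explanation for the warning, offered as an
assumption rather than a diagnosis: POSIX::close() closes the descriptor
underneath any Perl filehandle that still points at it, so Perl's own
close of that handle during global destruction fails with EBADF. Leaving
the child via POSIX::_exit() skips global destruction.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper (not part of Net::WebSocket::Server or the gist).
# Closes descriptors 3 .. $limit-1 except those listed in keep => [...],
# and returns the descriptors that were open and got closed.
sub close_inherited_fds {
    my (%opt) = @_;
    my %keep  = map { $_ => 1 } @{ $opt{keep} || [] };

    # Assumption: cap the scan at 1024, since sysconf can report a very
    # large limit on some systems (or fail and return undef).
    my $limit = POSIX::sysconf(POSIX::_SC_OPEN_MAX());
    $limit = 1024 if !defined $limit || $limit < 0 || $limit > 1024;

    my @closed;
    for my $fd (3 .. $limit - 1) {
        next if $keep{$fd};
        # POSIX::close returns undef on failure (e.g. EBADF for unused slots).
        push @closed, $fd if defined POSIX::close($fd);
    }
    return @closed;
}

my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    close_inherited_fds();
    # ... the child's real work would go here ...
    POSIX::_exit(0);    # exit without running global destruction
}
waitpid($pid, 0);
```

Note that POSIX::_exit() also skips END blocks and does not flush
buffered output, so anything the child still needs to write should be
flushed first.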
Thanks
On Sat, Nov 23, 2019 at 8:20 PM Eric Wastl via RT <
bug-Net-WebSocket-Server@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=131058 >
>
> On Sat Nov 23 06:28:30 2019, pliablepixels@gmail.com wrote:
> > Hi, thanks for the great library. I am using your library to implement
> > an
> > event server of sorts that receives alarm notifications from an open
> > source
> > home security server (ZoneMinder) and does machine learning to detect
> > objects in the alarm feed.
> >
> > The core "loop" of my server relies on your on_tick callback here
> >
> > https://github.com/pliablepixels/zmeventnotification/blob/master/zmeventnotification.pl#L2320-L2351
> >
> > What I've been observing is that after around 7-8 hours of smooth
> > operation, the 'on_tick' handler is just not called. I inserted
> > debugging
> > statements inside /usr/local/share/perl/5.26.1/Net/WebSocket/Server.pm
> > to
> > log messages when it invokes the on_tick and have confirmed it does
> > not.
> > Nor does it exit the start() sub.
> >
> > I've now set it up to log more debug statements after each step
> > (select/etc) inside server.pm but wanted to reach out to you to ask if
> > you've experienced this in the past and know what this may be?
> >
> > At the time it "locks up" my server is still running (hasn't crashed)
> > but
> > obviously I can't connect to it, because it's locked somewhere.
> >
> > I'm still debugging, but thought I'd ask :-)
> >
> > thx
>
>
> I don't know of a bug like this, but that doesn't mean there isn't one.
>
> Using fork() inside any handler for a server is tricky business; that
> might be the cause. fork() can do weird things with cloning filehandles or
> sharing sockets. I'm not exactly sure how Net::WebSocket::Server behaves
> when forked, but it might help to directly close your open file descriptors
> before continuing in the child process. (Which descriptors you close
> depends on your application.) There are some modules that make forking a
> child process safer, like Proc::Daemon, if that suits your needs. If you
> want to just try closing the file descriptors directly without everything
> else Proc::Daemon does, it closes all of your open file descriptors by:
> 1. Getting the highest descriptor id:
>    https://metacpan.org/source/AKREAL/Proc-Daemon-0.23/lib/Proc/Daemon.pm#L461
> 2. Looping over them and calling POSIX::close() conditionally:
>    https://metacpan.org/source/AKREAL/Proc-Daemon-0.23/lib/Proc/Daemon.pm#L224
>
> If that's not the issue, it might help in your debugging to log the value
> of $timeout just before the call to select() -
> https://metacpan.org/source/TOPAZ/Net-WebSocket-Server-0.003004/lib/Net/WebSocket/Server.pm#L157
> - that would tell you 1. whether the event loop is re-entering the select()
> consistently and 2. whether the amount of time it intends to wait looks
> sane.
>
> Because it's single-threaded, also look out for bugs where one of your
> handlers (or the server code itself, somehow) is blocking on some kind of
> operation.
>
> I'd also check for memory leaks; see if your server process is using
> increasingly larger amounts of memory as it runs over several hours.
>
> If all else fails, you should at least be able to narrow down which
> section of the main loop in start() gets stuck with some carefully-placed
> log lines.
>
> Please let me know what you find; I'd love to fix a bug if I have one (or
> maybe add better documentation about how to avoid issues in callbacks if
> there's a subtle bug in your code somewhere).
>