Skip Menu |

This queue is for tickets about the POE CPAN distribution.

Report information
The Basics
Id: 95072
Status: resolved
Priority: 0/
Queue: POE

People
Owner: Nobody in particular
Requestors: jhuckaby [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: CPU pinned at 100% with POE::Component::Server::TCP and SSLify
Date: Sat, 26 Apr 2014 09:20:45 -0700
To: bug-POE [...] rt.cpan.org
From: Joseph Huckaby <jhuckaby [...] gmail.com>
Hey POE Team, I believe I have found a bug using POE::Component::Server::TCP and SSLify. I have a very simple HTTPS web server (created using the cookbook example), which does nothing except serve up 300K of plain text for every incoming HTTPS GET request. This works fine until it receives multiple requests at once. Then it shoots up to 100% CPU and stays there indefinitely, long after the requests have completed. Here is a snapshot of "top" on my server after sending in 10 simultaneous requests and allowing them to complete fully: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1365 root 20 0 191m 20m 4168 R 99.9 0.6 0:10.95 perl The CPU stays at or around 99.9% indefinitely (I have witnessed several days at least). Also, something to note, even after the CPU is pinned at 100%, the server still accepts new incoming connections, and serves them. Here is a ZIP file containing my script and sample SSL cert: https://dl.dropboxusercontent.com/u/201533/poe-tcp-sslify-bug.zip Here are links to the individual files: https://dl.dropboxusercontent.com/u/201533/poe-tcp-sslify-bug/server.pl https://dl.dropboxusercontent.com/u/201533/poe-tcp-sslify-bug/ssl.crt https://dl.dropboxusercontent.com/u/201533/poe-tcp-sslify-bug/ssl.key I can reproduce the issue every time by sending in 10 simultaneous HTTPS GET requests from a shell script such as this: #!/bin/sh # Fetch URL 10 times simultaneously curl --insecure "$1" >/dev/null 2>&1 & curl --insecure "$1" >/dev/null 2>&1 & curl --insecure "$1" >/dev/null 2>&1 & curl --insecure "$1" >/dev/null 2>&1 & curl --insecure "$1" >/dev/null 2>&1 & curl --insecure "$1" >/dev/null 2>&1 & curl --insecure "$1" >/dev/null 2>&1 & curl --insecure "$1" >/dev/null 2>&1 & curl --insecure "$1" >/dev/null 2>&1 & curl --insecure "$1" >/dev/null 2>&1 & The --insecure flag is to prevent curl from validating the cert (which is self-signed). I am running: POE version 1.358 POE::Component::SSLify version 1.008 This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi Linux pvp.effectgames.com 3.4.82-69.112.amzn1.x86_64 #1 SMP Mon Feb 24 16:31:21 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux openssl-perl-1.0.1g-1.69.amzn1.x86_64 openssl-1.0.1g-1.69.amzn1.x86_64 Other Notes: I cannot reproduce this without SSLify. Meaning, if I strip out all the SSL code and just run a plain HTTP server, everything works fine. Also note that I cannot seem to reproduce this on my local OS X machine hitting localhost, although I suspect that is because localhost is a virtual interface, and all the requests complete "instantly". This bug seems to require a slower network (DSL to Amazon EC2 for example) to occur. If there is anything else you'd like me to try, please let me know. My server is at your disposal! Thanks for your time! - Joe
On Sat Apr 26 12:21:04 2014, jhuckaby@gmail.com wrote: Show quoted text
> Hey POE Team, > > I believe I have found a bug using POE::Component::Server::TCP and > SSLify.
Hi, Joe! I've run your test server and script here, and I'm not seeing high CPU afterwards. Of course I'm using localhost on OS X, and you've already said that doesn't reproduce the problem. Show quoted text
> If there is anything else you'd like me to try, please let me know. > My server is at your disposal!
Spinning can be caused by two things. First, a filehandle is ready for action, but application (or library) code isn't handling the condition. Second, something is posting messages just as fast as they're being consumed, forever. An strace (or equivalent) report would quickly tell us what's going on under the hood.
Subject: Re: [rt.cpan.org #95072] CPU pinned at 100% with POE::Component::Server::TCP and SSLify
Date: Sun, 27 Apr 2014 10:13:24 -0700
To: bug-POE [...] rt.cpan.org
From: Joseph Huckaby <jhuckaby [...] gmail.com>
Hey Rocco! Thanks for the quick reply! Here is the strace you asked for. I let it run for several seconds after 10 HTTPS requests completed, with CPU confirmed at 100%: https://dl.dropboxusercontent.com/u/201533/poe-tcp-sslify-bug/trace.out.gz Looks like it's getting stuck in some kind of date/time/select loop: select(16, [3 4 7 8 9 10 11 13], [7 8 9 10 11 13], [], {3599, 999931}) = 6 (out [7 8 9 10 11 13], left {3599, 999927}) gettimeofday({1398618191, 49675}, NULL) = 0 gettimeofday({1398618191, 50301}, NULL) = 0 gettimeofday({1398618191, 50370}, NULL) = 0 gettimeofday({1398618191, 50438}, NULL) = 0 select(16, [3 4 7 8 9 10 11 13], [7 8 9 10 11 13], [], {3599, 999932}) = 6 (out [7 8 9 10 11 13], left {3599, 999929}) gettimeofday({1398618191, 50605}, NULL) = 0 gettimeofday({1398618191, 51296}, NULL) = 0 gettimeofday({1398618191, 51367}, NULL) = 0 gettimeofday({1398618191, 51435}, NULL) = 0 select(16, [3 4 7 8 9 10 11 13], [7 8 9 10 11 13], [], {3599, 999932}) = 6 (out [7 8 9 10 11 13], left {3599, 999928}) gettimeofday({1398618191, 51604}, NULL) = 0 gettimeofday({1398618191, 52235}, NULL) = 0 gettimeofday({1398618191, 52304}, NULL) = 0 gettimeofday({1398618191, 52372}, NULL) = 0 select(16, [3 4 7 8 9 10 11 13], [7 8 9 10 11 13], [], {3599, 999932}) = 6 (out [7 8 9 10 11 13], left {3599, 999928}) gettimeofday({1398618191, 52540}, NULL) = 0 gettimeofday({1398618191, 53166}, NULL) = 0 gettimeofday({1398618191, 53235}, NULL) = 0 gettimeofday({1398618191, 53303}, NULL) = 0 - Joe On Apr 27, 2014, at 9:36 AM, "Rocco Caputo via RT" <bug-POE@rt.cpan.org> wrote: Show quoted text
> <URL: https://rt.cpan.org/Ticket/Display.html?id=95072 > > > On Sat Apr 26 12:21:04 2014, jhuckaby@gmail.com wrote:
>> Hey POE Team, >> >> I believe I have found a bug using POE::Component::Server::TCP and >> SSLify.
> > Hi, Joe! > > I've run your test server and script here, and I'm not seeing high CPU afterwards. Of course I'm using localhost on OS X, and you've already said that doesn't reproduce the problem. >
>> If there is anything else you'd like me to try, please let me know. >> My server is at your disposal!
> > Spinning can be caused by two things. First, a filehandle is ready for action, but application (or library) code isn't handling the condition. Second, something is posting messages just as fast as they're being consumed, forever. > > An strace (or equivalent) report would quickly tell us what's going on under the hood.
On Sun Apr 27 13:13:41 2014, jhuckaby@gmail.com wrote: Show quoted text
> > Looks like it's getting stuck in some kind of date/time/select loop: > > select(16, [3 4 7 8 9 10 11 13], [7 8 9 10 11 13], [], {3599, 999931}) > = 6 (out [7 8 9 10 11 13], left {3599, 999927}) > gettimeofday({1398618191, 49675}, NULL) = 0 > gettimeofday({1398618191, 50301}, NULL) = 0 > gettimeofday({1398618191, 50370}, NULL) = 0 > gettimeofday({1398618191, 50438}, NULL) = 0 > select(16, [3 4 7 8 9 10 11 13], [7 8 9 10 11 13], [], {3599, 999932}) > = 6 (out [7 8 9 10 11 13], left {3599, 999929})
This illustrates the spin condition where file descriptors are ready but nothing is handling them. There are no read, write, shutdown, or close calls to stop those descriptors from being ready forever. I agree with you, it sounds like a race condition, and it may be easier to reproduce over a slower connection. I'll pursue that to reproduce it here. Meanwhile, try running the server with POE's assertions enabled: POE_ASSERT_DEFAULT=1 perl ./server.pl (run the client) (Does anything explode on the server side?)
Subject: Re: [rt.cpan.org #95072] CPU pinned at 100% with POE::Component::Server::TCP and SSLify
Date: Sun, 27 Apr 2014 11:43:01 -0700
To: bug-POE [...] rt.cpan.org
From: Joseph Huckaby <jhuckaby [...] gmail.com>
Hey Rocco, POE_ASSERT_DEFAULT doesn't seem to cause any server explosions. It behaves exactly like it did before. Update: I was finally able to reproduce the bug with the server script running on my Mac. I just had to open a port in my router and hit it from the outside world (i.e. that slow connection thing again). OS X gets pinned to 100% CPU indefinitely just like my Linux server did. I tried to find an strace alternative on OS X and found people talking about "dtruss", but it doesn't seem to work properly for me. The script exits after a few seconds before I even have a chance to send it any URLs. I did try Apple's Activity Monitor and its "Sample Process" option. I sampled it while pinned at 100% CPU and saved that report off as a text file if you want to see: https://dl.dropboxusercontent.com/u/201533/poe-tcp-sslify-bug/osx-sample-perl5.16.txt If you want me to grab any other debugging info on OS X, please provide instructions and I'd be happy to do it. - Joe On Apr 27, 2014, at 10:48 AM, "Rocco Caputo via RT" <bug-POE@rt.cpan.org> wrote: Show quoted text
> <URL: https://rt.cpan.org/Ticket/Display.html?id=95072 > > > On Sun Apr 27 13:13:41 2014, jhuckaby@gmail.com wrote:
>> >> Looks like it's getting stuck in some kind of date/time/select loop: >> >> select(16, [3 4 7 8 9 10 11 13], [7 8 9 10 11 13], [], {3599, 999931}) >> = 6 (out [7 8 9 10 11 13], left {3599, 999927}) >> gettimeofday({1398618191, 49675}, NULL) = 0 >> gettimeofday({1398618191, 50301}, NULL) = 0 >> gettimeofday({1398618191, 50370}, NULL) = 0 >> gettimeofday({1398618191, 50438}, NULL) = 0 >> select(16, [3 4 7 8 9 10 11 13], [7 8 9 10 11 13], [], {3599, 999932}) >> = 6 (out [7 8 9 10 11 13], left {3599, 999929})
> > This illustrates the spin condition where file descriptors are ready but nothing is handling them. There are no read, write, shutdown, or close calls to stop those descriptors from being ready forever. > > I agree with you, it sounds like a race condition, and it may be easier to reproduce over a slower connection. I'll pursue that to reproduce it here. > > Meanwhile, try running the server with POE's assertions enabled: > > POE_ASSERT_DEFAULT=1 perl ./server.pl > > (run the client) > (Does anything explode on the server side?) >
On Sun Apr 27 14:43:18 2014, jhuckaby@gmail.com wrote: Show quoted text
> > POE_ASSERT_DEFAULT doesn't seem to cause any server explosions. It > behaves exactly like it did before.
I've seen this elsewhere recently. It seems to be a race condition in POE::Component::SSLify, where the underlying I/O remains ready but attempts to read through that other module don't clear the condition. I suppose something to check is whether the problem goes away when POE::Component::SSLify is removed from the stack. I'm not sure there's anything I can do on the POE level to remedy that.
Subject: Re: [rt.cpan.org #95072] CPU pinned at 100% with POE::Component::Server::TCP and SSLify
Date: Mon, 7 Jul 2014 22:58:58 -0700
To: bug-POE [...] rt.cpan.org
From: Joseph Huckaby <jhuckaby [...] gmail.com>
Hey Rocco, Confirmed, the bug does not occur without POE::Component::SSLify in the stack. I sent a ticket to the SSLify author several months ago, but no reply, alas. https://rt.cpan.org/Public/Bug/Display.html?id=95071 Thanks for your time on this. - Joe On Mon, Jul 7, 2014 at 7:51 PM, Rocco Caputo via RT <bug-POE@rt.cpan.org> wrote: Show quoted text
> <URL: https://rt.cpan.org/Ticket/Display.html?id=95072 > > > On Sun Apr 27 14:43:18 2014, jhuckaby@gmail.com wrote:
> > > > POE_ASSERT_DEFAULT doesn't seem to cause any server explosions. It > > behaves exactly like it did before.
> > I've seen this elsewhere recently. It seems to be a race condition in > POE::Component::SSLify, where the underlying I/O remains ready but attempts > to read through that other module don't clear the condition. > > I suppose something to check is whether the problem goes away when > POE::Component::SSLify is removed from the stack. > > I'm not sure there's anything I can do on the POE level to remedy that. >
Resolving. Confirmed not a bug in the POE distribution, and a bug was opened for the culprit.