Bug #34093 for libwww-perl: LWP::UserAgent requests timing out early at random times

Fri Mar 14 14:16:11 2008 jim_hoagland [...] symantec.com - Ticket created

Subject:

LWP::UserAgent requests timing out early at random times

I'm having a problem with early timeouts from the request method of LWP::UserAgent. Not sure it if matters, but the request is HTTP PUT and to a server on the local network. I'm using the default timeout of 180 seconds, though this has also happened when I explicitly set the timeout to 120 seconds. Sometimes it takes a while for the server to reply (it's doing some database queries), so it might validly take a minute to respond with its result. The problem is that sometimes (fairly frequently when the server is being slow) I'm seeing the request method return with a "500 read timeout" and "Client-Warning: Internal response" well before my requested timeout. I added code to measure and print out the elapsed time and it can be as low as less than one second all the way up to around minute. Seems pretty random how long it's actually waiting for on a given request. (Of course I would never see a timeout for a length of time longer than it takes the server to respond.) I've seen this with version 5.805 on Debian (2.6.18-3-686-bigmem #1 SMP) running perl 5.8.8 and with version 5.803 on Xubuntu (2.6.1526-386 #1 PREEMPT) running perl 5.8.7. Thanks.

Mon Apr 07 05:30:06 2008 GAAS [...] cpan.org - Correspondence added

I have no good idea how this could happen. Perhaps you can try to run your program under strace to try to figure out what happens at the socket level.

Mon Apr 07 05:30:10 2008 The RT System itself - Status changed from 'new' to 'open'

Mon Apr 07 10:22:09 2008 jim_hoagland [...] symantec.com - Correspondence added

Subject:	Re: [rt.cpan.org #34093] LWP::UserAgent requests timing out early at random times
Date:	Mon, 07 Apr 2008 07:21:24 -0700
To:	<bug-libwww-perl [...] rt.cpan.org>
From:	Jim Hoagland <jim_hoagland [...] symantec.com>

Thanks, I'll see if I can try that before going on vacation, otherwise it'll have to wait. I have no evidence this is relevant for this case, but I seem to recall when doing C coding (under Unix) some years back that I had to put a while loop around a socket call (may have been a select()) since it was timing out early. As I recall in that case, the cause was that a signal was occurring and the OS semantics allowed the timeout to end early in that case. If this is the case, then the only thing that could be done is to check if the OS has stopped waiting early (by doing doing time() before and after) and if so, try again with whatever amount of time remains. However, this is something that would have to be done within in LWP::UserAgent::request() call. -- Jim On 4/7/08 2:30 AM, "Gisle_Aas via RT" <bug-libwww-perl@rt.cpan.org> wrote: Show quoted text

> > <URL: http://rt.cpan.org/Ticket/Display.html?id=34093 > > > I have no good idea how this could happen. Perhaps you can try to run > your program under strace to try to figure out what happens at the > socket level.

Wed Apr 09 17:58:38 2008 jim_hoagland [...] symantec.com - Correspondence added

From:

jim_hoagland [...] symantec.com

My speculation about a signal-related early timeout is correct. When I first created a simplified version of my script to run strace on, I could not reproduce the early timeout issue. Then I added a fork() where the child would exit after a random number of seconds. I then started seeing the "500 read timeout" very frequently; looking back at the strace, this always occurred immediately after a child exited. The top of my script has: my $reap = 0; #whether to try to reap children $SIG{CHLD} = sub { $reap = 1;}; When I commented out the "$SIG{CHLD}=" line, no 500 read timeouts occurred. The same happened when I set "$SIG{CHLD}='IGNORE'". Putting the original back in place resulted in the early timeouts. I think this pretty conclusively establishes that LWP::UserAgent::request is exiting early if a SIGCHLD occurs and a signal handler is in place to handle it, at least on my systems. I don't know the exact mechanics, but it appears that when returning from the SIGCHLD handler, code execution does not resume inside of the select()'s waiting code, but instead outside of it. It does seem like this case should be handled in LWP::UserAgent (or in socket classes it depends on), in order to preserve the accuracy of the timeout request. I'm not sure yet how I'm going to work around this in the mean time.

Thu Apr 10 05:13:52 2008 GAAS [...] cpan.org - Correspondence added

So it seems like we need to add explicit code to restart our sysread/syswrite calls when they fail with EAGAIN. A workaround for you could be to set the environment variable PERL_SIGNALS to the value 'unsafe' before starting the program. This will restore the old (pre 5.8.1) behaviour where perl pass in SA_RESTART with the signal handler. That should make the read/write system calls restart automatically.

Thu Apr 10 06:08:01 2008 GAAS [...] cpan.org - Correspondence added

On Thu Apr 10 05:13:52 2008, GAAS wrote: Show quoted text

> So it seems like we need to add explicit code to restart our > sysread/syswrite calls when they fail with EAGAIN.

That should have been EINTR, but we should also handle EAGAIN in the same way basically.

Fri Apr 11 07:29:34 2008 GAAS [...] cpan.org - Correspondence added

I've now applied this patch that I believe will fix this issue. It's a bit hard to write tests for this though. The patch will appear in LWP-5.811. http://gitorious.org/projects/libwww-perl/repos/mainline/commits/438845c6a2193bf8fe5a2fd307a3fcec0e7aefbc

Fri Apr 11 07:29:36 2008 GAAS [...] cpan.org - Status changed from 'open' to 'resolved'