> This is looking more like an underlying system problem as it only
> appears to happen on RHEL 6/CentOS 6 systems.
>
> I've also validated that the application code never does a blocking
> send (added a die if we try to be sure)
> {
>     no warnings 'redefine';
>     my $msg = "Ensure we never do a foreground send because it can block forever";
>     *Net::DNS::Resolver::Base::send      = sub { die $msg };
>     *Net::DNS::Resolver::Base::_send_tcp = sub { die $msg };
>     *Net::DNS::Resolver::Base::_send_udp = sub { die $msg };
> }
>
>
> I'm continuing to dig in on this and will update this case when I have
> more information.
>
>
> On Wed Jul 05 12:19:49 2017, rwfranks@acm.org wrote:
> > On Tue Jul 04 15:07:29 2017, bdraco wrote:
> > > On Tue Jul 04 10:52:08 2017, rwfranks@acm.org wrote:
> > > > On Mon Jul 03 18:49:08 2017, bdraco wrote:
> > > > > ->bgsend calls ->_bgsend_udp
> > > > >
> > > > > which calls
> > > > >
> > > > > ->send and gets $ans->header->tc
> > > > >
> > > > > So it calls _send_tcp and it won't have blocking(0) set on the socket,
> > > > > so it will stall forever if it doesn't get a response.
> > > >
> > > > What version of Net::DNS are you using?
> > > >
> > > > Have you tried using latest version (1.11)?
> > > >
> > > > Also $resolver->tcp_timeout needs to be set to a reasonable value for
> > > > what you are trying to do. (The default is _very_ long; there is no
> > > > one value that will satisfy everybody).
> > >
> > > Even with 1.11 it never times out because it's doing the call with a
> > > blocking file handle and recvfrom() never returns.
> > >
> > > strace -p 515453
> > > Process 515453 attached
> > > recvfrom(9,
> > >
> > > ...hang....
> > >
> > > #0 0x00007f5f6a22dc83 in __recvfrom_nocancel () from /lib64/libpthread.so.0
> > > #1 0x00007f5f6a52bd53 in Perl_pp_sysread () from /usr/local/cpanel/3rdparty/perl/524/lib64/perl5/5.24.1/x86_64-linux-64int/CORE/libperl.so
> > > #2 0x00007f5f6a4ec715 in Perl_runops_standard () from /usr/local/cpanel/3rdparty/perl/524/lib64/perl5/5.24.1/x86_64-linux-64int/CORE/libperl.so
> > > #3 0x00007f5f6a48dc3d in perl_run () from /usr/local/cpanel/3rdparty/perl/524/lib64/perl5/5.24.1/x86_64-linux-64int/CORE/libperl.so
> > > #4 0x0000000000475493 in main ()
> >
> >
> > You claim that bgsend() blocks at the TCP socket->send()
> >
> > $ cat test.pl
> > #!/usr/bin/perl
> > #
> > use 5.24.1;
> > use Net::DNS 1.11;
> >
> > my $resolver = Net::DNS::Resolver->new(
> >     debug       => 0,
> >     nameserver  => '185.49.140.63',
> >     udp_timeout => 10,
> >     tcp_timeout => 10,
> >     usevc       => 0,
> > );
> >
> > my $handle = $resolver->bgsend(qw(net-dns.org DNSKEY IN));
> >
> > while ( $resolver->bgbusy($handle) ) {
> >     print "$handle not blocked: socktype ", $handle->socktype(), "\n";
> >     select( undef, undef, undef, 0.005 );    # limit CPU burn
> > }
> >
> > my $packet = $resolver->bgread($handle);
> > print 'answer from ', $resolver->answerfrom(), "\n" if $packet;
> >
> > exit;
> >
> >
> > $ perl -w test.pl
> > IO::Socket::IP=GLOB(0x8fac718) not blocked: socktype 2
> > IO::Socket::IP=GLOB(0x8fac718) not blocked: socktype 2
> > IO::Socket::IP=GLOB(0x8fac718) not blocked: socktype 2
> > IO::Socket::IP=GLOB(0x8fac718) not blocked: socktype 2
> > IO::Socket::IP=GLOB(0x8fac718) not blocked: socktype 2
> > IO::Socket::IP=GLOB(0x8fac718) not blocked: socktype 2
> > IO::Socket::IP=GLOB(0x8fad1e8) not blocked: socktype 1
> > IO::Socket::IP=GLOB(0x8fad1e8) not blocked: socktype 1
> > IO::Socket::IP=GLOB(0x8fad1e8) not blocked: socktype 1
> > IO::Socket::IP=GLOB(0x8fad1e8) not blocked: socktype 1
> > IO::Socket::IP=GLOB(0x8fad1e8) not blocked: socktype 1
> > answer from 185.49.140.63
> >
> > This does not appear to be supported by experimental evidence.
> >
> > If this test is not repeatable on your system, the local TCP probably
> > does not support non-blocking sockets.
> >
> > If the test gives a similar result, the problem lies in your
> > application code.
> >
> > If you disagree, please provide a _small_ counter-example which
> > supports your argument.
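It's looking more like a kernel bug: the problem only happens when the file descriptor number was previously used to read netlink data from the kernel, even after that descriptor has been closed. In the trace below, fd 10 is first a netlink socket, is closed, and is then reused for the TCP query: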
303262 recvmsg(10, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\36V^Y\236\240\4\0\0\0\0\0", 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
303262 close(10) = 0
--- Used for netlink above ----
--- Now 10 is reused for TCP ---
303262 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
303262 ioctl(10, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffcbd3b7280) = -1 EINVAL (Invalid argument)
303262 lseek(10, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
303262 ioctl(10, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffcbd3b7280) = -1 EINVAL (Invalid argument)
303262 lseek(10, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
303262 fcntl(10, F_SETFD, FD_CLOEXEC) = 0
303262 bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
303262 fcntl(10, F_GETFL) = 0x2 (flags O_RDWR)
303262 fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
303262 connect(10, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("121.51.128.164")}, 16) = -1 EINPROGRESS (Operation now in progress)
303262 select(16, NULL, [10], [10], {3, 0}) = 1 (left {1, 777870})
303262 getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
303262 fcntl(10, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
303262 fcntl(10, F_SETFL, O_RDWR) = 0
303262 getpeername(10, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("121.51.128.164")}, [16]) = 0
303262 getpeername(10, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("121.51.128.164")}, [16]) = 0
303262 sendto(10, "\0/~,\0\0\0\1\0\0\0\0\0\1\7webmail\6szsqcn\3com\0\0\1\0\1\0\0)\4\0\0\0\0\0\0\0", 49, 0, NULL, 0) = 49
----
SNIP
----
303262 select(16, [8 9 10 11 12], NULL, NULL, {3, 0}) = 5 (in [8 9 10 11 12], left {2, 999995})
303262 select(16, [10], NULL, NULL, {0, 200000}) = 1 (in [10], left {0, 199998})
303262 select(16, [10], NULL, NULL, {0, 200000}) = 1 (in [10], left {0, 199998})
303262 select(16, [10], NULL, NULL, {0, 0}) = 1 (in [10], left {0, 0})
303262 recvfrom(10, "\0|", 2, 0, {sa_family=0x7465 /* AF_??? */, sa_data="h0:cp15\0\0\0@\0\0\0"}, [0]) = 2
303262 recvfrom(10, <== STALL
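
For comparison, here is a minimal sketch of how the TCP reply could be read without ever issuing a blocking recvfrom(). In the trace above, O_NONBLOCK is cleared again after connect() (the second F_SETFL), select() reports fd 10 readable, the 2-byte length prefix "\0|" (124) is read, and the next recvfrom() then waits for the body with no timeout. The sketch keeps O_NONBLOCK set and applies one overall deadline to both the prefix and the body. It uses only core modules; read_exactly() and read_tcp_dns_message() are hypothetical helper names of mine, not part of Net::DNS, and this is not a description of what Net::DNS does internally.

```perl
use strict;
use warnings;
use IO::Socket;               # also exports the Socket constants
use IO::Select;
use Time::HiRes qw(time);

# Hypothetical helper: read exactly $want bytes from $sock before $deadline
# (epoch seconds). Returns the bytes, or nothing on timeout, EOF or error.
sub read_exactly {
    my ( $sock, $want, $deadline ) = @_;

    $sock->blocking(0);       # keep O_NONBLOCK set for the whole read
    my $select = IO::Select->new($sock);
    my $buf    = '';

    while ( length($buf) < $want ) {
        my $left = $deadline - time();
        return if $left <= 0;                     # deadline passed

        my @ready = $select->can_read($left);
        next unless @ready;                       # loop re-checks the deadline

        # Append at the current end of $buf; never asks for more than needed.
        my $n = sysread( $sock, $buf, $want - length($buf), length($buf) );
        if ( !defined $n ) {
            next if $!{EAGAIN} || $!{EWOULDBLOCK} || $!{EINTR};
            return;                               # real socket error
        }
        return if $n == 0;                        # peer closed the connection
    }
    return $buf;
}

# Hypothetical helper: read one DNS-over-TCP message (2-byte big-endian
# length prefix followed by the message) within $timeout seconds.
sub read_tcp_dns_message {
    my ( $sock, $timeout ) = @_;
    my $deadline = time() + $timeout;

    my $prefix = read_exactly( $sock, 2, $deadline );
    return unless defined $prefix;

    my $len = unpack( 'n', $prefix );
    return read_exactly( $sock, $len, $deadline );
}
```

If a read like this still hangs on a descriptor that was previously a netlink socket, that would point more strongly at the kernel. If it times out cleanly, the stall is more likely just a blocking read on a reply whose body never arrived.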