Subject: | Certain output sequences combined with shared-resource-aware tests reliably deadlock `prove -jN` |
Sometimes individual tests in a suite compete for the same resource which only allows one-at-a-time access. Such tests are usually equipped with a shared lock mechanism allowing them to properly avoid contention. Such a setup can deadlock a parallel harness run, in case two lock-aware tests are executed at the same time, produce some STDERR output, and then the one acquiring the lock *first* happens to produce a lot of STDOUT output (on my linux the magic number is > 128k).
This is best demonstrated with a very simple test case, by executing the attached file *twice* as such:
# the ./ is to prevent the harness from complaining
HARNESS_OPTIONS=j2 perl -MTest::Harness -e 'Test::Harness::runtests( qw( ./test.t test.t ) )'
Below is the interesting strace portion as captured by
HARNESS_OPTIONS=j2 strace -tttfo strc perl -MTest::Harness -e 'Test::Harness::runtests( qw( ./test.t test.t ) )'
While the thing was deadlocked the handles were distributed as follows:
/proc/1094/fd:
lrwx------ 1 rabbit rabbit 64 Nov 2 20:48 0 -> /dev/pts/10
lrwx------ 1 rabbit rabbit 64 Nov 2 20:48 1 -> /dev/pts/10
lrwx------ 1 rabbit rabbit 64 Nov 2 20:48 2 -> /dev/pts/10
lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 3 -> pipe:[1425076]
lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 4 -> pipe:[1425079]
lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 5 -> pipe:[1425077]
lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 7 -> pipe:[1424290]
l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 8 -> pipe:[1163262]
/proc/1111/fd:
lrwx------ 1 rabbit rabbit 64 Nov 2 20:48 0 -> /dev/pts/10
l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 1 -> pipe:[1425076]
l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 2 -> pipe:[1425077]
lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 3 -> /home/rabbit/devel/tapwtf/test.t
l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 8 -> pipe:[1163262]
/proc/1112/fd:
lrwx------ 1 rabbit rabbit 64 Nov 2 20:48 0 -> /dev/pts/10
l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 1 -> pipe:[1425079]
l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 2 -> pipe:[1424290]
lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 3 -> /home/rabbit/devel/tapwtf/test.t
l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 8 -> pipe:[1163262]
Pid 1094 is the harness, 1111 and 1112 are the test subprocesses. Note how the harness hangs on a select, only looking at 1111 (the process waiting for the lock to be released), entirely ignoring 1112's handles.
...
1112 1446493701.800355 nanosleep({0, 199664000}, NULL) = 0
1111 1446493702.000280 <... nanosleep resumed> NULL) = 0
1112 1446493702.000400 write(2, "\n\n1112 acquiring lock\n\n", 23 <unfinished ...>
1111 1446493702.000433 write(2, "\n\n1111 acquiring lock\n\n", 23 <unfinished ...>
1112 1446493702.000505 <... write resumed> ) = 23
1094 1446493702.000528 <... select resumed> ) = 2 (in [5 7])
1112 1446493702.000563 lseek(3, 671, SEEK_SET <unfinished ...>
1111 1446493702.000576 <... write resumed> ) = 23
1112 1446493702.000592 <... lseek resumed> ) = 671
1111 1446493702.000607 lseek(3, 671, SEEK_SET <unfinished ...>
1112 1446493702.000624 lseek(3, 0, SEEK_CUR <unfinished ...>
1111 1446493702.000636 <... lseek resumed> ) = 671
1112 1446493702.000650 <... lseek resumed> ) = 671
1111 1446493702.000660 lseek(3, 0, SEEK_CUR <unfinished ...>
1112 1446493702.000688 flock(3, LOCK_EX <unfinished ...>
1111 1446493702.000703 <... lseek resumed> ) = 671
1112 1446493702.000731 <... flock resumed> ) = 0
1111 1446493702.000741 flock(3, LOCK_EX <unfinished ...>
1112 1446493702.001010 write(1, "ok 00000001 Stuff 32 bytes long\n"..., 8192) = 8192
1094 1446493702.001206 times({tms_utime=18, tms_stime=2, tms_cutime=0, tms_cstime=0}) = 1724715580
1112 1446493702.001340 write(1, "ok 00000257 Stuff 32 bytes long\n"..., 8192) = 8192
1094 1446493702.001453 select(8, [3 5], NULL, NULL, NULL) = 1 (in [5])
1094 1446493702.001601 read(5, "\n\n1111 acquiring lock\n\n", 65536) = 23
1094 1446493702.001663 brk(0x18db000) = 0x18db000
1094 1446493702.001718 write(2, "\n\n1111 acquiring lock\n\n", 23 <unfinished ...>
1112 1446493702.001731 write(1, "ok 00000513 Stuff 32 bytes long\n"..., 8192 <unfinished ...>
1094 1446493702.001750 <... write resumed> ) = 23
1112 1446493702.001760 <... write resumed> ) = 8192
1094 1446493702.001783 select(8, [3 5], NULL, NULL, NULL <unfinished ...>
1112 1446493702.002034 write(1, "ok 00000769 Stuff 32 bytes long\n"..., 8192) = 8192
1112 1446493702.002363 write(1, "ok 00001025 Stuff 32 bytes long\n"..., 8192) = 8192
1112 1446493702.002705 write(1, "ok 00001281 Stuff 32 bytes long\n"..., 8192) = 8192
1112 1446493702.003052 write(1, "ok 00001537 Stuff 32 bytes long\n"..., 8192) = 8192
1112 1446493702.003435 write(1, "ok 00001793 Stuff 32 bytes long\n"..., 8192) = 8192
1112 1446493702.003785 write(1, "ok 00002049 Stuff 32 bytes long\n"..., 8192 <unfinished ...>
1111 1446493721.777774 <... flock resumed> ) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
1111 1446493721.777840 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
...
Subject: | test.t |
use Fcntl ':flock';
use Time::HiRes qw( time sleep );
# kill everything after some time
$SIG{ALRM} = sub { kill 'KILL', 0 };
alarm( 20 );
# start both tests on the "verge of a second"
sleep (int(time) + 1 - time );
# Writing something to stderr is crucial - without this no deadlock will occur
print STDERR "\n\n$$ acquiring lock\n\n";
# Simply wait for a shared resource before continuing
#
# not using Fcntl to make strace easier to read
# may not work on non-linux OS-es
flock( *DATA, LOCK_EX ) or die "Unable to lock: $!";
my $max = 4100; # 4000 works on my system
printf "ok %08d Stuff 32 bytes long\n", $_ for 1..$max;
print "1..$max\n";
exit 0;
__DATA__
Lock lock - who's there?