Bug #108390 for Test-Harness: Certain output sequences combined with shared-resource-aware tests reliably deadlock `prove -jN`

Subject:

Certain output sequences combined with shared-resource-aware tests reliably deadlock `prove -jN`

Sometimes individual tests in a suite compete for the same resource which only allows one-at-a-time access. Such tests are usually equipped with a shared lock mechanism allowing them to properly avoid contention. Such a setup can deadlock a parallel harness run, in case two lock-aware tests are executed at the same time, produce some STDERR output, and then the one acquiring the lock *first* happens to produce a lot of STDOUT output (on my linux the magic number is > 128k). This is best demonstrated with a very simple test case, by executing the attached file *twice* as such: # the ./ is to prevent the harness from complaining HARNESS_OPTIONS=j2 perl -MTest::Harness -e 'Test::Harness::runtests( qw( ./test.t test.t ) )' Below is the interesting strace portion as captured by HARNESS_OPTIONS=j2 strace -tttfo strc perl -MTest::Harness -e 'Test::Harness::runtests( qw( ./test.t test.t ) )' While the thing was deadlocked the handles were distributed as follows: /proc/1094/fd: lrwx------ 1 rabbit rabbit 64 Nov 2 20:48 0 -> /dev/pts/10 lrwx------ 1 rabbit rabbit 64 Nov 2 20:48 1 -> /dev/pts/10 lrwx------ 1 rabbit rabbit 64 Nov 2 20:48 2 -> /dev/pts/10 lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 3 -> pipe:[1425076] lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 4 -> pipe:[1425079] lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 5 -> pipe:[1425077] lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 7 -> pipe:[1424290] l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 8 -> pipe:[1163262] /proc/1111/fd: lrwx------ 1 rabbit rabbit 64 Nov 2 20:48 0 -> /dev/pts/10 l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 1 -> pipe:[1425076] l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 2 -> pipe:[1425077] lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 3 -> /home/rabbit/devel/tapwtf/test.t l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 8 -> pipe:[1163262] /proc/1112/fd: lrwx------ 1 rabbit rabbit 64 Nov 2 20:48 0 -> /dev/pts/10 l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 1 -> pipe:[1425079] l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 2 -> pipe:[1424290] lr-x------ 1 rabbit rabbit 64 Nov 2 20:48 3 -> /home/rabbit/devel/tapwtf/test.t l-wx------ 1 rabbit rabbit 64 Nov 2 20:48 8 -> pipe:[1163262] Pid 1094 is the harness, 1111 and 1112 are the test subprocesses. Note how the harness hangs on a select, only looking at 1111 (the process waiting for the lock to be released), entirely ignoring 1112's handles. ... 1112 1446493701.800355 nanosleep({0, 199664000}, NULL) = 0 1111 1446493702.000280 <... nanosleep resumed> NULL) = 0 1112 1446493702.000400 write(2, "\n\n1112 acquiring lock\n\n", 23 <unfinished ...> 1111 1446493702.000433 write(2, "\n\n1111 acquiring lock\n\n", 23 <unfinished ...> 1112 1446493702.000505 <... write resumed> ) = 23 1094 1446493702.000528 <... select resumed> ) = 2 (in [5 7]) 1112 1446493702.000563 lseek(3, 671, SEEK_SET <unfinished ...> 1111 1446493702.000576 <... write resumed> ) = 23 1112 1446493702.000592 <... lseek resumed> ) = 671 1111 1446493702.000607 lseek(3, 671, SEEK_SET <unfinished ...> 1112 1446493702.000624 lseek(3, 0, SEEK_CUR <unfinished ...> 1111 1446493702.000636 <... lseek resumed> ) = 671 1112 1446493702.000650 <... lseek resumed> ) = 671 1111 1446493702.000660 lseek(3, 0, SEEK_CUR <unfinished ...> 1112 1446493702.000688 flock(3, LOCK_EX <unfinished ...> 1111 1446493702.000703 <... lseek resumed> ) = 671 1112 1446493702.000731 <... flock resumed> ) = 0 1111 1446493702.000741 flock(3, LOCK_EX <unfinished ...> 1112 1446493702.001010 write(1, "ok 00000001 Stuff 32 bytes long\n"..., 8192) = 8192 1094 1446493702.001206 times({tms_utime=18, tms_stime=2, tms_cutime=0, tms_cstime=0}) = 1724715580 1112 1446493702.001340 write(1, "ok 00000257 Stuff 32 bytes long\n"..., 8192) = 8192 1094 1446493702.001453 select(8, [3 5], NULL, NULL, NULL) = 1 (in [5]) 1094 1446493702.001601 read(5, "\n\n1111 acquiring lock\n\n", 65536) = 23 1094 1446493702.001663 brk(0x18db000) = 0x18db000 1094 1446493702.001718 write(2, "\n\n1111 acquiring lock\n\n", 23 <unfinished ...> 1112 1446493702.001731 write(1, "ok 00000513 Stuff 32 bytes long\n"..., 8192 <unfinished ...> 1094 1446493702.001750 <... write resumed> ) = 23 1112 1446493702.001760 <... write resumed> ) = 8192 1094 1446493702.001783 select(8, [3 5], NULL, NULL, NULL <unfinished ...> 1112 1446493702.002034 write(1, "ok 00000769 Stuff 32 bytes long\n"..., 8192) = 8192 1112 1446493702.002363 write(1, "ok 00001025 Stuff 32 bytes long\n"..., 8192) = 8192 1112 1446493702.002705 write(1, "ok 00001281 Stuff 32 bytes long\n"..., 8192) = 8192 1112 1446493702.003052 write(1, "ok 00001537 Stuff 32 bytes long\n"..., 8192) = 8192 1112 1446493702.003435 write(1, "ok 00001793 Stuff 32 bytes long\n"..., 8192) = 8192 1112 1446493702.003785 write(1, "ok 00002049 Stuff 32 bytes long\n"..., 8192 <unfinished ...> 1111 1446493721.777774 <... flock resumed> ) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) 1111 1446493721.777840 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- ...

Subject:

test.t

use Fcntl ':flock'; use Time::HiRes qw( time sleep ); # kill everything after some time $SIG{ALRM} = sub { kill 'KILL', 0 }; alarm( 20 ); # start both tests on the "verge of a second" sleep (int(time) + 1 - time ); # Writing something to stderr is crucial - without this no deadlock will occur print STDERR "\n\n$$ acquiring lock\n\n"; # Simply wait for a shared resource before continuing # # not using Fcntl to make strace easier to read # may not work on non-linux OS-es flock( *DATA, LOCK_EX ) or die "Unable to lock: $!"; my $max = 4100; # 4000 works on my system printf "ok %08d Stuff 32 bytes long\n", $_ for 1..$max; print "1..$max\n"; exit 0; __DATA__ Lock lock - who's there?