Bug #113853 for Unix-Uptime: Race condition in t/003-load.t

Mon Apr 18 03:33:42 2016 SREZIC [...] cpan.org - Ticket created

Subject:

Race condition in t/003-load.t

One of my smoker machines reported the following failure: # Failed test at t/003-load.t line 19. # got: '3.16' # expected: '3.06' # Failed test at t/003-load.t line 20. # got: '6.00' # expected: '5.94' # Looks like you failed 2 tests of 9. t/003-load.t .......... Dubious, test returned 2 (wstat 512, 0x200) Failed 2/9 subtests Looking at CPAN Testers results, it seems that this failure happens occasionally also on other systems: http://matrix.cpantesters.org/?dist=Unix-Uptime%200.4000 I think we see a race condition in t/003-load.t: you take the load averages twice, first through the perl module and then using the system command. Unfortunately time passes in between and the load average information may be updated when using the system command, so the values may differ.

Mon Apr 18 03:37:41 2016 SREZIC [...] cpan.org - Correspondence added

On 2016-04-18 03:33:42, SREZIC wrote: Show quoted text

> One of my smoker machines reported the following failure: > > # Failed test at t/003-load.t line 19. > # got: '3.16' > # expected: '3.06' > > # Failed test at t/003-load.t line 20. > # got: '6.00' > # expected: '5.94' > # Looks like you failed 2 tests of 9. > t/003-load.t .......... > Dubious, test returned 2 (wstat 512, 0x200) > Failed 2/9 subtests > > Looking at CPAN Testers results, it seems that this failure happens > occasionally also on other systems: > http://matrix.cpantesters.org/?dist=Unix-Uptime%200.4000 > > I think we see a race condition in t/003-load.t: you take the load > averages twice, first through the perl module and then using the > system command. Unfortunately time passes in between and the load > average information may be updated when using the system command, so > the values may differ.

Attached a possible fix.

Subject:

0001-avoid-race-condition-in-load-average-test-RT-113853.patch

From f82703b9e3cb8fb03fd2331e6739cc86b59b812e Mon Sep 17 00:00:00 2001 From: Slaven Rezic <slaven@rezic.de> Date: Mon, 18 Apr 2016 09:36:48 +0200 Subject: [PATCH] avoid race condition in load average test (RT #113853) --- t/003-load.t | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/t/003-load.t b/t/003-load.t index 19196c8..6889765 100644 --- a/t/003-load.t +++ b/t/003-load.t @@ -2,20 +2,26 @@ use warnings; use strict; -use Test::More tests => 9; +use Test::More tests => 6; use Unix::Uptime; $ENV{LC_ALL} = 'C'; -ok my ($load1, $load5, $load15) = Unix::Uptime->load(), 'received a load avarage'; +my ($load1, $load5, $load15); +my ($pload1, $pload5, $pload15); + +for (1..2) { + ($load1, $load5, $load15) = Unix::Uptime->load(); + my $pretty_uptime = `uptime`; + ($pload1, $pload5, $pload15) = $pretty_uptime =~ /load\s+averages?:\s+(\d+\.?\d*),?\s+(\d+\.?\d*),?\s+(\d+\.?\d*)/i; + last if ($load1 == $pload1 && $load5 == $pload5 && $load15 == $pload15); +} + like $load1, qr/^\d+(\.\d+)?$/, 'load1 looks right'; like $load5, qr/^\d+(\.\d+)?$/, 'load5 looks right'; like $load15, qr/^\d+(\.\d+)?$/, 'load15 looks right'; -ok my $pretty_uptime = `uptime`; -ok my ($pload1, $pload5, $pload15) = $pretty_uptime =~ /load\s+averages?:\s+(\d+\.?\d*),?\s+(\d+\.?\d*),?\s+(\d+\.?\d*)/i - or diag "\$ uptime\n$pretty_uptime"; is $load1, $pload1; is $load5, $pload5; is $load15, $pload15; -- 2.1.4

Mon Apr 18 06:57:28 2016 PIOTO [...] cpan.org - Correspondence added

On Mon Apr 18 03:37:41 2016, SREZIC wrote: Show quoted text

> On 2016-04-18 03:33:42, SREZIC wrote:

> > One of my smoker machines reported the following failure: > > > > # Failed test at t/003-load.t line 19. > > # got: '3.16' > > # expected: '3.06' > > > > # Failed test at t/003-load.t line 20. > > # got: '6.00' > > # expected: '5.94' > > # Looks like you failed 2 tests of 9. > > t/003-load.t .......... > > Dubious, test returned 2 (wstat 512, 0x200) > > Failed 2/9 subtests > > > > Looking at CPAN Testers results, it seems that this failure happens > > occasionally also on other systems: > > http://matrix.cpantesters.org/?dist=Unix-Uptime%200.4000 > > > > I think we see a race condition in t/003-load.t: you take the load > > averages twice, first through the perl module and then using the > > system command. Unfortunately time passes in between and the load > > average information may be updated when using the system command, so > > the values may differ.

> > Attached a possible fix.

Unfortunately, while I agree that there are problems with the current test, and I've seen the failures you've mentioned, I think that your change weakens the confidence that the tests give me that I'm actually doing something sensible. Now it's just checking that the code spits out the same numbers twice in a row. I'd rather have something like a "fuzzy match" against the values parsed from `uptime`... perhaps some sort of "equals, plus or minus 1.0" test could help improve the situation instead? Anyways, this reminds me that I need up update the bugtracker metadata on this module... at this point, I'd rather have bugs & pull requests submitted on GitHub: https://github.com/pioto/Unix-Uptime

Mon Apr 18 06:57:28 2016 The RT System itself - Status changed from 'new' to 'open'

Mon Apr 18 14:50:11 2016 SREZIC [...] cpan.org - Correspondence added

On 2016-04-18 06:57:28, PIOTO wrote: Show quoted text

> On Mon Apr 18 03:37:41 2016, SREZIC wrote:

> > On 2016-04-18 03:33:42, SREZIC wrote:

> > > One of my smoker machines reported the following failure: > > > > > > # Failed test at t/003-load.t line 19. > > > # got: '3.16' > > > # expected: '3.06' > > > > > > # Failed test at t/003-load.t line 20. > > > # got: '6.00' > > > # expected: '5.94' > > > # Looks like you failed 2 tests of 9. > > > t/003-load.t .......... > > > Dubious, test returned 2 (wstat 512, 0x200) > > > Failed 2/9 subtests > > > > > > Looking at CPAN Testers results, it seems that this failure happens > > > occasionally also on other systems: > > > http://matrix.cpantesters.org/?dist=Unix-Uptime%200.4000 > > > > > > I think we see a race condition in t/003-load.t: you take the load > > > averages twice, first through the perl module and then using the > > > system command. Unfortunately time passes in between and the load > > > average information may be updated when using the system command, > > > so > > > the values may differ.

> > > > Attached a possible fix.

> > Unfortunately, while I agree that there are problems with the current > test, and I've seen the failures you've mentioned, I think that your > change weakens the confidence that the tests give me that I'm actually > doing something sensible. Now it's just checking that the code spits > out the same numbers twice in a row.

This is not what the change does. The condition within the loop does a "pre-check" if the values returned by the system command and by Unix::Uptime are the same. If they are, then the loop will be prematurely ended and after that six of your unaltered test cases are run. Three test cases were removed, but I think these are not really needed: * the check that Unix::Uptime->load() returns a non-empty list --- the test script will fail anyway if any of $load1..$load15 are missing (because of the floating number check) * the checks that the system uptime returns anything and may be correctly parsed --- if these would fail, then the later "is $load1, $pload1" etc. checks would also fail. And what if the condition in the loop is false? Then uptime and Unix::Uptime are immediately called again. My assumption is that load averages are calculated not so often, and https://en.wikipedia.org/wiki/Load_%28computing%29#Reckoning_CPU_load says it happens every five seconds on typical linux systems, which I can confirm on a linux and a non-linux (freebsd) system. So if the two calls happen within five seconds of the last call, then the values should be the same. This should be enough even on loaded/slow systems. Show quoted text

> I'd rather have something like a "fuzzy match" against the values > parsed from `uptime`... perhaps some sort of "equals, plus or minus > 1.0" test could help improve the situation instead?

+/-1.0 would make a big difference if it's 0.0 vs 1.0 on a single-cpu system. On the other hand, on multi-cpu machines it's possible that the load average drops/increases by a larger delta within five seconds. Show quoted text

> Anyways, this reminds me that I need up update the bugtracker metadata > on this module... at this point, I'd rather have bugs & pull requests > submitted on GitHub: https://github.com/pioto/Unix-Uptime

"git am 0001-avoid-race-condition-in-load-average-test-RT-113853.patch" will give you a git commit with full attribution.