
This queue is for tickets about the Sys-Statistics-Linux CPAN distribution.

Report information
The Basics
Id: 114318
Status: rejected
Priority: 0
Queue: Sys-Statistics-Linux

People
Owner: Nobody in particular
Requestors: SIROW [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.66
Fixed in: (no value)



Subject: Sys::Statistics::Linux::CpuStats - Negative zero idle-time
Greetings, some of our servers are having issues with the output of Sys::Statistics::Linux::CpuStats when having long idle-times. Details: -------- We are monitoring our servers cpu-usage via the nrpe_cpu Icinga 2 plugin. When our servers have been idle for a 'some time', we noticed nrpe_cpu started warning about: 'CPU CRITICAL : idle -0.00%' Which of course is wrong, since the server is 100% idle. We confirmed this by looking at the output of the top command: %Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st We tracked this problem down inside the check_linux_stats.pl script used by the nrpe_cpu plugin: [...] Sys::Statistics::Linux->new(cpustats => 1) my $stat = $lxs->get my $cpu = $stat->cpustats->{cpu}; my $cpu_used=sprintf("%.2f", (100-$cpu->{idle})); The issue is, that $cpu->{idle} returns '-0.0' on systems that have been idle for 'some time', while the correct value should be 100.0, which looks a lot like a Overflow or Float-Precision problem... Additional Information: ----------------------- * OS is Debian Jessie (confirmed) and wheezy (unconfirmed) * Sys::Statistics::Linux::CpuStats version is 0.66-1 * Seems to only happen with VMs hosted via Ganeti but not VMware (not 100% confirmed) * For /proc/stats output, see attachment 1 * The full perl script can be found at https://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_linux_stats/details
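The arithmetic behind the symptom can be sketched as follows. This is a Python illustration of delta-based /proc/stat percentages, not the module's actual Perl code; the snapshot numbers are made up, except the steal counter, which is taken from Server 1 in the attachment:

```python
# Percentages are derived from deltas between two /proc/stat snapshots:
# each field gets 100 * delta / total_delta.
def cpu_percentages(prev, curr):
    """prev/curr: dicts of raw /proc/stat jiffy counters."""
    deltas = {k: curr[k] - prev[k] for k in prev}
    total = sum(deltas.values())
    return {k: 100.0 * d / total for k, d in deltas.items()}

# If a buggy kernel makes the 'steal' counter jump backwards between the
# two snapshots, the steal delta goes negative, the total delta drops
# below the idle delta, and idle comes out just above 100%:
prev = {"user": 0, "system": 0, "idle": 0,      "steal": 400985488789}
curr = {"user": 0, "system": 0, "idle": 100000, "steal": 400985488788}

pct = cpu_percentages(prev, curr)
used = "%.2f" % (100 - pct["idle"])   # what check_linux_stats.pl prints
# pct["idle"] is about 100.001, so 'used' formats as "-0.00"
```

With an idle percentage a hair above 100, sprintf's "%.2f" rounds the tiny negative difference to the reported "-0.00".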
Subject: proc-stats.txt
Server 1: --------- cpu 5237875 0 1793576 125738733 57900 7 17247 400985488789 0 0 cpu0 5237875 0 1793576 125738733 57900 7 17247 400985488789 0 0 intr 87280124 69 10 0 0 0 0 37 0 1 0 0 0 144 0 0 1332878 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 19443200 92 0 2157006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ctxt 157389522 btime 1461584686 processes 5308074 procs_running 1 procs_blocked 0 softirq 141600284 1 59706870 0 20058503 666290 0 16 0 24709 61143895 Server 2: --------- cpu 6689010 0 2764422 121984554 93607 6 25231 1688229435281 0 0 cpu0 6689010 0 2764422 121984554 93607 6 25231 1688229435281 0 0 intr 126876902 68 10 0 0 0 0 37 0 1 0 0 0 144 0 0 1332924 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3718903 10 22702415 154 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ctxt 354138606 btime 1461584691 processes 8132214 procs_running 1 procs_blocked 0 softirq 177692670 1 69654374 0 23253334 666313 0 16 0 44914 84073718 Server 3: --------- cpu 5238739 0 2104497 125144495 85309 3 19688 761459701703 0 0 cpu0 5238739 0 2104497 125144495 85309 3 19688 761459701703 0 0 intr 128429341 58 10 0 0 0 0 37 0 1 0 0 0 144 0 0 1332990 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 22627551 150 0 3945448 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ctxt 347391550 btime 1461584694 processes 8096784 procs_running 1 procs_blocked 0 softirq 165901986 1 61674565 0 23166530 666346 0 16 0 45676 80348852 Server 4: --------- cpu 4802066 0 1567795 126612640 73042 3 14956 355045411803 0 0 cpu0 4802066 0 1567795 126612640 73042 3 14956 355045411803 0 0 intr 100055222 68 10 0 0 0 0 37 0 1 0 0 0 144 0 0 1333010 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 18888912 84 0 3608981 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ctxt 192808525 btime 1461584699 processes 4396853 procs_running 1 procs_blocked 0 softirq 131900828 1 56931739 0 19009427 666356 0 16 0 35904 55257385 Server 5: --------- cpu 4357786 0 1275146 127874123 52442 3 13532 836771065724 0 0 cpu0 4357786 0 1275146 127874123 52442 3 13532 836771065724 0 0 intr 82165541 68 10 0 0 0 0 37 0 1 0 0 0 144 0 0 1333040 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 18307157 72 0 1307984 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ctxt 146338623 btime 1461584703 processes 4076429 procs_running 1 procs_blocked 0 softirq 125428734 1 54328133 0 18426831 666371 0 16 0 21569 51985813 Server 6: --------- cpu 7169468 0 2224136 121606078 505769 5 20419 627571562152 0 0 cpu0 7169468 0 2224136 121606078 505769 5 20419 627571562152 0 0 intr 122643452 69 10 0 0 0 0 39 0 1 0 0 0 144 0 0 1333100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 22287194 128 0 3489440 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ctxt 236833737 btime 1461584709 processes 5768444 procs_running 2 procs_blocked 0 softirq 159175282 1 67765601 0 22407334 666393 0 16 0 33122 68302815
We traced this issue to a kernel bug that corrupts the 'steal time' counter when live-migrating our KVM-based VMs, as can be seen in the 8th column of the attached /proc/stat output. After upgrading our VM cluster hosts to kernel 4.5.3, the problem seems to be fixed. Possibly related: https://bugs.launchpad.net/linux/+bug/1494350 (issue was fixed in 4.4-rc1)
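Until the host kernel is fixed, the monitoring side can defend itself. A minimal sketch in Python (my own suggestion, not part of check_linux_stats.pl or the module): clamp each per-field delta at zero, so a counter that jumps backwards, like the broken steal time, cannot push another field's percentage above 100%:

```python
# Clamp negative per-field deltas to zero before computing percentages.
def cpu_percentages_clamped(prev, curr):
    """prev/curr: dicts of raw /proc/stat jiffy counters."""
    deltas = {k: max(curr[k] - prev[k], 0) for k in prev}
    total = sum(deltas.values()) or 1   # guard against a zero total
    return {k: 100.0 * d / total for k, d in deltas.items()}

# Snapshots where the broken steal counter goes backwards by one tick:
prev = {"user": 0, "system": 0, "idle": 0,      "steal": 400985488789}
curr = {"user": 0, "system": 0, "idle": 100000, "steal": 400985488788}

pct = cpu_percentages_clamped(prev, curr)
# pct["idle"] is exactly 100.0, so 100 - pct["idle"] formats as "0.00"
```

With the bogus steal delta discarded, the idle percentage stays at 100.0 and the "-0.00" artifact disappears.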