
This queue is for tickets about the Statistics-Descriptive CPAN distribution.

Report information
The Basics
Id: 9160
Status: resolved
Priority: 0/
Queue: Statistics-Descriptive

People
Owner: Nobody in particular
Requestors: tgape [...] comcast.net
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: (no value)
Fixed in: (no value)



Subject: Variance and Standard Deviation use costly pseudo-variance, instead of computing real variance
The current Statistics::Descriptive package uses a pseudo-variance algorithm that was popularized many years ago and was touted as being able to produce its result without keeping all the data points, unlike a naive computation of the real variance. However, if you expand the variance sum, the real thing is just as easy to calculate incrementally.

This patch reduces the cost of adding data points, computes the true variance and standard deviation, and allows both variance conventions (dividing by n-1 or by n), although only the more common n-1 form is used as the basis of the standard deviation. Because the variance result is cached, reading variance and standard deviation repeatedly per data point still results in a net performance improvement over the original. Finally, it adds a clear method, which reduced the runtime of a particular program that uses this module very heavily by around 2%. (That program spends about 60% of its time in Statistics::Descriptive.)

The basis for the change was simply expanding out the 'elegant form' usually given for computing variance and finding that the effective equation is significantly simpler. Specifically, working on the numerator (the denominator being trivial):

  sum of (x(i) - mean)^2
    = (x(1)^2 + x(2)^2 + ... + x(n)^2) - 2*mean*(x(1) + x(2) + ... + x(n)) + n*mean^2
    = (sum of x(i)^2) - 2*n*mean^2 + n*mean^2        [since sum of x(i) = n*mean]
    = (sum of x(i)^2) - n*mean^2

After that, it was simply a matter of applying all of the simplifications this allowed.

I have not added tests for this, but since there aren't already any standard deviation tests, it still passes the existing test suite.

Ed
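Before the diff itself, a minimal stand-alone sketch of the bookkeeping described above may help (illustration only, not the module's code; the names add_point and sample_variance are invented for this example). Each added value only updates a count, a sum, and a sum of squares, and the variance is derived from those three numbers on demand.

#!/usr/bin/perl
use strict;
use warnings;

# Minimal sketch of the sum/sumsq bookkeeping described above
# (illustration only, not the module code).
my ( $n, $sum, $sumsq ) = ( 0, 0, 0 );

# Adding a point is constant-time: no need to revisit earlier data.
sub add_point {
    my ($x) = @_;
    $n++;
    $sum   += $x;
    $sumsq += $x**2;
}

# Sample variance from the expanded form: (sumsq - n*mean^2) / (n - 1)
sub sample_variance {
    return 0 if $n < 2;
    my $mean = $sum / $n;
    return ( $sumsq - $n * $mean**2 ) / ( $n - 1 );
}

add_point($_) for ( 2, 4, 4, 4, 5, 5, 7, 9 );
printf "variance = %g\n", sample_variance();          # 32/7 ~= 4.5714
printf "stddev   = %g\n", sqrt( sample_variance() );  # ~= 2.1381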
--- Statistics-Descriptive-2.6/Descriptive.pm  2002/10/10 17:03:13  1.1
+++ Statistics-Descriptive-2.6/Descriptive.pm  2004/01/24 18:20:33  1.2
@@ -21,13 +21,12 @@
   count => 0,
   mean => 0,
   sum => 0,
+  sumsq => 0,
   variance => undef,
-  pseudo_variance => 0,
   min => undef,
   max => undef,
   mindex => undef,
   maxdex => undef,
-  standard_deviation => undef,
   sample_range => undef,
   );
@@ -45,7 +44,7 @@
 sub add_data {
   my $self = shift;  ##Myself
   my $oldmean;
-  my ($min,$mindex,$max,$maxdex);
+  my ($min,$mindex,$max,$maxdex,$sum,$sumsq,$count);
   my $aref;
 
   if (ref $_[0] eq 'ARRAY') {
@@ -55,41 +54,84 @@
     $aref = \@_;
   }
 
+  ##If we were given no data, we do nothing.
+  return 1 if (!@{ $aref });
+
   ##Take care of appending to an existing data set
   $min    = (defined ($self->{min}) ? $self->{min} : $aref->[0]);
   $max    = (defined ($self->{max}) ? $self->{max} : $aref->[0]);
   $maxdex = $self->{maxdex} || 0;
   $mindex = $self->{mindex} || 0;
+  $sum    = $self->{sum};
+  $sumsq  = $self->{sumsq};
+  $count  = $self->{count};
 
-  ##Calculate new mean, pseudo-variance, min and max;
+  ##Calculate new mean, sumsq, min and max;
   foreach ( @{ $aref } ) {
-    $oldmean = $self->{mean};
-    $self->{sum} += $_;
-    $self->{count}++;
+    $sum   += $_;
+    $sumsq += $_**2;
+    $count++;
     if ($_ >= $max) {
       $max = $_;
-      $maxdex = $self->{count}-1;
+      $maxdex = $count-1;
     }
     if ($_ <= $min) {
       $min = $_;
-      $mindex = $self->{count}-1;
+      $mindex = $count-1;
    }
-    $self->{mean} += ($_ - $oldmean) / $self->{count};
-    $self->{pseudo_variance} += ($_ - $oldmean) * ($_ - $self->{mean});
  }
   $self->{min}    = $min;
   $self->{mindex} = $mindex;
   $self->{max}    = $max;
   $self->{maxdex} = $maxdex;
-  $self->{sample_range} = $self->{max} - $self->{min};
-  if ($self->{count} > 1) {
-    $self->{variance} = $self->{pseudo_variance} / ($self->{count} -1);
-    $self->{standard_deviation} = sqrt( $self->{variance});
-  }
+  $self->{sample_range} = $max - $min;
+  $self->{sum}   = $sum;
+  $self->{sumsq} = $sumsq;
+  $self->{mean}  = $sum / $count;
+  $self->{count} = $count;
+  ##indicator the value is not cached. Variance isn't commonly enough
+  ##used to recompute every single data add.
+  $self->{variance} = undef;
   return 1;
 }
 
+sub standard_deviation {
+  my $self = shift;  ##Myself
+  return undef if (!$self->{count});
+  return sqrt(variance($self));
+}
+
+##Return variance; if needed, compute and cache it.
+sub variance {
+  my $self = shift;  ##Myself
+  my $div = @_ ? 0 : 1;
+  my $count = $self->{count};
+  if ($count < 1 + $div) {
+    return 0;
+  }
+
+  my $variance = $self->{variance};
+  if (!defined($variance)) {
+    $variance = ($self->{sumsq} - $count * $self->{mean}**2);
+    $variance /= $count - $div;
+    $self->{variance} = $variance;
+  }
+  return $variance;
+}
+
+##Clear a stat. More efficient than destroying an object and calling
+##new.
+sub clear {
+  my $self = shift;  ##Myself
+  my $key;
+
+  return if (!$self->{count});
+  while (my($field, $value) = each %fields) {
+    $self->{$field} = $value;
+  }
+}
+
 sub AUTOLOAD {
   my $self = shift;
   my $type = ref($self)
@@ -135,6 +177,27 @@
   return $self;
 }
 
+##Clear a stat. More efficient than destroying an object and calling
+##new.
+sub clear {
+  my $self = shift;  ##Myself
+  my $key;
+
+  return if (!$self->{count});
+  foreach $key (keys %{ $self }) {  # Check each key in the object
+    # If it's a reserved key for this class, keep it
+    next if exists $self->{'_reserved'}->{$key};
+    # If it comes from the base class, keep it
+    next if exists $self->{'_permitted'}->{$key};
+    delete $self->{$key};  # Delete the out of date cached key
+  }
+  $self->SUPER::clear();
+  if (exists($self->{data})) {
+    $self->{data} = [];
+    $self->{presorted} = 0;
+  }
+}
+
 sub add_data {
   my $self = shift;
   my $key;
@@ -466,6 +529,16 @@
 =item $stat = Statistics::Descriptive::Sparse->new();
 
 Create a new sparse statistics object.
+
+=item $stat->clear();
+
+Effectively the same as
+
+  my $class = ref($stat);
+  undef $stat;
+  $stat = new $class;
+
+except more efficient.
 
 =item $stat->add_data(1,2,3);
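For orientation, here is a short usage sketch of the interface the patch adds, assuming the patch has been applied; the data values are arbitrary and the expected results are noted in comments.

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Descriptive;

# Usage sketch for the patched interface (assumes the patch above has
# been applied; the numbers are arbitrary).
my $stat = Statistics::Descriptive::Sparse->new();
$stat->add_data( 1, 2, 3, 4, 5 );

printf "mean     = %g\n", $stat->mean();               # 3
printf "variance = %g\n", $stat->variance();           # sample variance: 2.5
printf "stddev   = %g\n", $stat->standard_deviation(); # sqrt(2.5) ~= 1.5811

# Per the patch, passing any argument to variance() divides by n
# instead of n-1, i.e. selects the population form.

# clear() resets the object in place rather than recreating it.
$stat->clear();
printf "count    = %d\n", $stat->count();              # 0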
From: tgape [...] comcast.net
[guest - Sun Dec 26 22:06:22 2004]:
> The current Statistics::Descriptive package uses a pseudo-variance
> algorithm that was popularized many years ago, which was touted as
> being able to produce its number without all the datapoints, unlike
> real variance. However, expanding out the variance sum, it's just
> as easy to calculate the real thing.
Back before I first created this patch, I had noticed a skew in the standard deviation reported by a program that makes massive use of this package. I did not think of creating a test case at the time, because I could simply run my program on a particular dataset which demonstrated the issue fairly well. For some time, I ran both the 2.6 version and my version in parallel and saw that the deviations were different.

Since submitting this patch here, I've been trying to reproduce the issue, and I cannot. I'm therefore thinking this is not quite so important after all - especially since my analytic review isn't showing a difference either. I'm now wondering if it was a case of an accumulating rounding error.

My version does still work more efficiently, but as a mere optimization this would have a fairly low priority. If I can reproduce the error, I'll submit a test case for it.
From: tgape [...] comcast.net
The variance calculation in Statistics::Descriptive is actually correct, even if it is a bit more expensive. However, it is subject to a propagated rounding error, as it loses the original numbers. The simplified algorithm does not suffer from this issue. My difficulty in reproducing the problem was that I had since upgraded systems, and the newer system did not have the same rounding error as the prior one. It would still be nice to get the performance improvements of the patch.
The patch was applied in the trunk, thanks.
Hello Shlomi, Hello Ed,

I am not sure if reopening the ticket is the right way to go about this, but I would certainly like to bring my concern about this issue to your attention.

I do not agree with this patch. The problem with the current formula is that it is - albeit a bit cheaper - numerically very unstable. The sum of x**2 and n*mean**2 can both get very big. The variance is therefore a possibly small difference of two possibly very big numbers and might get very imprecise, even 0, due to numerical instability.

Consider a not at all unrealistic example where the values are around 1,000,000 but fluctuate to a much smaller degree. Then the values of x**2 are around 10**12, and if you add 1000 values, the sum is around 10**15. However, if the standard deviation is 1% - which is certainly something you would like to detect - then the variance is 10**8, which is 7 orders of magnitude smaller than the values you subtract from each other.

I have had a quick look at version 2.6. I have not verified that it does exactly the right thing, but it is clear that the algorithm in there remedies this numerical instability by using differences to the current estimate of the mean.

Best wishes,
Lutz
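This concern can be demonstrated in a few lines. The following stand-alone sketch (the data set is invented for illustration) compares the sum-of-squares formula from the patch with a running-mean (Welford-style) update of the kind version 2.6 relied on.

#!/usr/bin/perl
use strict;
use warnings;

# Stand-alone demonstration of the cancellation described above (the
# data are invented): values near 1,000,000 with a tiny spread.  The
# sum-of-squares formula subtracts two numbers of size ~1e15 and loses
# the answer; the running-mean update keeps its accuracy.
my @data = map { 1_000_000 + 0.01 * ( $_ % 2 ? 1 : -1 ) } 1 .. 1000;
my $n = scalar @data;

# Sum-of-squares form: variance = (sumsq - n * mean**2) / (n - 1)
my ( $sum, $sumsq ) = ( 0, 0 );
for my $x (@data) {
    $sum   += $x;
    $sumsq += $x**2;
}
my $mean      = $sum / $n;
my $naive_var = ( $sumsq - $n * $mean**2 ) / ( $n - 1 );

# Welford's update: accumulate squared deviations from the running mean
my ( $running_mean, $m2, $count ) = ( 0, 0, 0 );
for my $x (@data) {
    $count++;
    my $delta = $x - $running_mean;
    $running_mean += $delta / $count;
    $m2           += $delta * ( $x - $running_mean );
}
my $welford_var = $m2 / ( $count - 1 );

# Exact sample variance for this data set is 0.1 / 999 ~= 1.001e-4
printf "sum-of-squares variance: %.6g\n", $naive_var;
printf "running-mean variance:   %.6g\n", $welford_var;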
Hi LGEHLEN,

On Fri Nov 18 19:08:00 2011, LGEHLEN wrote:
> Hello Shlomi, Hello Ed,
> I am not sure if reopening the ticket is the right way to go about this,
It is not. Please file a new bug and provide a test case (and preferably also a patch) and some direction. I'm resolving this bug - please file a new bug instead of commenting.

Also see: https://rt.cpan.org/Public/Bug/Display.html?id=66712

Regards,

-- Shlomi Fish
> but I would certainly like to bring my concern about this issue to your
> attention.
>
> I do not agree with this patch. The problem with the current formula is
> that it is - albeit a bit cheaper - numerically very unstable. The
> problem is that the sum of x**2 and n*mean**2 can both get very big.
> Therefore, the variance is a possibly small difference of two possibly
> very big numbers and might therefore get very imprecise, even 0 due to
> numerical instability.
>
> Consider a not at all unrealistic example where the values are around
> 1,000,000, but fluctuate to a much smaller degree. Then the values of
> x**2 are around 10**12, if you add 1000 values, the sum is around
> 10**15. However, if the standard deviation is 1% - which is certainly
> something you would like to detect - then the variance is 10**8 which is
> 7 orders of magnitude smaller than the values you subtract from each
> other.
>
> I have had a quick look at version 2.6. I have not verified that it does
> exactly the right thing, but it is clear that the algorithm in there
> remedies this numerical instability by using differences to the current
> estimate of the mean.
>
> Best wishes,
> Lutz