Subject: | Variance and Standard Deviation use costly pseudo-variance, instead of computing real variance |
The current Statistics::Descriptive package uses a pseudo-variance algorithm that was popularized many years ago, which was touted as being able to produce its number without all the datapoints, unlike real variance. However, expanding out the variance sum, it's just as easy to calculate the real thing.
This patch reduces the cost of adding data points, it calculates true variance/standard deviation, and it allows for both variance computation methods (although it only allows the more common for the basis of standard
deviation). As it caches the variance result, reading variance and
standard deviation repeatedly per data point still results in a net
performance improvement over the original. Finally, it adds a clear
method, which reduced the runtime of a particular program which uses
this module far too heavily by around 2%. (This program spends about
60% of its time in Statistics::Descriptive.)
The basis for the change was simply expanding out the 'elegant form'
reported for computing variance, and finding out that the effective
equation is actually significantly simpler than the 'elegant form'.
Specifically, we get, on the numerator (denominator being trivial):
sum of (x(n)^2 - mean^2)
x(1)^2-mean^2+x(2)^2-mean^2+...+x(n)^2-mean^2
x(1)^2+x(2)^2+...+x(n)^2 - n*mean^2
(sum of x(n^2) - n*mean^2
After that, it was simply a matter of applying all of the
simplifications this allowed. I have not added tests for this, but
since there aren't already any standard deviation tests, it still
passes the tests.
Ed
--- Statistics-Descriptive-2.6/Descriptive.pm 2002/10/10 17:03:13 1.1
+++ Statistics-Descriptive-2.6/Descriptive.pm 2004/01/24 18:20:33 1.2
@@ -21,13 +21,12 @@
count => 0,
mean => 0,
sum => 0,
+ sumsq => 0,
variance => undef,
- pseudo_variance => 0,
min => undef,
max => undef,
mindex => undef,
maxdex => undef,
- standard_deviation => undef,
sample_range => undef,
);
@@ -45,7 +44,7 @@
sub add_data {
my $self = shift; ##Myself
my $oldmean;
- my ($min,$mindex,$max,$maxdex);
+ my ($min,$mindex,$max,$maxdex,$sum,$sumsq,$count);
my $aref;
if (ref $_[0] eq 'ARRAY') {
@@ -55,41 +54,84 @@
$aref = \@_;
}
+ ##If we were given no data, we do nothing.
+ return 1 if (!@{ $aref });
+
##Take care of appending to an existing data set
$min = (defined ($self->{min}) ? $self->{min} : $aref->[0]);
$max = (defined ($self->{max}) ? $self->{max} : $aref->[0]);
$maxdex = $self->{maxdex} || 0;
$mindex = $self->{mindex} || 0;
+ $sum = $self->{sum};
+ $sumsq = $self->{sumsq};
+ $count = $self->{count};
- ##Calculate new mean, pseudo-variance, min and max;
+ ##Calculate new mean, sumsq, min and max;
foreach ( @{ $aref } ) {
- $oldmean = $self->{mean};
- $self->{sum} += $_;
- $self->{count}++;
+ $sum += $_;
+ $sumsq += $_**2;
+ $count++;
if ($_ >= $max) {
$max = $_;
- $maxdex = $self->{count}-1;
+ $maxdex = $count-1;
}
if ($_ <= $min) {
$min = $_;
- $mindex = $self->{count}-1;
+ $mindex = $count-1;
}
- $self->{mean} += ($_ - $oldmean) / $self->{count};
- $self->{pseudo_variance} += ($_ - $oldmean) * ($_ - $self->{mean});
}
$self->{min} = $min;
$self->{mindex} = $mindex;
$self->{max} = $max;
$self->{maxdex} = $maxdex;
- $self->{sample_range} = $self->{max} - $self->{min};
- if ($self->{count} > 1) {
- $self->{variance} = $self->{pseudo_variance} / ($self->{count} -1);
- $self->{standard_deviation} = sqrt( $self->{variance});
- }
+ $self->{sample_range} = $max - $min;
+ $self->{sum} = $sum;
+ $self->{sumsq} = $sumsq;
+ $self->{mean} = $sum / $count;
+ $self->{count} = $count;
+ ##indicator the value is not cached. Variance isn't commonly enough
+ ##used to recompute every single data add.
+ $self->{variance} = undef;
return 1;
}
+sub standard_deviation {
+ my $self = shift; ##Myself
+ return undef if (!$self->{count});
+ return sqrt(variance($self));
+}
+
+##Return variance; if needed, compute and cache it.
+sub variance {
+ my $self = shift; ##Myself
+ my $div = @_ ? 0 : 1;
+ my $count = $self->{count};
+ if ($count < 1 + $div) {
+ return 0;
+ }
+
+ my $variance = $self->{variance};
+ if (!defined($variance)) {
+ $variance = ($self->{sumsq} - $count * $self->{mean}**2);
+ $variance /= $count - $div;
+ $self->{variance} = $variance;
+ }
+ return $variance;
+}
+
+##Clear a stat. More efficient than destroying an object and calling
+##new.
+sub clear {
+ my $self = shift; ##Myself
+ my $key;
+
+ return if (!$self->{count});
+ while (my($field, $value) = each %fields) {
+ $self->{$field} = $value;
+ }
+}
+
sub AUTOLOAD {
my $self = shift;
my $type = ref($self)
@@ -135,6 +177,27 @@
return $self;
}
+##Clear a stat. More efficient than destroying an object and calling
+##new.
+sub clear {
+ my $self = shift; ##Myself
+ my $key;
+
+ return if (!$self->{count});
+ foreach $key (keys %{ $self }) { # Check each key in the object
+ # If it's a reserved key for this class, keep it
+ next if exists $self->{'_reserved'}->{$key};
+ # If it comes from the base class, keep it
+ next if exists $self->{'_permitted'}->{$key};
+ delete $self->{$key}; # Delete the out of date cached key
+ }
+ $self->SUPER::clear();
+ if (exists($self->{data})) {
+ $self->{data} = [];
+ $self->{presorted} = 0;
+ }
+}
+
sub add_data {
my $self = shift;
my $key;
@@ -466,6 +529,16 @@
=item $stat = Statistics::Descriptive::Sparse->new();
Create a new sparse statistics object.
+
+=item $stat->clear();
+
+Effectively the same as
+
+ my $class = ref($stat);
+ undef $stat;
+ $stat = new $class;
+
+except more efficient.
=item $stat->add_data(1,2,3);