Skip Menu |

This queue is for tickets about the Statistics-Descriptive CPAN distribution.

Report information
The Basics
Id: 58187
Status: resolved
Priority: 0/
Queue: Statistics-Descriptive

People
Owner: Nobody in particular
Requestors: SLAFFAN [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 3.0100
Fixed in: (no value)



Subject: patch for skewness and kurtosis functions
Attached is a diff file to add skewness and kurtosis functions in Statistics::Descriptive::Full (3.0100). It includes regression tests and POD updates. It would be nice to have these available via S::D::Sparse using higher order moments, but that can be patched in when developed. The OO inheritance means users won't notice anything when it happens. Shawn Laffan.
Subject: skew_and_kurt_patch.diff
Index: lib/Statistics/Descriptive.pm =================================================================== --- lib/Statistics/Descriptive.pm (revision 4433) +++ lib/Statistics/Descriptive.pm (working copy) @@ -10,7 +10,7 @@ ##Perl5. 01-03 weren't bug free. use vars (qw($VERSION $Tolerance)); -$VERSION = '3.0100'; +$VERSION = '3.0200'; $Tolerance = 0.0; @@ -247,6 +247,7 @@ __PACKAGE__->_make_private_accessors( [qw(data frequency geometric_mean harmonic_mean least_squares_fit median mode + skewness kurtosis ) ] ); @@ -599,6 +600,71 @@ return $self->_geometric_mean(); } +sub skewness { + my $self = shift; + + if (!defined($self->_skewness())) + { + my $n = $self->count(); + my $sd = $self->standard_deviation(); + + my $skew; + + # skip if insufficient records + if ( $sd && $n > 2) { + + my $mean = $self->mean(); + + my $sum_pow3; + + foreach my $rec ( $self->get_data ) { + my $value = (($rec - $mean) / $sd); + $sum_pow3 += $value ** 3; + } + + my $correction = $n / ( ($n-1) * ($n-2) ); + + $skew = $correction * $sum_pow3; + } + + $self->_skewness($skew); + } + + return $self->_skewness(); +} + +sub kurtosis { + my $self = shift; + + if (!defined($self->_kurtosis())) + { + my $kurt; + + my $n = $self->count(); + my $sd = $self->standard_deviation(); + + if ( $sd && $n > 3) { + + my $mean = $self->mean(); + + my $sum_pow4; + foreach my $rec ( $self->get_data ) { + $sum_pow4 += ( ($rec - $mean ) / $sd ) ** 4; + } + + my $correction1 = ( $n * ($n+1) ) / ( ($n-1) * ($n-2) * ($n-3) ); + my $correction2 = ( 3 * ($n-1) ** 2) / ( ($n-2) * ($n-3) ); + + $kurt = ( $correction1 * $sum_pow4 ) - $correction2; + } + + $self->_kurtosis($kurt); + } + + return $self->_kurtosis(); +} + + sub frequency_distribution_ref { my $self = shift; @@ -895,6 +961,19 @@ is called. Calling the method without an argument returns the value of the flag. +=item $stat->skewness(); + +Returns the skewness of the data. +A value of zero is no skew, negative is a left skewed tail, +positive is a right skewed tail. +This is consistent with Excel. + +=item $stat->kurtosis(); + +Returns the kurtosis of the data. +Positive is peaked, negative is flattened. + + =item $x = $stat->percentile(25); =item ($x, $index) = $stat->percentile(25); @@ -1170,6 +1249,10 @@ =over 4 +=item v3.0200 + +Add skewness and kurtosis. + =item v2.3 Rolled into November 1998 Index: t/descr.t =================================================================== --- t/descr.t (revision 4433) +++ t/descr.t (working copy) @@ -3,7 +3,7 @@ use strict; use warnings; -use Test::More tests => 23; +use Test::More; use Benchmark; use Statistics::Descriptive; @@ -372,3 +372,47 @@ ) } +{ + my $stat = Statistics::Descriptive::Full->new(); + + $stat->add_data(1 .. 9, 100); + + # TEST + is ($stat->skewness(), + 3.11889574523909, + 'Skewness of 3.11889574523909' + ); + + # TEST + is ($stat->kurtosis(), + 9.79924471616366, + 'Kurtosis of 9.79924471616366' + ); + +} + +{ + my $stat = Statistics::Descriptive::Full->new(); + + $stat->add_data(1,2); + my $def; + + # TEST + $def = defined $stat->skewness() ? 1 : 0; + is ($def, + 0, + 'Skewness is undef for 2 samples' + ); + + $stat->add_data (1); + + # TEST + $def = defined $stat->kurtosis() ? 1 : 0; + is ($def, + 0, + 'Kurtosis is undef for 3 samples' + ); + +} + +done_testing();
Hi Shawn, On Mon Jun 07 02:44:54 2010, SLAFFAN wrote: Show quoted text
> Attached is a diff file to add skewness and kurtosis functions in > Statistics::Descriptive::Full (3.0100). It includes regression tests > and POD updates. > > It would be nice to have these available via S::D::Sparse using higher > order moments, but that can be patched in when developed. The OO > inheritance means users won't notice anything when it happens. >
Here are some comments on your patch: 1. You should not bump the $VERSION in the patch. 2. You've added skewness and kurtosis to the private accessors, and you seem to memoise them but there doesn't appear to be a public interface to reset them. So they may be cached and not updated when further data is added. 3. You seem to have mixed tabs and spaces. Please keep the patch spaces-only and 4-spaces wide. 4. You should use in_between() in t/descr.t for testing the exact values of skewness and kurtosis, because exactly comparing floating point numbers is unreliable. 5. Please don't get rid of the plan and use "done_testing()". Instead, you should use Test-Count - http://search.cpan.org/dist/Test-Count/ . See: http://use.perl.org/~Alias/journal/39916 6. You also need to update the Changes. ---- Please correct the patch and resubmit. Regards, -- Shlomi Fish
Hi Shawn, On Mon Jun 07 02:44:54 2010, SLAFFAN wrote: Show quoted text
> Attached is a diff file to add skewness and kurtosis functions in > Statistics::Descriptive::Full (3.0100). It includes regression tests > and POD updates. > > It would be nice to have these available via S::D::Sparse using higher > order moments, but that can be patched in when developed. The OO > inheritance means users won't notice anything when it happens. > > Shawn Laffan.
I commented on your patch with some errata over a week ago, and yet you didn't respond. Please correct the patch according to my input, or I'll have to close this bug, and drop the patch. If you wish you can send a simple acknowledgement that you're dealing with it. Regards, -- Shlomi Fish
Hi Shlomi, Here's the revised patch. It's diffed against revision 4455 of the svn repository. In response to your comments: 1. Done. Version now left alone. 2. The code is modelled on the harmonic_mean sub. I've stepped through the code and checked that the values are cleared when new data are added. I've also added tests for this in descr.t. 3. Whoops. Better pay closer attention to my IDE in future. 4. is_between() is now used for the tests I added. I used a tolerance of +/-1e-13 around the expected which should provide plenty of precision while still allowing for floating point differences. 5. done_testing() approach removed and the test count updated instead. 6. Change document updated. I've listed the version number in that as 3.0???, so it's up to you to change as you see fit. This version number place-holder text is also in the Revision History section of the POD. That said, one could argue that the revision history in the POD could be removed since the Change document is the up-to-date listing and the POD only has up to version 2.3 (released 1998). That's your call, though. Let me know if there are any more modifications needed. Regards, Shawn.
Subject: skew_and_kurt_patch_mk2.diff
Index: Changes =================================================================== --- Changes (revision 4455) +++ Changes (working copy) @@ -1,5 +1,12 @@ Revision history for Perl extension Statistics::Descriptive. + + + +3.0??? June 2010 + - Added skewness and kurtosis + - https://rt.cpan.org/Ticket/Display.html?id=58187 + 3.0102 June 15, 2010 - Add the $VERSION variable to Statistics::Descriptive::Sparse and Statistics::Descriptive::Full. This was done to silence the CPAN indexer. Index: lib/Statistics/Descriptive.pm =================================================================== --- lib/Statistics/Descriptive.pm (revision 4455) +++ lib/Statistics/Descriptive.pm (working copy) @@ -255,6 +255,7 @@ __PACKAGE__->_make_private_accessors( [qw(data frequency geometric_mean harmonic_mean least_squares_fit median mode + skewness kurtosis ) ] ); @@ -607,6 +608,71 @@ return $self->_geometric_mean(); } +sub skewness { + my $self = shift; + + if (!defined($self->_skewness())) + { + my $n = $self->count(); + my $sd = $self->standard_deviation(); + + my $skew; + + # skip if insufficient records + if ( $sd && $n > 2) { + + my $mean = $self->mean(); + + my $sum_pow3; + + foreach my $rec ( $self->get_data ) { + my $value = (($rec - $mean) / $sd); + $sum_pow3 += $value ** 3; + } + + my $correction = $n / ( ($n-1) * ($n-2) ); + + $skew = $correction * $sum_pow3; + } + + $self->_skewness($skew); + } + + return $self->_skewness(); +} + +sub kurtosis { + my $self = shift; + + if (!defined($self->_kurtosis())) + { + my $kurt; + + my $n = $self->count(); + my $sd = $self->standard_deviation(); + + if ( $sd && $n > 3) { + + my $mean = $self->mean(); + + my $sum_pow4; + foreach my $rec ( $self->get_data ) { + $sum_pow4 += ( ($rec - $mean ) / $sd ) ** 4; + } + + my $correction1 = ( $n * ($n+1) ) / ( ($n-1) * ($n-2) * ($n-3) ); + my $correction2 = ( 3 * ($n-1) ** 2) / ( ($n-2) * ($n-3) ); + + $kurt = ( $correction1 * $sum_pow4 ) - $correction2; + } + + $self->_kurtosis($kurt); + } + + return $self->_kurtosis(); +} + + sub frequency_distribution_ref { my $self = shift; @@ -903,6 +969,19 @@ is called. Calling the method without an argument returns the value of the flag. +=item $stat->skewness(); + +Returns the skewness of the data. +A value of zero is no skew, negative is a left skewed tail, +positive is a right skewed tail. +This is consistent with Excel. + +=item $stat->kurtosis(); + +Returns the kurtosis of the data. +Positive is peaked, negative is flattened. + + =item $x = $stat->percentile(25); =item ($x, $index) = $stat->percentile(25); @@ -1178,6 +1257,10 @@ =over 4 +=item v3.0??? + +Add skewness and kurtosis. + =item v2.3 Rolled into November 1998 Index: t/descr.t =================================================================== --- t/descr.t (revision 4455) +++ t/descr.t (working copy) @@ -3,7 +3,7 @@ use strict; use warnings; -use Test::More tests => 22; +use Test::More tests => 29; use Benchmark; use Statistics::Descriptive; @@ -346,3 +346,70 @@ ) } +{ + my $stat = Statistics::Descriptive::Full->new(); + my $expected; + + $stat->add_data(1 .. 9, 100); +#my $x = $stat->harmonic_mean; + # TEST + $expected = 3.11889574523909; + is_between ($stat->skewness(), + $expected - 1E-13, + $expected + 1E-13, + "Skewness of $expected +/- 1E-13" + ); + + # TEST + $expected = 9.79924471616366; + is_between ($stat->kurtosis(), + $expected - 1E-13, + $expected + 1E-13, + "Kurtosis of $expected +/- 1E-13" + ); + + $stat->add_data(100 .. 110); + + # now check that cached skew and kurt values are recalculated + + # TEST + $expected = -0.306705104889384; + is_between ($stat->skewness(), + $expected - 1E-13, + $expected + 1E-13, + "Skewness of $expected +/- 1E-13" + ); + + # TEST + $expected = -2.09839497356215; + is_between ($stat->kurtosis(), + $expected - 1E-13, + $expected + 1E-13, + "Kurtosis of $expected +/- 1E-13" + ); +} + +{ + my $stat = Statistics::Descriptive::Full->new(); + + $stat->add_data(1,2); + my $def; + + # TEST + $def = defined $stat->skewness() ? 1 : 0; + is ($def, + 0, + 'Skewness is undef for 2 samples' + ); + + $stat->add_data (1); + + # TEST + $def = defined $stat->kurtosis() ? 1 : 0; + is ($def, + 0, + 'Kurtosis is undef for 3 samples' + ); + +} +
Hi Shawn, thanks for the revised patch, I applied it after some small revisions and released Statistics-Descriptive-3.0200. On Fri Jun 18 00:47:27 2010, SLAFFAN wrote: Show quoted text
> Hi Shlomi, > > Here's the revised patch. It's diffed against revision 4455 of the svn > repository. > > In response to your comments: > > 1. Done. Version now left alone. > > 2. The code is modelled on the harmonic_mean sub. I've stepped through > the code and checked that the values are cleared when new data are > added. I've also added tests for this in descr.t. >
To avoid me having to see my points and your replies to them in different browser pane, and have to jump back and forth, next time please use rt.cpan.org's reply-to-comment feature and reply inline. See: http://en.wikipedia.org/wiki/Posting_style Regards, -- Shlomi Fish
On Fri Jun 18 03:31:45 2010, SHLOMIF wrote: Show quoted text
> Hi Shawn, > > thanks for the revised patch, I applied it after some small revisions > and released Statistics-Descriptive-3.0200. > > On Fri Jun 18 00:47:27 2010, SLAFFAN wrote:
> > Hi Shlomi, > > > > Here's the revised patch. It's diffed against revision 4455 of the svn > > repository. > > > > In response to your comments: > > > > 1. Done. Version now left alone. > > > > 2. The code is modelled on the harmonic_mean sub. I've stepped through > > the code and checked that the values are cleared when new data are > > added. I've also added tests for this in descr.t. > >
> > To avoid me having to see my points and your replies to them in > different browser pane, and have to jump back and forth, next time > please use rt.cpan.org's reply-to-comment feature and reply inline. See: > > http://en.wikipedia.org/wiki/Posting_style > > Regards, > > -- Shlomi Fish
No worries Shlomi. Regards, Shawn.
Resolving as adding a reply reopens the ticket automatically.