
This queue is for tickets about the Statistics-Descriptive CPAN distribution.

Report information
The Basics
Id: 73473
Status: open
Priority: 0
Queue: Statistics-Descriptive

People
Owner: Nobody in particular
Requestors: RGARTON [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 3.0203
Fixed in: (no value)



Subject: handling nulls as zeroes - all stats are off
Neat to see recent edits to this module, Statistics::Descriptive, knowing it's alive and well. I'm concerned about how it treats null values and invalid characters. It seems that, if present in a list, they are treated the same as normal values, which can mean that almost all the statistics returned by the module are wrong. This appears broken up to the latest version, 3.0203.

The attached script shows how the module returns means (and so other statistics) in ways that might not be expected when there are invalid or indefinite "values" in the list. So a list of values that includes an undef value, the empty string '', or anything that is not a number (e.g., 'x') returns the same mean as if each of these "values" had been 0, not as if there were no value there at all. That's different from what would be expected; or at least it's different from what happens in commercial software like SPSS and Excel, where, as expected, the "invalid" and "null" values are not treated like zeroes and don't affect the count of the elements; they're treated as if they didn't exist. Statistics::Descriptive does throw up one or another kind of error message in these cases, but it still returns a value, one that might be erroneous but slips into work.

(The module Statistics::ANOVA depends on Statistics::Descriptive for some statistics, but uses Scalar::Util's "looks_like_number" to avoid the valid-null problem; maybe Params::Check's "is_number" would be better. Actually, this first came to my notice when a bioinformatics researcher wrote to me about Statistics::ANOVA yielding different results from R's anova, and that was because, again, this person had invalid or null values in their data which, because I relied on Statistics::Descriptive, weren't being "ignored" as the researcher expected.)

I'd like to hear thoughts on this. At the least, could something about it be documented in the manpage?
Subject: desc_test.pl
use strict;
use Statistics::Descriptive '3.0203';

my $desc;

$desc = Statistics::Descriptive::Full->new();
print "Data\t\tMeans\n";
$desc->add_data(1, 2, 3);
printf("(1,2,3)\t\t%s\n", $desc->mean());

$desc = Statistics::Descriptive::Full->new();
$desc->add_data(1, 2, 3, 0);
printf("(1,2,3,0)\t%s\n", $desc->mean());

$desc = Statistics::Descriptive::Full->new();
$desc->add_data(1, 2, 3, undef);
printf("(1,2,3,undef)\t%s\n", $desc->mean());

$desc = Statistics::Descriptive::Full->new();
$desc->add_data(1, 2, 3, '');
printf("(1,2,3,'')\t%s\n", $desc->mean());

$desc = Statistics::Descriptive::Full->new();
$desc->add_data(1, 2, 3, 'x');
printf("(1,2,3,'x')\t%s\n", $desc->mean());
Hi Roderick,

On Sat Dec 24 18:27:40 2011, RGARTON wrote:
> Neat to see recent edits to this module, Statistics::Descriptive,
> knowing it's alive and well.
You're welcome.
> Concerned about how it treats null values and invalid chars. Seems that,
> if present in a list, they are treated the same as normal chars. This
> can mean wrong stats of almost all types returned by the module. This
> seems broken up to the latest version, 3.0203.
Well, undef() values (I assume that's what you mean by nulls) and non-numeric strings are treated as zeros by Perl in numeric context, so this is expected. We are not checking that every datum is numeric. One option would be to implement a wrapper method on top of ->add_data(...) that checks all the data for being numeric and only processes the valid ones. This will give you a "wrong" count, which will exclude the invalid data being inserted. Would this be acceptable to you?
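[Editor's note: a minimal sketch of the kind of wrapper proposed above. The method name add_numeric_data() and its behaviour are assumptions drawn from this thread, not part of the shipped Statistics::Descriptive API.]

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use Statistics::Descriptive;

# Hypothetical wrapper: keep only defined, numeric-looking items, then
# hand them to the usual add_data(). Invalid items never enter the
# sample, so they affect neither the mean nor the count.
sub add_numeric_data {
    my ($desc, @data) = @_;
    my @clean = grep { defined($_) && looks_like_number($_) } @data;
    $desc->add_data(@clean);
    return scalar @clean;    # how many items were actually added
}

my $desc = Statistics::Descriptive::Full->new();
add_numeric_data($desc, 1, 2, 3, undef, '', 'x');
printf "mean=%s count=%d\n", $desc->mean(), $desc->count();   # mean=2 count=3
```

Note that this is exactly the "wrong count" trade-off discussed below: count() reports 3, not 6, because the invalid items were dropped before insertion.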
> The attached script shows how the module returns means (and so other
> stats) in ways that mightn't be expected in cases when there are invalid
> or indefinite "values" in the list. So a list of values that includes an
> "undef" value, or the empty string '', or anything not-a-number (e.g.,
> 'x') returns the same mean as if all these "values" had been "0" - not
> as if there was no value there at all. That's different to what would be
> expected; or at least it's different to what happens in corporate
> software, like SPSS and Excel - where, as expected, the "invalid" and
> "null" values are not treated like zeroes, don't affect the count of the
> elements; they're treated as if they didn't exist. The module
> Statistics::Descriptive does throw up one or another kind of error
> message in these cases, but it still returns a value, and so one that
> might be erroneous but slips into work.
It's not an error - it's a warning (it does not terminate the process, which makes it worse).
> (The module Statistics::ANOVA depends on Statistics::Descriptive for
> some stats, but uses Scalar::Util "looks_like_number" to avoid the
> valid-null problem; maybe Params::Check::is_number would be better.
> Actually, this only first came to my notice when a bioinformatics person
> wrote to me about Statistics::ANOVA yielding different results to R's
> anova - and that was because, again, this person had invalid or null
> values in their data, which, because I relied on Statistics::Descriptive,
> weren't being "ignored" as the researcher expected.)
I see.
> I'd like to hear thoughts on this. At least to document something about
> it in the manpage?
If it's OK, I'll do that and implement the ->add_numeric_data() method like I proposed.
Subject: Re: [rt.cpan.org #73473] handling nulls as zeroes - all stats are off
Date: Mon, 09 Jan 2012 08:52:24 -0800
To: bug-Statistics-Descriptive [...] rt.cpan.org
From: rgarton <Roderick.Garton [...] utas.edu.au>
Hello Shlomi Fish,

Thank you for your thoughts. I carry the conversation along below.
----- Original Message -----
From: Shlomi Fish via RT <bug-Statistics-Descriptive@rt.cpan.org>
Date: Sunday, January 8, 2012 23:18
Subject: [rt.cpan.org #73473] handling nulls as zeroes - all stats are off
To: RGARTON@cpan.org

> <URL: https://rt.cpan.org/Ticket/Display.html?id=73473 >
>
> Hi Roderick,
> Well, undef() values (I assume that's what you mean by nulls) and
Actually, I mean NULL as in SQL, with a wider definition than "undef": "NULL is a special kind of value that actually has no value. It ... signifies that no value is contained within that column for a given row. NULL values are used where the actual value is either not known or not meaningful" (Descartes & Bunce, 2000, "Programming the Perl DBI," US: O'Reilly, p. 57). So undefined or invalid by context.
> non-numeric strings are treated as zeros by Perl in numeric context, so
> this is expected. We are not making a check to see if every datum is
> numeric.
So a Perl statistics module can't rely on Perl's "numeric context" but has to go a step further. This is expected. Yes, if $x = 'A' and $y = $x + 1, then $y is 1; but a line of arithmetical data is not the same as a set of statistical data. The mean of ('A', 1) is not 0.5, as Statistics::Descriptive (weirdly) asserts.
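[Editor's note: a quick sketch of the numeric-context behaviour at issue; run under warnings to see the diagnostic Perl emits.]

```perl
use strict;
use warnings;

my $x = 'A';
my $y = $x + 1;        # warns: Argument "A" isn't numeric; 'A' becomes 0
print "$y\n";          # prints 1

# So a naive mean over ('A', 1) silently computes (0 + 1) / 2:
my @data = ('A', 1);
my $sum = 0;
$sum += $_ for @data;  # 'A' contributes 0 to the sum
printf "%.1f\n", $sum / @data;   # prints 0.5
```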
> One option would be to implement a wrapper method on top of
> ->add_data(....) that will check all data for being numeric and only
> process the valid ones. This will give you a "wrong" count, which will
> exclude the wrong data being inserted. Would this be acceptable to you?
When a solution yields a wrong count, no, that's not an acceptable solution.
> It's not an error - it's a warning (it does not terminate the process,
> which makes it worse).
Ok, a warning is not a child of error.
> If it's OK, I'll do that and implement the ->add_numeric_data() method
> like I proposed.
Everyone agrees that a statistics module of basic descriptives should return the correct mean (which Statistics::Descriptive currently gets wrong if there are NULL values) AND the correct count (which Statistics::Descriptive currently gets right, even if there are NULL values). So, if we are to opt for conditional wrappers as solutions, a wrapper like "count_valid()" or even "count_all()" (including invalid values) might be more appropriate, at the end of the line, than a wrapper on what type of data are entered in the first place.

So adding any type of data is permitted, but only defined numerical data are used in the calculations. That is, the data stored by the "add_data()" method are not necessarily the data that are used when, say, calling "standard_deviation()". This seems reasonable, and expected. (This is the default in commercial statistics packages, but they have the "luxury" of requiring you to specify whether your data are numeric or categorical before they will even run a statistical test.) An "add_anybloodything()" method might wrap around the proper, default one, if necessary.

[Oh, another request: an alias for the "standard_deviation" method, named "stdev", "std_dev" or similar.]
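[Editor's note: a sketch of the design Roderick describes here: accept anything in add_data(), filter at calculation time. The names count_valid() and count_all() are his suggestions, and the toy class below is an illustration, not the internals of Statistics::Descriptive.]

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

package My::Descriptive;

sub new      { bless { data => [] }, shift }
sub add_data { my $self = shift; push @{ $self->{data} }, @_; }

# Filter applied at calculation time, not at insertion time.
sub _valid      { grep { defined($_) && looks_like_number($_) } @{ $_[0]{data} } }
sub count_all   { scalar @{ $_[0]{data} } }          # everything ever added
sub count_valid { scalar( () = $_[0]->_valid ) }     # only usable data

sub mean {
    my @v = $_[0]->_valid;
    return undef unless @v;    # no valid data: no mean, rather than 0
    my $sum = 0;
    $sum += $_ for @v;
    return $sum / @v;
}

package main;

my $d = My::Descriptive->new;
$d->add_data(1, 2, 3, undef, '', 'x');
printf "mean=%s valid=%d all=%d\n", $d->mean, $d->count_valid, $d->count_all;
# mean=2 valid=3 all=6
```

The stored data and the computed-upon data differ, which is exactly the point: the mean matches SPSS/Excel behaviour while both counts remain available.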
Just to add my thoughts.

Ryan - the underlying cause is Perl's handling of non-numeric data when used in numeric operations. All the non-numeric data are treated as zeroes, so the results are correct as far as Perl is concerned: ('A', 1) == (0, 1). See http://www.perlmonks.org/?node_id=609478 for a general overview. (Sorry if you already knew this.)

As for the implementation, I'd lean towards a separate add_numeric_data() method. This is purely because I tend to use data that have already been cleaned, so repeated calls to looks_like_number within add_data() would slow down the code unnecessarily. For small data sets this is a non-issue, but I work with systems where add_data() might be called hundreds of thousands of times in an analysis, each time with 1000+ records.

Of course, an alternative is to implement the add_numeric_data() method for when one is confident about the data, such that it doesn't check for nulls, blanks, text, etc. That's perhaps the safer approach, as it results in fewer surprises for users of add_data().

Regards,
Shawn.
On Sun Feb 05 03:16:39 2012, SLAFFAN wrote:

> As for the implementation, I'd lean towards a separate
> add_numeric_data() method.
Yes, I agree with Shawn here. undef()s in Perl are not exactly NULLs in the languages you described, and filtering only the numeric data will slow down the code. Can Roderick provide a spec for how all the functions should behave, and when numeric_count() and including_non_numeric_data_count() should be used?

Regards,

-- Shlomi Fish
On Wed Feb 08 12:58:46 2012, SHLOMIF wrote:
> .... Can Roderick provide a spec for how all functions should behave
> and when the numeric_count() and the
> including_non_numeric_data_count() should be used.
The best specification here might be to take the data-management routines out of Statistics-Descriptive altogether and have it depend on a Statistics-Data module for them. I will try to register that module, but will need help, suggestions and ideas as to how to make that work. This module could also optionally serve any other stats modules, and so hopefully solve the problem that no two Perl statistics modules have the same data-handling routines, either by name or by behaviour. What do you all think?
Hi Roderick,

On Fri May 25 03:01:08 2012, RGARTON wrote:
> Best specification here might be to take out the data management
> routines from Statistics-Descriptive altogether and have it depend on a
> Statistics-Data module for that. I will try to register that module, but
> will need help, suggestions, ideas as to how to make that work.
You don't need to register it. Once you upload it to CPAN, you own the namespace, and http://metacpan.org/ , http://search.cpan.org/ and other indexers will be able to find it there.

Regarding suggestions for how to make it work: it shouldn't slow down Statistics::Descriptive (i.e., it should have a mode where a user can pass undef()s and they'll be treated as zeros), and the data-handling logic of Statistics-Descriptive is something that works nicely for it, so you may opt to emulate it.
> This module should also optionally serve any other stats modules - and
> so hopefully also solve the problem of no two Perl statistics modules
> having the same data-handling routines - neither by name nor by
> behavior. What do you all think?
Well, don't expect modules based on GNU R, the GNU Scientific Library (GSL) and the like to rely on Statistics::Data for that, so you're not solving the entire problem. Furthermore, this sounds like Joel Spolsky's "quarrelling kids" syndrome (see http://www.shlomifish.org/humour/fortunes/show.cgi?id=joel-ms-lost-api-war-1 ), which is also echoed in this xkcd: http://xkcd.com/927/ .

That put aside, if you have a good vision for what Statistics-Data should do, you can write it and upload it to CPAN, and if it is good enough, then Statistics::Descriptive can use it (or you can write Statistics::Descriptive::UsingData ;-)).

Regards,

-- Shlomi Fish
Thanks for the ideas and critique. I'll carry on with the Statistics::Data module, which I've been writing in any case to parcel out these types of operations from Statistics::ANOVA. I'm only thinking this approach might be useful for some modules in that Statistics family. Help in deciding on the best way to create the child/parent or other relationship between Statistics::Data and, say, Statistics::ANOVA would be useful. Can you recommend another relationship of this type among any Perl modules?
Subject: [rt.cpan.org #73473] handling nulls as zeroes - all stats are off
Date: Sat, 9 Jan 2016 23:49:55 +0200
To: bug-Statistics-Descriptive [...] rt.cpan.org
From: Meir Guttman <mguttman4 [...] gmail.com>
Hi all,

IMHO we need both "treat undef as zero" and "eliminate undef from the list" varieties. Here is one field in which both are required: stock-exchange data processing.

Some items require the substitution of zero for an 'undef'. For example, when computing the average daily dollar trading volume in a given share over a given period: the custom is to mark a no-trade day as NULL, not zero, and DBI returns an 'undef' in such a case. But no trading means zero trading for the average volume.

On the other hand, when computing the average or standard deviation of prices, say the day's high price for a share that was not traded that day, 'undef' items must not be taken as zero. No quote doesn't mean a zero price; such a datum must simply be excluded from the computation! And here, too, a no-trade day is depicted as NULL, hence DBI returns an 'undef'.

That's my two bits...

Regards,
Meir
Hi Meir!

On Sat Jan 09 16:50:20 2016, mguttman4@gmail.com wrote:
> Hi all,
>
> IMHO we need both "treat undef as zero" and "eliminate undef from the
> list" varieties. Here is one field in which both are required -
> stock-exchange data processing.
If I understand you correctly, both of these are easy to do using «map { $_ // 0 } @data» and «grep { defined($_) } @data». Is there anything stopping you from doing it this way? Regards, -- Shlomi Fish
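[Editor's note: a sketch of the two idioms suggested above, applied to Meir's two cases. Plain core Perl, computing the means directly rather than through the module; the sample figures are made up.]

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @volumes = (100, undef, 250);   # no-trade day comes back from DBI as undef

# Case 1: average daily volume - treat undef as zero trading.
my @vol_as_zero = map { $_ // 0 } @volumes;
printf "volume mean: %.2f\n", sum(@vol_as_zero) / @vol_as_zero;   # 116.67

# Case 2: average price - exclude undef days from the computation entirely.
my @prices  = (10.0, undef, 12.0);
my @defined = grep { defined($_) } @prices;
printf "price mean: %.2f\n", sum(@defined) / @defined;            # 11.00
```

Note the divisors differ: case 1 divides by all three days, case 2 by the two days that actually have a quote, which is the crux of the whole ticket.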
Subject: [rt.cpan.org #73473] ​handling nulls as zeroes - all stats are off
Date: Mon, 11 Jan 2016 23:51:28 +0200
To: bug-Statistics-Descriptive [...] rt.cpan.org
From: Meir Guttman <mguttman4 [...] gmail.com>
Hi Shlomi!

Yes indeed, that is what I decided to do. Under my own control, there's no need to fiddle with switches, etc. Thanks!

Regards, everybody,
Meir