> Regards,
> Shawn.
>
>
>
> On Mon Jan 09 11:52:40 2012, Roderick.Garton@utas.edu.au wrote:
> > Hello Shlomi Fish.
> >
> > Thank you for your thoughts. I carry the conversation along, below.
> >
> > ----- Original Message -----
> > From: Shlomi Fish via RT <bug-Statistics-Descriptive@rt.cpan.org>
> > Date: Sunday, January 8, 2012 23:18
> > Subject: [rt.cpan.org #73473] handling nulls as zeroes - all stats are off
> > To: RGARTON@cpan.org
> >
> > > <URL: https://rt.cpan.org/Ticket/Display.html?id=73473 >
> > >
> > > Hi Roderick,
> > >
> > > On Sat Dec 24 18:27:40 2011, RGARTON wrote:
> > > > Neat to see recent edits to this module, Statistics::Descriptive,
> > > > knowing it's alive and well.
> > > >
> > >
> > > You're welcome.
> > >
> > > > Concerned about how it treats null values and invalid chars. Seems that,
> > > > if present in a list, they are treated the same as normal chars. This
> > > > can mean wrong stats of almost all types returned by the module. This
> > > > seems broken up to the latest version, 3.0203.
> > >
> > > Well, undef() values (I assume that's what you mean by nulls) and
> >
> > Actually, I mean NULL as in SQL, with a wider definition than "undef":
> >
> > "NULL is a special kind of value that actually has no value. It ...
> > signifies that no value is contained within that column for a given
> > row. NULL values are used where the actual value is either not
> > known or not meaningful" (Descartes & Bunce, 2000, "Programming the
> > Perl DBI," US: O'Reilly, p. 57).
> >
> > So undefined or invalid by context.
> >
> > > non-numeric strings are treated as zeros by Perl in numeric context, so
> > > this is expected. We are not making a check to see if every datum is
> > > numeric.
> >
> > So a Perl Statistics module can't rely on Perl's "numeric context" but
> > has to go a step further. This is expected. Yes, if $x = 'A' and
> > $y = $x + 1, then $y = 1, but a line of arithmetical data is not the
> > same as a set of statistical data. The mean of ('A', 1) is not 0.5,
> > as Statistics::Descriptive (weirdly) asserts.
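
The point about ('A', 1) can be reproduced in plain Perl. The following is a minimal sketch that does not call Statistics::Descriptive itself; it only imitates the numeric-context arithmetic discussed above, using the core modules List::Util and Scalar::Util. The variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use List::Util qw(sum);
use Scalar::Util qw(looks_like_number);

my @data = ('A', 1);

# Numeric context: 'A' is coerced to 0, so it still counts as a datum.
# (Without the "no warnings" line Perl would emit a non-numeric warning.)
my $naive_mean = do {
    no warnings 'numeric';
    sum(map { $_ + 0 } @data) / @data;
};

# Alternative: keep only defined values that look like numbers.
my @valid = grep { defined($_) && looks_like_number($_) } @data;
my $valid_mean = @valid ? sum(@valid) / @valid : undef;

printf "naive mean: %s (n=%d)\n", $naive_mean, scalar @data;
printf "valid mean: %s (n=%d)\n", $valid_mean, scalar @valid;
```

The first figure treats 'A' as a zero; the second ignores it, which appears to be the behaviour being asked for in this ticket.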
> >
> > >
> > > One option would be to implement a wrapper method on top of
> > > ->add_data(....) that will check all data for being numeric and only
> > > process the valid ones. This will give you a "wrong" count, which will
> > > exclude the wrong data being inserted. Would this be acceptable to you?
> >
> > When a solution yields a wrong count, no, that's not an acceptable
> > solution.
> >
> > > >
> > > > The attached script shows how the module returns means (and so other
> > > > stats) in ways that mightn't be expected in cases when there are invalid
> > > > or indefinite "values" in the list. So a list of values that includes an
> > > > "undef" value, or the empty string '', or anything not-a-number (e.g.,
> > > > 'x') returns the same mean as if all these "values" had been "0" - not
> > > > as if there was no value there at all. That's different to what would be
> > > > expected; or at least it's different to what happens in corporate
> > > > software, like SPSS and Excel - where, as expected, the "invalid" and
> > > > "null" values are not treated like zeroes, don't affect the count of the
> > > > elements; they're treated as if they didn't exist. The module
> > > > Statistics::Descriptive does throw up one or another kind of error
> > > > message in these cases, but it still returns a value, and so one that
> > > > might be erroneous but slips into work.
> > >
> > > It's not an error - it's a warning (it does not terminate the process,
> > > which makes it worse).
> >
> > Ok, a warning is not a child of error.
> >
> > >
> > > >
> > > > (The module Statistics::ANOVA depends on Statistics::Descriptive for
> > > > some stats, but uses Scalar::Util "looks_like_number" to avoid the
> > > > valid-null problem; maybe Params::Check::is_number would be better.
> > > > Actually, this only first came to my notice when a bioinformatics person
> > > > wrote to me about Statistics::ANOVA module yielding different results to
> > > > R's anova - and that was because, again, this person had invalid or null
> > > > values in their data, which, because I relied on
> > > > Statistics::Descriptive, weren't being "ignored" as the researcher
> > > > expected.)
> > > >
> > >
> > > I see.
> > >
> > > > I'd like to hear thoughts on this. At least to document something about
> > > > it in the manpage?
> > >
> > > If it's OK, I'll do that and implement the ->add_numeric_data() method
> > > like I proposed.
> >
> > Everyone agrees that a statistics module of basic descriptives should
> > return the correct mean (which is currently wrong, by
> > Statistics::Descriptive, if there are NULL values) AND to return
> > the correct count (which is currently right by
> > Statistics::Descriptive, even if there are NULL values). So, if we
> > are to opt for conditional wrappers as solutions, a wrapper like
> > "count_valid()" or even "count_all()" (including invalid) might be
> > more appropriate, at the end of the line, than a wrapper about
> > what type of data are entered in the first place. So adding any
> > type of data is permitted, but only defined numerical data are
> > used in the calculations. Again, the data stored by the
> > "add_data()" method are not necessarily the data that are used
> > when, say, calling "standard_deviation()". This seems reasonable,
> > and expected. (This is the default in commercial statistics
> > packages, but they have the "luxury" of requiring you to specify
> > whether your data are numeric or categorical before they will even
> > run a statistical test.) An "add_anybloodything()" method might
> > wrap around the proper, default one, if necessary.
> >
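
One way to picture the proposal above (store everything, calculate only on defined numeric data, and report both counts) is the sketch below. The package name Sketch::LenientStats and its methods are hypothetical; none of this is the Statistics::Descriptive API, and the names "count_valid" and "count_all" are simply taken from the suggestion in the message.

```perl
use strict;
use warnings;

package Sketch::LenientStats;   # hypothetical, not part of Statistics::Descriptive

use List::Util qw(sum);
use Scalar::Util qw(looks_like_number);

sub new { my ($class) = @_; return bless { data => [] }, $class }

# Accept anything, including undef and non-numeric strings.
sub add_data {
    my ($self, @values) = @_;
    push @{ $self->{data} }, @values;
    return $self;
}

# Only defined, numeric-looking values take part in calculations.
sub _valid {
    my ($self) = @_;
    return grep { defined($_) && looks_like_number($_) } @{ $self->{data} };
}

sub count_all   { my ($self) = @_; return scalar @{ $self->{data} } }
sub count_valid { my ($self) = @_; my @v = $self->_valid; return scalar @v }

sub mean {
    my ($self) = @_;
    my @v = $self->_valid;
    return @v ? sum(@v) / @v : undef;   # undef when nothing valid was added
}

package main;

my $stat = Sketch::LenientStats->new;
$stat->add_data(2, undef, '', 'x', 4);
printf "all=%d valid=%d mean=%s\n",
    $stat->count_all, $stat->count_valid, $stat->mean;
```

Note that looks_like_number also accepts strings such as "Inf" and "NaN", so a stricter check may be wanted for real data.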
> > [Oh, another request: for an alias to the "standard_deviation" method
> > of "stdev", "std_dev" or similar.]
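
The alias request is usually a one-line typeglob assignment in Perl. The sketch below shows the idea on a hypothetical stand-in class, Sketch::Stats; it is an assumption about how such aliases could be added, not a description of how Statistics::Descriptive is written.

```perl
use strict;
use warnings;

package Sketch::Stats;   # hypothetical stand-in for a statistics class

use List::Util qw(sum);

sub new { my ($class, @data) = @_; return bless { data => [@data] }, $class }

# Sample standard deviation (n - 1 in the denominator).
sub standard_deviation {
    my ($self) = @_;
    my @d = @{ $self->{data} };
    return undef if @d < 2;
    my $mean = sum(@d) / @d;
    return sqrt( sum(map { ($_ - $mean) ** 2 } @d) / (@d - 1) );
}

# Typeglob assignment: both short names refer to the same subroutine.
*stdev   = \&standard_deviation;
*std_dev = \&standard_deviation;

package main;

my $s = Sketch::Stats->new(2, 4, 4, 4, 5, 5, 7, 9);
print $s->stdev == $s->standard_deviation ? "alias ok\n" : "alias broken\n";
```

Because the aliases share one subroutine, any later fix to standard_deviation applies to all three names.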
> >
> >
> >