Bug #118691 for Lingua-EN-Numbers: num2en("00") returns "-zero" plus an undef warning

Thu Nov 10 06:11:33 2016 TIMB [...] cpan.org - Ticket created

Subject:

num2en("00") returns "-zero" plus an undef warning

This is due to this code in _int2en: return $D{$1 . '0'} . '-' . $D{$2}; combined with %D not having an entry for "00".

It seems reasonable for any number of 0's to be mapped to "zero"s, so "00" -> "zero-zero", "000" -> "zero-zero-zero".

p.s. Thanks for the code. Very handy for my current work.
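For readers who want to see the mechanism without the module installed, here is a minimal Python sketch of the quoted line. The table `D` is a cut-down stand-in for the module's %D and `two_digit_words` is a hypothetical name. It models only the two-digit branch quoted above, and it assumes that a missing hash entry behaves like an empty string, as Perl's undef does under string concatenation.

```python
# Cut-down, hypothetical stand-in for the module's %D lookup table.
D = {"0": "zero", "2": "two", "7": "seven", "40": "forty"}


def two_digit_words(s):
    """Mimic the quoted branch: tens word, hyphen, units word."""
    tens, units = s[0], s[1]
    # A missing key is modeled as "" here, mirroring how Perl's undef
    # stringifies to an empty string (and emits a warning).
    return D.get(tens + "0", "") + "-" + D.get(units, "")


print(two_digit_words("42"))  # forty-two
print(two_digit_words("00"))  # -zero
```

The lookup for the tens part builds the key "00", which has no entry, so only the hyphen and the units word survive.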

Thu Nov 10 09:38:44 2016 NEILB [...] cpan.org - Correspondence added

Yep, seems reasonable, I'll do a release in the next day or so. Cheers, Neil

Thu Nov 10 09:38:45 2016 The RT System itself - Status changed from 'new' to 'open'

Thu Nov 10 09:38:50 2016 NEILB [...] cpan.org - Taken

Sat Nov 12 09:43:33 2016 NEILB [...] cpan.org - Correspondence added

Sat down to look at this again, thinking "ah yeah, multiple leading zeroes should be compressed down to a single zero", and discovered that wasn't what you had suggested. And after thinking about it, it isn't clear what the right thing to do is.

Consider the following cases.

00.1 I think a person would say "nought point one" or "zero point one".
007 Ok, this is an intentionally funny case, but here people would say "oh oh seven" or "zero zero seven".
0700 Here I think someone might say "oh seven hundred"

So then I thought about what exactly is this module doing? Converting *numbers* (not digit strings, for example) into words.

So I think:

00.1 should be treated as 0.1
007 should be treated as 7
0700 should be treated as 700

What do you think?
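The "treat it as a number" reading above could be sketched as follows. This is an illustration in Python rather than the module's Perl, `normalize_number_string` is a hypothetical name, and it assumes unsigned input consisting of digits with at most one decimal point.

```python
def normalize_number_string(s):
    """Drop redundant leading zeros from the integer part of a digit string.

    Assumes unsigned input such as "007" or "00.1"; signs and exponents
    are not handled in this sketch.
    """
    int_part, dot, frac_part = s.partition(".")
    # Keep a single "0" when the integer part consists only of zeros.
    int_part = int_part.lstrip("0") or "0"
    return int_part + dot + frac_part


for text in ("00.1", "007", "0700", "00"):
    print(text, "->", normalize_number_string(text))
```

Under this reading "00" collapses to "0", so the original report would yield plain "zero".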

Sat Nov 12 14:53:37 2016 TIMB [...] cpan.org - Correspondence added


> Consider the following cases.
>
> 00.1 I think a person would say "nought point one" or "zero point one".
> 007 Ok, this is an intentionally funny case, but here people would say "oh oh seven" or "zero zero seven".
> 0700 Here I think someone might say "oh seven hundred"
>
> So then I thought about what exactly is this module doing? Converting
> *numbers* (not digit strings, for example) into words.
>
> So I think:
>
> 00.1 should be treated as 0.1
> 007 should be treated as 7
> 0700 should be treated as 700
>
> What do you think?

It hinges on the definition of "numbers", and I don't think there would be one solution that would suit all cases. More generally I'd suggest that the module is for converting *number-like strings* into a corresponding sequence of words that aims to match *what a human would say when reading that string*.

In my case I'm using it to normalize transcripts so I can compare them. Some transcripts are written by humans, and others by software, all interpreting the same audio.

I can see your point that the "num" in num2en suggests that the argument should be numeric (i.e. IV/NV) and arbitrary strings could be assumed to be converted to a number first, e.g. via +=0. If you take that approach then I think there's a clear need for an extra sub that takes a "number-like string" instead. That sub, or perhaps num2en with an extra param, could do the rough equivalent of

    if (m/^0/) {
        print "zero " while s/^0//;  # handle leading zeros
        $spell_out_each_digit = 1;   # new feature :)
    }

So 0700 would be "zero seven zero zero" not "zero seven hundred".

Tim.
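A rough sketch of that "number-like string" behaviour, written in Python purely for illustration. The names `number_like_to_words` and `spell_number` are hypothetical, and `spell_number` stands in for whatever routine handles ordinary numbers (for example, a wrapper around num2en). Words are joined with spaces, following the snippet above.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def number_like_to_words(s, spell_number):
    """Spell a digit string one digit at a time if it has a leading zero.

    Anything else is passed to spell_number, a caller-supplied function
    (an assumption of this sketch, not part of the module).
    """
    if s.isascii() and s.isdigit() and s.startswith("0"):
        return " ".join(DIGIT_WORDS[int(ch)] for ch in s)
    return spell_number(s)


def placeholder(s):
    # Placeholder fallback so the example runs on its own.
    return "<spelled as a number: " + s + ">"


print(number_like_to_words("0700", placeholder))  # zero seven zero zero
print(number_like_to_words("700", placeholder))
```

A bare "0" also takes the digit-by-digit path here and comes out as "zero".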

Fri Dec 02 06:40:39 2016 TIMB [...] cpan.org - Correspondence added

Another example I just encountered: times/durations like "18:07". Split on word boundaries, that's "18" then "07". The "18" returns "eighteen". The "07" returns "-seven" plus an undef warning. Returning "eighteen" then "zero-seven" would be fine.