Bug #95714 for Lingua-EN-Tagger: Mistagging of abbreviations containing digits

Fri May 16 13:05:12 2014 stuart [...] morungos.com - Ticket created

Subject:	Mistagging of abbreviations containing digits
Date:	Fri, 16 May 2014 13:04:47 -0400
To:	bug-Lingua-EN-Tagger [...] rt.cpan.org
From:	Stuart Watt <stuart [...] morungos.com>

I found a tagging issue involving tokens that contain digits. These occur often in my data: e.g., “FGFR3” is tagged /CD because the token is classified as a number. The problem appears to be an unanchored regex for numbers, which matches a digit anywhere in the token. Here’s the diff for Tagger.pm:

> @@ -582,7 +584,7 @@ sub _classify_unknown_word {
>      } elsif(m/[\)\]\}]/o){ # Right brackets
>          $word = "*RRB*";
>
> -    } elsif (m/-?(?:\p{IsDigit}+(?:\.\p{IsDigit}*)?|\.\p{IsDigit}+)/){ # Floating point number
> +    } elsif (m/^-?(?:\p{IsDigit}+(?:\.\p{IsDigit}*)?|\.\p{IsDigit}+)/){ # Floating point number
>          $word = "*NUM*";
>
>      } elsif (m/^\p{IsDigit}+[\p{IsDigit}\/:-]+\p{IsDigit}/){ # Other number constructs
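
For illustration only, the difference between the two patterns can be sketched outside the tagger. The snippet below is a rough Python translation of the Perl regexes in the diff, not code from Tagger.pm: `\d` stands in for `\p{IsDigit}` (an assumption), and `looks_numeric` is a hypothetical helper name.

```python
import re

# Python translations of the two Perl patterns from the diff.
# Assumption: \d stands in for \p{IsDigit}.
UNANCHORED = re.compile(r"-?(?:\d+(?:\.\d*)?|\.\d+)")   # before the patch
ANCHORED = re.compile(r"^-?(?:\d+(?:\.\d*)?|\.\d+)")    # after the patch


def looks_numeric(pattern, token):
    """Hypothetical helper: True if the pattern finds a match in the token."""
    return pattern.search(token) is not None


if __name__ == "__main__":
    # Print how each pattern treats a few sample tokens.
    for token in ["FGFR3", "PIK3CA", "42", "-3.14", ".5"]:
        print(token, looks_numeric(UNANCHORED, token), looks_numeric(ANCHORED, token))
```

With the unanchored pattern, a digit anywhere in the token is enough for a match, so “FGFR3” and “PIK3CA” both count as numbers. The anchored pattern only matches when the number starts the token. Since only the start is anchored, a token that begins with a digit and continues with letters would still match.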

This is the effect on the output.

Before: Mutations/NNS in/IN the/DET RAS/NNP and/CC PIK3CA/CD genes/NNS were/VBD...
After: Mutations/NNS in/IN the/DET RAS/NNP and/CC PIK3CA/NNP genes/NNS were/VBD...

All the best
Stuart

--
Stuart Watt
stuart@morungos.com / twitter.com/morungos

Fri Apr 03 07:36:10 2015 acoburn [...] cpan.org - Correspondence added

Resolved in the 0.25 release. Thanks for the report!

Fri Apr 03 07:36:10 2015 The RT System itself - Status changed from 'new' to 'open'

Fri Apr 03 07:36:14 2015 acoburn [...] cpan.org - Status changed from 'open' to 'resolved'

Fri Apr 03 07:36:21 2015 acoburn [...] cpan.org - Taken