This queue is for tickets about the Lingua-EN-Tagger CPAN distribution.

Report information
The Basics
Id: 95714
Status: resolved
Priority: 0/
Queue: Lingua-EN-Tagger

People
Owner: acoburn [...] cpan.org
Requestors: stuart [...] morungos.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Mistagging of abbreviations containing digits
Date: Fri, 16 May 2014 13:04:47 -0400
To: bug-Lingua-EN-Tagger [...] rt.cpan.org
From: Stuart Watt <stuart [...] morungos.com>
I found a tagging issue involving tokens containing digits. We find these a lot in my data: e.g., “FGFR3” is tagged /CD because the token is classified as a number. The problem appears to be an unanchored regex for numbers, and here’s the diff for Tagger.pm:
> @@ -582,7 +584,7 @@ sub _classify_unknown_word {
>      } elsif(m/[\)\]\}]/o){ # Right brackets
>          $word = "*RRB*";
>
> -    } elsif (m/-?(?:\p{IsDigit}+(?:\.\p{IsDigit}*)?|\.\p{IsDigit}+)/){ # Floating point number
> +    } elsif (m/^-?(?:\p{IsDigit}+(?:\.\p{IsDigit}*)?|\.\p{IsDigit}+)/){ # Floating point number
>          $word = "*NUM*";
>
>      } elsif (m/^\p{IsDigit}+[\p{IsDigit}\/:-]+\p{IsDigit}/){ # Other number constructs
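The effect of the added anchor can be checked outside the tagger. The following is a minimal sketch, not code from Tagger.pm: the two patterns are copied from the diff, while the variable names and the sample token list are illustrative assumptions.

```perl
use strict;
use warnings;

# Patterns copied from the diff; the variable names are hypothetical.
my $unanchored = qr/-?(?:\p{IsDigit}+(?:\.\p{IsDigit}*)?|\.\p{IsDigit}+)/;
my $anchored   = qr/^-?(?:\p{IsDigit}+(?:\.\p{IsDigit}*)?|\.\p{IsDigit}+)/;

# Sample tokens: two gene symbols from the report plus a few plain numbers.
for my $token (qw(FGFR3 PIK3CA 3.14 -42 .5)) {
    printf "%-7s unanchored=%-8s anchored=%s\n",
        $token,
        ($token =~ $unanchored ? "match" : "no match"),
        ($token =~ $anchored   ? "match" : "no match");
}
```

Without `^`, the pattern matches the digit inside “FGFR3” and “PIK3CA”; with `^`, only tokens that begin with an optional minus sign followed by a digit or a decimal point match. The patched pattern has no trailing anchor, so a token that merely starts with a digit would still match it.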
This is the effect on the output.

Before: Mutations/NNS in/IN the/DET RAS/NNP and/CC PIK3CA/CD genes/NNS were/VBD...
After: Mutations/NNS in/IN the/DET RAS/NNP and/CC PIK3CA/NNP genes/NNS were/VBD...

All the best
Stuart

--
Stuart Watt
stuart@morungos.com / twitter.com/morungos

Resolved in the 0.25 release. Thanks for the report!