Subject: Mistagging of abbreviations containing digits
Date: Fri, 16 May 2014 13:04:47 -0400
To: bug-Lingua-EN-Tagger [...] rt.cpan.org
From: Stuart Watt <stuart [...] morungos.com>
I found a tagging issue involving tokens containing digits. These are common in my data: e.g., “FGFR3” is tagged /CD because the token is classified as a number. The problem appears to be an unanchored regex for numbers in _classify_unknown_word, which matches a digit anywhere in the token rather than only at the start. Here’s the diff for Tagger.pm:
> @@ -582,7 +584,7 @@ sub _classify_unknown_word {
> } elsif(m/[\)\]\}]/o){ # Right brackets
> $word = "*RRB*";
>
> - } elsif (m/-?(?:\p{IsDigit}+(?:\.\p{IsDigit}*)?|\.\p{IsDigit}+)/){ # Floating point number
> + } elsif (m/^-?(?:\p{IsDigit}+(?:\.\p{IsDigit}*)?|\.\p{IsDigit}+)/){ # Floating point number
> $word = "*NUM*";
>
> } elsif (m/^\p{IsDigit}+[\p{IsDigit}\/:-]+\p{IsDigit}/){ # Other number constructs
This is the effect on the output:
Before: Mutations/NNS in/IN the/DET RAS/NNP and/CC PIK3CA/CD genes/NNS were/VBD...
After: Mutations/NNS in/IN the/DET RAS/NNP and/CC PIK3CA/NNP genes/NNS were/VBD...
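To see the difference in isolation, here is a minimal sketch that runs both versions of the pattern from the diff against a few tokens. The helper looks_numeric and the token list are hypothetical, for illustration only; they are not part of Lingua::EN::Tagger.

```perl
use strict;
use warnings;

# Both patterns are copied from the diff; only the leading ^ differs.
my $unanchored = qr/-?(?:\p{IsDigit}+(?:\.\p{IsDigit}*)?|\.\p{IsDigit}+)/;
my $anchored   = qr/^-?(?:\p{IsDigit}+(?:\.\p{IsDigit}*)?|\.\p{IsDigit}+)/;

# Hypothetical helper (an assumption for this sketch, not a Tagger.pm function):
# returns 1 if the token matches the given pattern, 0 otherwise.
sub looks_numeric {
    my ($token, $pattern) = @_;
    return $token =~ $pattern ? 1 : 0;
}

# Two gene symbols that contain a digit, followed by three genuine numbers.
for my $token (qw(FGFR3 PIK3CA 3.14 -42 .5)) {
    printf "%-8s unanchored=%d anchored=%d\n",
        $token,
        looks_numeric($token, $unanchored),
        looks_numeric($token, $anchored);
}
```

The two gene symbols match only the unanchored pattern, while the three real numbers match both. The patched pattern is anchored at the start only, so a token that begins with a digit (for example "3D") would still match it; whether that matters depends on the other checks in _classify_unknown_word.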
All the best
Stuart
--
Stuart Watt
stuart@morungos.com / twitter.com/morungos