Subject: apostrophe s in English stemmer
In Lingua::Stem::Snowball version 0.92, English words which end in apostrophe-s, such as "ranger's", lose the s but keep the apostrophe, so "ranger's" comes back as "ranger'". Working around this requires a wasteful preprocessing pass over the text to be stemmed, stripping every apostrophe-s instance with...
s/'s$//;
It also means that if you need to use the unmodified stemmable text for some other purpose, you must make a copy of the entire array.
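As a rough illustration of that workaround, here is a minimal sketch which copies the tokens and strips the apostrophe-s from the copies only. The helper name strip_apostrophe_s is hypothetical and not part of Lingua::Stem::Snowball; the sketch deliberately avoids calling the stemmer so that it runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Stem::Snowball): returns copies
# of the input tokens with a trailing apostrophe-s removed.
sub strip_apostrophe_s {
    my @copies = @_;        # copy, so the caller's tokens stay untouched
    s/'s$// for @copies;    # same substitution as above, applied per token
    return @copies;
}

my @stemmable = ( 'foo', "ranger's", 'bar' );
my @cleaned   = strip_apostrophe_s(@stemmable);

print "Original: @stemmable\n";    # still contains "ranger's"
print "Cleaned:  @cleaned\n";      # the list you would hand to the stemmer
```

The cleaned list is what would be passed to the stemmer, while @stemmable remains available for other purposes. The cost is exactly the extra pass and the extra copy described above.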
These problems have workarounds, albeit expensive ones. However, the workarounds only help a user who already knows about the stemmer's bizarre behavior. No one expects a user to enter "ranger'" into a search box. And although the Lingua::Stem module has its own quirks (e.g. deletion of any tokens containing digits), it handles the apostrophe-s as you would expect.
The preferred solution would be to change the behavior of the stemmer. If that is not possible, the documentation should inform the user that they must strip apostrophe-s themselves.
Here is a program which demonstrates the behavior.
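#!/usr/bin/perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# English stemmer
my $snowball = Lingua::Stem::Snowball->new( lang => 'en' );

# "ranger's" is the token that triggers the problem
my @stemmable = ( 'foo', "ranger's", 'bar' );
my @stemmed = $snowball->stem(\@stemmable);

# With version 0.92 the second token is printed as "ranger'"
print "Snowball: @stemmed\n";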