
This queue is for tickets about the Text-Levenshtein CPAN distribution.

Report information
The Basics
Id: 97883
Status: resolved
Priority: 0/
Queue: Text-Levenshtein

People
Owner: NEILB [...] cpan.org
Requestors: olaf [...] wundersolutions.com
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 0.09
Fixed in: 0.11



Subject: Optionally use Unicode::Collate::eq()
Hi Neil!

Consider this example:

    use strict;
    use warnings;
    use utf8;
    use feature qw( say );

    use Text::Levenshtein qw( distance );
    use Unicode::Collate;

    binmode STDOUT, ':encoding(UTF-8)';

    my @cities = ( 'Swidnica', 'Świdnica' );

    my $collator = Unicode::Collate->new( normalization => undef, level => 1 );
    say $collator->eq( @cities ) ? 'exact match' : 'no match';
    say 'edit distance ' . distance( @cities );

    ### Output is the following:

    exact match
    edit distance 1

What I'm proposing is the ability to override the eq being used in this module so that words that are equivalent as ASCII return an edit distance of 0.

My use case is that I've got a pile of geographical data and I'm trying to see how accurate it is. Do the city names provided match the city names in the database, etc.?

It would be great if something like this could be accounted for when calculating edit distance. No idea if that violates the philosophy behind this code, but figured it was useful to ask.

Thanks,

Olaf
I should add that I am happy to provide a patch.
Hi Olaf,

Yeah, this seems like a good option to add. Something like:

    use Text::Levenshtein qw/ distance -ignore-diacritics /;

I'm just trying to think of the best way to do this, without compromising performance for the regular case.

Quickest solution is to have two functions

    fastdistance_with_diacritics
    fastdistance_without_diacritics

And then depending on which option you ask for,

    *fastdistance = \&fastdistance_without_diacritics

This optimises for performance, but there would be a lot of cut & paste.

In the perl 4 days, I would have built up the function as a string and eval'd it.

Suggestions?

Neil
Ok, that's a non-starter, as it would break where one bit of code in an app uses it without diacritics, and somewhere else in the same app uses it with diacritics.
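To make the problem concrete, here is a minimal sketch of why rebinding a single glob cannot serve two callers at once. The package name My::Distance and its import() are hypothetical stand-ins, not the module's real code, and the two functions are placeholders that only report which variant ran.

```perl
use strict;
use warnings;

package My::Distance;    # hypothetical stand-in for the real module

# Placeholder implementations: they only report which variant ran.
sub fastdistance_with_diacritics    { return 'with' }
sub fastdistance_without_diacritics { return 'without' }

# Assumed import(): picks a variant by rebinding one shared symbol.
sub import {
    my ($class, @opts) = @_;
    no warnings 'redefine';
    if (grep { $_ eq '-ignore-diacritics' } @opts) {
        *fastdistance = \&fastdistance_without_diacritics;
    }
    else {
        *fastdistance = \&fastdistance_with_diacritics;
    }
    return;
}

package main;

# Caller A loads the module with the default behaviour.
My::Distance->import();
my $before = My::Distance::fastdistance();

# Caller B, elsewhere in the same app, asks to ignore diacritics.
My::Distance->import('-ignore-diacritics');
my $after = My::Distance::fastdistance();

# Caller A's function has silently changed underneath it.
print "before: $before, after: $after\n";
```

Because the glob lives in the module's package rather than in each caller, the last import wins for everyone.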
Subject: Re: [rt.cpan.org #97883] Optionally use Unicode::Collate::eq()
Date: Fri, 8 Aug 2014 22:55:40 -0700
To: bug-Text-Levenshtein [...] rt.cpan.org
From: Josh Goldberg <josh [...] 3io.com>
optional args hashref to fastdistance as a final parameter?

    if ($args->{without_diacritics}) {
        remove_diacritics(\$word1);
        remove_diacritics(\$word2);
    }

On Fri, Aug 8, 2014 at 1:49 PM, Neil_Bowers via RT <bug-Text-Levenshtein@rt.cpan.org> wrote:

> Queue: Text-Levenshtein
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=97883 >
>
> Ok, that's a non-starter, as it would break where one bit of code in an
> app uses it without diacritics, and somewhere else in the same app uses it
> with diacritics.
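The hashref idea above could be sketched roughly as follows. This is an illustration only: distance_with_args is a hypothetical name, and remove_diacritics is assumed to work by decomposing each string with Unicode::Normalize and dropping the combining marks, which may not be what was intended or what the module ended up doing.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFD );

# Assumed helper: takes a scalar reference and strips combining marks
# in place, e.g. U+015A (S with acute) becomes a plain 'S'.
sub remove_diacritics {
    my ($ref) = @_;
    my $decomposed = NFD($$ref);    # base letter followed by its marks
    $decomposed =~ s/\p{Mn}//g;     # drop the nonspacing marks
    $$ref = $decomposed;
    return;
}

# Hypothetical function: plain Levenshtein distance with an optional
# trailing hashref of arguments.
sub distance_with_args {
    my ($word1, $word2, $args) = @_;
    $args ||= {};

    if ($args->{without_diacritics}) {
        remove_diacritics(\$word1);
        remove_diacritics(\$word2);
    }

    my @s = split //, $word1;
    my @t = split //, $word2;

    my @prev = (0 .. scalar @t);    # distances from the empty prefix
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # substitute
            $best = $prev[$j] + 1    if $prev[$j] + 1    < $best;  # delete
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;  # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my @cities = ( 'Swidnica', "\x{15A}widnica" );
print distance_with_args(@cities), "\n";
print distance_with_args(@cities, { without_diacritics => 1 }), "\n";
```

Since the option travels with each call, two parts of the same app can ask for different behaviour. One limitation of this stripping approach: letters with no canonical decomposition, such as 'Ł', are left untouched.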
RT-Send-CC: josh [...] 3io.com
On Fri Aug 08 16:28:13 2014, NEILB wrote:

> Hi Olaf,
>
> Yeah, this seems like a good option to add. Something like:
>
> use Text::Levenshtein qw/ distance -ignore-diacritics /;
>
> I'm just trying to think of the best way to do this, without
> compromising performance for the regular case.
>
> Quickest solution is to have two functions
>
> fastdistance_with_diacritics
> fastdistance_without_diacritics
>
> And then depending on which option you ask for,
>
> *fastdistance = \&fastdistance_without_diacritics
>
> This optimises for performance, but there would be a lot of cut &
> paste.
>
> In the perl 4 days, I would have built up the function as a string and
> eval'd it.
>
> Suggestions?
>
> Neil
One really simple way would be to allow the user to provide her own callback to replace the use of eq(). The module would default to using eq, but use the callback if provided.

That would make it easy to create another module on CPAN that just supplies its own callback which ignores diacritics. People could possibly add their own transformations as well (like lowercasing before comparison, etc.) and that wouldn't be your responsibility to implement.

Olaf
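A rough sketch of the callback idea, for illustration only: distance_with_eq and its third parameter are hypothetical and are not the module's interface. The collator settings are the ones from the original example in this ticket.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Hypothetical function: Levenshtein distance where the caller may
# supply the character-equality test as a code reference.
sub distance_with_eq {
    my ($word1, $word2, $eq) = @_;
    $eq ||= sub { $_[0] eq $_[1] };    # default: plain string equality

    my @s = split //, $word1;
    my @t = split //, $word2;

    my @prev = (0 .. scalar @t);       # distances from the empty prefix
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $eq->($s[$i - 1], $t[$j - 1]) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # substitute
            $best = $prev[$j] + 1    if $prev[$j] + 1    < $best;  # delete
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;  # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Callback built on the collator from the original report.
my $collator   = Unicode::Collate->new( normalization => undef, level => 1 );
my $collate_eq = sub { $collator->eq($_[0], $_[1]) };

# Callback for the lowercasing transformation mentioned above.
my $fold_case = sub { lc($_[0]) eq lc($_[1]) };

my @cities = ( 'Swidnica', 'Świdnica' );
print distance_with_eq(@cities), "\n";
print distance_with_eq(@cities, $collate_eq), "\n";
print distance_with_eq('PARIS', 'paris', $fold_case), "\n";
```

The trade-off is one extra subroutine call per character comparison, which bears on the performance concern raised earlier in the thread.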
I've done a first brain-dump of thoughts related to how we could approach this:

https://docs.google.com/document/d/19-6u9nGxeHvMnRLf9wNLpuNv5opLNjhMWPgitnUgFtA/edit#

Anyone with the link can edit, so feel free to add additional approaches, or comments on the existing ones.

Neil
Hi Olaf,

I've added support for optional arguments, with the only optional argument at the moment being ignore_diacritics.

Make sure you install 0.11 and not 0.10, as the latter was very inefficient :-)

Cheers,
Neil