
This queue is for tickets about the Lingua-Identify CPAN distribution.

Report information
The Basics
Id: 8670
Status: resolved
Priority: 0/
Queue: Lingua-Identify

People
Owner: cog [...] cpan.org
Requestors: oyku [...] gencay.net
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.08
Fixed in: (no value)



Subject: charset issues in ngrams
Checking out the language modules in Lingua::Identify 0.08, I see that only latin1 chars are used. Omitting accented chars breaks identification of some languages, or at least leads to misidentification. For instance, Swedish and Bulgarian get mixed up. Is omitting accented chars a design criterion? And is there a simple and/or programmatic way of creating new language packs? I'd like to modify the ngrams using a larger data set and submit more accurate modules to the distro.
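To illustrate the concern in the report above, here is a minimal sketch in Python (not the module's own Perl code, and not how Lingua::Identify actually builds its profiles). The function name `char_ngrams`, the `ascii_only` flag, and the sample sentence are all hypothetical. It only shows that dropping accented characters before counting trigrams removes distinctive trigrams and creates ones that never occur in the real text.

```python
from collections import Counter


def char_ngrams(text, n=3, ascii_only=False):
    """Count character n-grams, optionally dropping non-ASCII chars first."""
    text = text.lower()
    if ascii_only:
        # Assumption: this mimics a profile built without accented characters.
        text = "".join(ch for ch in text if ch.isascii())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Hypothetical sample; any text with accented letters shows the same effect.
sample = "smörgåsbord på svenska"

full = char_ngrams(sample)
stripped = char_ngrams(sample, ascii_only=True)

# Trigrams that exist only when accented characters are kept.
lost = sorted(set(full) - set(stripped))
# Artificial trigrams produced by gluing the surrounding letters together.
spurious = sorted(set(stripped) - set(full))

print("lost:", lost)
print("spurious:", spurious)
```

The lost trigrams are the ones most characteristic of the language, which would be consistent with the misidentification described in this ticket.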
Yes, so far, only latin1 chars are used. The problem with accented chars has been identified from the beginning, and will be taken care of (Lingua::Identify is already usable, but it's still under development; it is still unfinished).

Consider this a "temporary design criteria" :-) It will change soon :-)

As to a way of creating new language packs, yes, there is a way, just not in the distribution yet. As soon as I get back to Portugal (less than a week, I hope) I'll take care of that. More modules and more accurate ones are, of course, welcome :-) I'll take care of that ASAP.
RT-Send-CC: cog [...] cpan.org
On Tue Nov 30 09:04:00 2004, COG wrote:
> Yes, so far, only latin1 chars are used. The problem with accented chars
> has been identified from the beginning, and will be taken care of
> (Lingua::Identify is already usable, but it's still under development;
> it is still unfinished).
>
> Consider this a "temporary design criteria" :-) It will change soon :-)
>
> As to a way of creating new language packs, yes, there is a way, just
> not in the distribution yet. As soon as I get back to Portugal (less
> than a week, I hope) I'll take care of that. More modules and more
> accurate ones are, of course, welcome :-) I'll take care of that ASAP.
The latest versions are UTF-8 aware, so all these problems should be solved now. Reopen the bug if needed.

Thanks,
Ambs