Subject: charset issues in ngrams
Checking out the language modules in Lingua::Identify 0.08, I see that only latin1 chars are used. Omitting accented chars breaks identification of some languages, or at least leads to misidentification. For instance, Swedish and Bulgarian get mixed up. Is omitting accented chars a design criterion? And is there a simple and/or programmatic way of creating new language packs? I'd like to regenerate the ngrams from a larger data set and submit more accurate modules to the distro.
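
To illustrate the kind of thing I have in mind, here is a minimal sketch of building a character trigram profile from sample text, with and without accent stripping. It is written in Python only for brevity and is not how Lingua::Identify builds its modules; the function names `strip_accents` and `ngram_profile` are my own, and the output format is an assumption, not the module's actual language pack format.

```python
import unicodedata
from collections import Counter


def strip_accents(text):
    """Remove combining marks, e.g. 'smörgåsbord' -> 'smorgasbord'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ngram_profile(text, n=3, top=None):
    """Return relative frequencies of character n-grams (hypothetical format)."""
    # Lowercase and collapse runs of whitespace so word boundaries are one space.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.most_common(top)}


if __name__ == "__main__":
    sample = "smörgåsbord är ett svenskt ord"
    with_accents = ngram_profile(sample)
    without_accents = ngram_profile(strip_accents(sample))
    # Trigrams that exist only when accented chars are kept.
    print(sorted(set(with_accents) - set(without_accents)))
```

Running it on a Swedish sample shows trigrams such as "mör" that disappear once accents are stripped, which is the information I suspect is being lost in the current modules.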