
This queue is for tickets about the Text-Unidecode CPAN distribution.

Report information
The Basics
Id: 96747
Status: rejected
Priority: 0/
Queue: Text-Unidecode

People
Owner: sburke [...] cpan.org
Requestors: apostole [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: characters transliterated to non characters
Date: Thu, 26 Jun 2014 22:12:41 +0200
To: bug-Text-Unidecode [...] rt.cpan.org
From: Apostol Karovski <zapirkon [...] gmail.com>
For example, U+018F, U+0259, and other Unicode characters transliterate to "@". It seems that characters with a pronunciation similar to [æ] are transliterated to "@". Why not make them transliterate to "a" or "e" or "ae"?

I am noting this because words should contain letters, but "@" is not a letter and it almost always means "at". Furthermore, logically, if I have a word transliterated, for every character in the new word, Character.isLetter() should return True.

Another thing that bothers me is the transliteration to numerals. (Example: U+0184, U+0185, and U+018E are represented as "6", "6", and "3" respectively. Here, U+018E is maybe even wrong, since it is described as reversed E.)

P.S. Are the tables managed in any way? Can I get insight into how they are made and maintained?
On Thu Jun 26 16:13:10 2014, apostole@gmail.com wrote:
> [...] Furthermore, logically, if I have a
> word transliterated, for every character in the new word,
> Character.isLetter() should return True.
Not only is that not a goal of mine, the documentation says that *that specifically* is something you can't assume. Section "DESIGN GOALS AND CONSTRAINTS":

« For example, if you assume an all-alphabetic (Unicode) string passed to unidecode(...) will return an all-alphabetic string, you're wrong-- some alphabetic Unicode characters are transliterated as strings containing punctuation (e.g., the Armenian letter "Թ" (U+0539), currently transliterates as "T`" (capital-T then a backtick). »

As to "@" for schwa, that's a convention that other linguists and I have used when we've needed to do pseudo-IPA in 7-bit. I didn't make it up from nothing-- and I think I'll leave it the way it is, because...
> Other thing that bothers me is the transliteration to numerical
...we see things differently. You, I, and other users would choose different approaches to transliteration for particular blocks. In *many* cases, I went for graphic similarity, hence Ǝ → 3. It sounds like we have different philosophies for U+01xx and U+02xx.

See the documentation section "WHEN YOU DON'T LIKE WHAT UNIDECODE DOES". As H. L. Mencken once said: You may be right. Anything worth doing right, is worth you doing right the way you like it. And then pass the result off to Unidecode to do cleanup if it has any Malayalam, or Greek, or Tibetan, or fullwidth characters, etc.

Unicode is big enough that everyone will find some part of Unidecode that seems totally wrongheaded. (Many of them are Chinese and they are very angry... and wonderfully contradictory. I wish I could introduce them all to each other.)
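The "do it your way first, then hand the rest to Unidecode" advice above can be sketched roughly as follows. This is an illustrative sketch in Python, not code from Text::Unidecode: the OVERRIDES table, the transliterate() helper, and the ascii_or_question_mark() stand-in are all hypothetical names, and the replacement letters are just one possible choice. In real use the fallback would be an actual transliterator such as unidecode().

```python
# Hypothetical per-project overrides, applied before any general-purpose
# transliterator sees the text. The replacements are illustrative only.
OVERRIDES = {
    "\u018F": "A",  # LATIN CAPITAL LETTER SCHWA
    "\u0259": "a",  # LATIN SMALL LETTER SCHWA
    "\u018E": "E",  # LATIN CAPITAL LETTER REVERSED E
}


def transliterate(text, fallback):
    """Apply OVERRIDES first, then pass the result to `fallback` for cleanup."""
    table = str.maketrans(OVERRIDES)
    return fallback(text.translate(table))


def ascii_or_question_mark(text):
    # Stand-in for a real transliterator (an assumption for this sketch):
    # it only replaces anything outside 7-bit ASCII with "?".
    return "".join(ch if ord(ch) < 128 else "?" for ch in text)


if __name__ == "__main__":
    print(transliterate("\u0259b\u018E", ascii_or_question_mark))
```

Because the overrides run first, the fallback never sees the characters you disagree about, so its own choices for them ("@", "3", and so on) stop mattering for your data.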
> P.S. Are the tables in anyway managed? Can I get insight of how they are
> made and maintained?
I made them, and I manage them locally. For insight, read all the new documentation, and also the Perl Journal article about it: http://interglacial.com/tpj/22/