
This queue is for tickets about the Text-Brew CPAN distribution.

Report information
The Basics
Id: 19439
Status: rejected
Priority: 0/
Queue: Text-Brew

People
Owner: keith [...] iveys.org
Requestors: sjur.moshagen [...] samediggi.no
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 0.02
Fixed in: (no value)



Subject: Text::Brew treats UTF-8 chars as a sequence of bytes
When finding the edit distance between the following strings:

    vuoddu
    vuođđu

(the second string should contain two consecutive instances of d-stroke (U+0111) between the vowels o and u), Text::Brew reports a distance of 4 instead of the expected 2. The test pair is from Northern Sámi.

perl -v:
    This is perl, v5.8.6 built for darwin-thread-multi-2level

uname -a:
    Darwin a84-231-7-118.elisa-laajakaista.fi 8.6.0 Darwin Kernel Version 8.6.0: Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power Macintosh powerpc

(aka MacOS X 10.4.6)
From: Keith Ivey
I think there's something else going on that's not specific to the module. Does Text::Levenshtein give you similar problems? This is working for me (gives 2):

    use Text::Brew 'distance';
    print +(distance('vuoddu', "vuo\x{0111}\x{0111}u"))[0], "\n";

    This is perl, v5.8.5 built for i386-linux-thread-multi
    Linux 2.6.12-1.1381_FC3 i686

I haven't really done anything with Text::Brew other than fixing a bug I ran across and thus falling into taking over maintenance, but I can certainly apply a patch if you determine what's going on and find a fix.
Subject: Re: [rt.cpan.org #19439] Text::Brew treats UTF-8 chars as a sequence of bytes
Date: Tue, 23 May 2006 20:36:31 +0300
To: bug-Text-Brew [...] rt.cpan.org
From: Sjur Nørstebø Moshagen <sjur.moshagen [...] samediggi.no>
On 23 May 2006 at 19:31, Keith Ivey via RT wrote:
> <URL: http://rt.cpan.org/Ticket/Display.html?id=19439 >
>
> I think there's something else going on that's not specific to the
> module. Does Text::Levenshtein give you similar problems?
I haven't tested yet. But...
> This is
> working for me (gives 2):
>
> use Text::Brew 'distance';
> print +(distance('vuoddu', "vuo\x{0111}\x{0111}u"))[0], "\n";
It gives 2 as the result for me as well, whereas

    use Text::Brew 'distance';
    print +(distance('vuoddu', "vuođđu"))[0], "\n";

returns 4. A simpler test:

    use Text::Brew 'distance';
    print +(distance('abc', "ábc"))[0], "\n";

returns 2, whereas:

    use Text::Brew 'distance';
    print +(distance('abc', "\x{00E1}bc"))[0], "\n";

returns 1 as expected.
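The numbers reported above (2 versus 4, and 1 versus 2) are what one would expect if the literal were compared byte by byte rather than character by character. The following is a minimal sketch of that effect, not Text::Brew's own code: `edit_distance` is a hypothetical helper implementing plain Levenshtein distance with unit costs, which is assumed here to match Text::Brew's default costs. `Encode::encode` is used to produce the UTF-8 byte form of each string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: plain Levenshtein distance with unit costs,
# computed over whatever units split // yields (characters for a
# character string, bytes for a byte string).
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. @s) {
        my @cur = ($i);
        for my $j (1 .. @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                  # substitute or match
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;    # delete
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best; # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Character strings built with escapes, and their UTF-8 byte encodings.
my $sami_chars = "vuo\x{0111}\x{0111}u";
my $sami_bytes = encode('UTF-8', $sami_chars);
my $abc_chars  = "\x{00E1}bc";
my $abc_bytes  = encode('UTF-8', $abc_chars);

my $sami_char_dist = edit_distance('vuoddu', $sami_chars);
my $sami_byte_dist = edit_distance('vuoddu', $sami_bytes);
my $abc_char_dist  = edit_distance('abc', $abc_chars);
my $abc_byte_dist  = edit_distance('abc', $abc_bytes);

print "vuoddu: chars=$sami_char_dist bytes=$sami_byte_dist\n";
print "abc:    chars=$abc_char_dist bytes=$abc_byte_dist\n";
```

Each non-ASCII character here occupies two bytes in UTF-8, so every substitution at the character level turns into one substitution plus one insertion at the byte level.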
Subject: Re: [rt.cpan.org #19439] Text::Brew treats UTF-8 chars as a sequence of bytes
Date: Tue, 23 May 2006 20:54:28 +0300
To: bug-Text-Brew [...] rt.cpan.org
From: Sjur Nørstebø Moshagen <sjur.moshagen [...] samediggi.no>
Just some more info that might be relevant:

    LC_ALL=no_NO.UTF-8
    ENV=/Users/sjur/.bashrc
    __CF_USER_TEXT_ENCODING=0x1F6:37:47
    LANG=no_NO.utf8

Sjur
> A simpler test:
>
> use Text::Brew 'distance';
> print +(distance('abc', "ábc"))[0], "\n";
>
> returns 2, whereas:
>
> use Text::Brew 'distance';
> print +(distance('abc', "\x{00E1}bc"))[0], "\n";
>
> returns 1 as expected.
Like most Americans, I know less about character encoding than I should, but I think this has to do with perl -- nothing specific to this module. For example, try just printing the length of a string literal that's in UTF-8 in your source. Perl is treating it as a sequence of bytes. If you add "use utf8;" at the start of your script that should solve the problem by telling perl that your source is in UTF-8.
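The length check suggested above can be sketched as follows. This assumes the script file itself is saved as UTF-8; the variable names are illustrative only. The `no utf8` block stands in for a script that lacks the pragma altogether, so both behaviours can be seen in one file.

```perl
use strict;
use warnings;
use utf8;    # declares that this source file is encoded in UTF-8

# With the pragma in effect, the literal is read as three characters.
my $as_chars = length("ábc");

my $as_bytes;
{
    no utf8;    # inside this block, the literal is read as raw bytes
    $as_bytes = length("ábc");
}

print "with use utf8: $as_chars\n";
print "without:       $as_bytes\n";
```

If the two lengths differ, the string literal is being handed to any module, Text::Brew included, as bytes rather than characters.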
Subject: Re: [rt.cpan.org #19439] Text::Brew treats UTF-8 chars as a sequence of bytes
Date: Tue, 23 May 2006 22:23:16 +0300
To: bug-Text-Brew [...] rt.cpan.org
From: Sjur Nørstebø Moshagen <sjur.moshagen [...] samediggi.no>
On 23 May 2006 at 21:26, Keith Ivey via RT wrote:
> Like most Americans, I know less about character encoding than I
> should, but I think this has to do with perl -- nothing specific to
> this module. For example, try just printing the length of a string
> literal that's in UTF-8 in your source. Perl is treating it as a
> sequence of bytes. If you add "use utf8;" at the start of your script
> that should solve the problem by telling perl that your source is in
> UTF-8.
You were right - the problem was with the locale, not with Text::Brew. You can close the bug.

One final note for others with similar problems: "use utf8;" doesn't completely do the job. It works fine for string literals, but if one is reading UTF-8 text from a file and expects perl to set the encoding correctly from a UTF-8 locale, the correct 'use' line is "use open ':locale';".

Thanks for the help, and sorry for bothering you!

Sjur
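For file input, the same byte-versus-character difference can be sketched like this. Note that this example swaps in an explicit `:encoding(UTF-8)` layer on `open` instead of `use open ':locale';`, so that the result does not depend on the environment it runs in; the temporary file and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one line of UTF-8 text to a temporary file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} "vuo\x{0111}\x{0111}u\n";
close($out) or die "close failed: $!";

# Read it back as raw bytes: no decoding takes place.
open(my $raw, '<:raw', $path) or die "open failed: $!";
my $line_bytes = <$raw>;
close($raw);
chomp $line_bytes;

# Read it back through a decoding layer: perl sees characters.
open(my $dec, '<:encoding(UTF-8)', $path) or die "open failed: $!";
my $line_chars = <$dec>;
close($dec);
chomp $line_chars;

print "raw length:     ", length($line_bytes), "\n";
print "decoded length: ", length($line_chars), "\n";
```

Only the decoded line would give a character-level edit distance when passed to a distance function.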