Bug #19439 for Text-Brew: Text::Brew treats UTF-8 chars as a sequence of bytes

Tue May 23 10:51:30 2006 Guest - Ticket created

Subject:

Text::Brew treats UTF-8 chars as a sequence of bytes

When finding the edit distance between the following strings:

    vuoddu
    vuođđu

(the second string should contain two consecutive instances of d-stroke (U+0111) between the vowels o and u), Text::Brew reports a distance of 4 instead of the expected 2. The test pair is from Northern Sámi.

perl -v:
This is perl, v5.8.6 built for darwin-thread-multi-2level

uname -a:
Darwin a84-231-7-118.elisa-laajakaista.fi 8.6.0 Darwin Kernel Version 8.6.0: Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power Macintosh powerpc

(aka MacOS X 10.4.6)

Tue May 23 12:31:43 2006 keith [...] iveys.org - Correspondence added

From:

Keith Ivey

I think there's something else going on that's not specific to the module. Does Text::Levenshtein give you similar problems? This is working for me (gives 2):

    use Text::Brew 'distance';
    print +(distance('vuoddu', "vuo\x{0111}\x{0111}u"))[0], "\n";

This is perl, v5.8.5 built for i386-linux-thread-multi
Linux 2.6.12-1.1381_FC3 i686

I haven't really done anything with Text::Brew other than fixing a bug I ran across and thus falling into taking over maintenance, but I can certainly apply a patch if you determine what's going on and find a fix.

Tue May 23 12:31:44 2006 The RT System itself - Status changed from 'new' to 'open'

Tue May 23 12:33:10 2006 keith [...] iveys.org - Taken

Tue May 23 13:37:44 2006 sjur.moshagen [...] samediggi.no - Correspondence added

Subject:	Re: [rt.cpan.org #19439] Text::Brew treats UTF-8 chars as a sequence of bytes
Date:	Tue, 23 May 2006 20:36:31 +0300
To:	bug-Text-Brew [...] rt.cpan.org
From:	Sjur Nørstebø Moshagen <sjur.moshagen [...] samediggi.no>

On 23 May 2006 at 19:31, Keith Ivey via RT wrote:

> <URL: http://rt.cpan.org/Ticket/Display.html?id=19439 >
>
> I think there's something else going on that's not specific to the
> module. Does Text::Levenshtein give you similar problems?

I haven't tested yet. But...

> This is working for me (gives 2):
>
> use Text::Brew 'distance';
> print +(distance('vuoddu', "vuo\x{0111}\x{0111}u"))[0], "\n";

It gives 2 as result for me as well, whereas

    use Text::Brew 'distance';
    print +(distance('vuoddu', "vuođđu"))[0], "\n";

returns 4. A simpler test:

    use Text::Brew 'distance';
    print +(distance('abc', "ábc"))[0], "\n";

returns 2, whereas:

    use Text::Brew 'distance';
    print +(distance('abc', "\x{00E1}bc"))[0], "\n";

returns 1 as expected.
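The numbers reported above (2 versus 4, and 1 versus 2) are what one would expect if the literal typed directly into the source reaches the distance function as UTF-8 bytes rather than as characters. The following is a minimal sketch of that effect. It uses a plain unit-cost Levenshtein distance as a stand-in, not Text::Brew's own implementation, and the helper name edit_distance is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Plain unit-cost Levenshtein distance (a stand-in, NOT Text::Brew itself).
# It compares whatever units Perl sees in the strings: characters or bytes.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);           # distances from the empty prefix
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                        # substitute
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;       # delete
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;    # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my $chars = "vuo\x{0111}\x{0111}u";     # 6 characters
my $bytes = encode('UTF-8', $chars);    # the same text as 8 UTF-8 bytes

my $char_distance = edit_distance('vuoddu', $chars);
my $byte_distance = edit_distance('vuoddu', $bytes);

print "characters: $char_distance\n";   # 2
print "bytes:      $byte_distance\n";   # 4
```

Each d-stroke occupies two bytes in UTF-8, so at the byte level the two d's are replaced by four units: two substitutions plus two insertions.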

Tue May 23 13:55:15 2006 sjur.moshagen [...] samediggi.no - Correspondence added

Subject:	Re: [rt.cpan.org #19439] Text::Brew treats UTF-8 chars as a sequence of bytes
Date:	Tue, 23 May 2006 20:54:28 +0300
To:	bug-Text-Brew [...] rt.cpan.org
From:	Sjur Nørstebø Moshagen <sjur.moshagen [...] samediggi.no>

Just some more info that might be relevant:

LC_ALL=no_NO.UTF-8
ENV=/Users/sjur/.bashrc
__CF_USER_TEXT_ENCODING=0x1F6:37:47
LANG=no_NO.utf8

Sjur

Tue May 23 14:26:41 2006 keith [...] iveys.org - Correspondence added

> A simpler test:
>
> use Text::Brew 'distance';
> print +(distance('abc', "ábc"))[0], "\n";
>
> returns 2, whereas:
>
> use Text::Brew 'distance';
> print +(distance('abc', "\x{00E1}bc"))[0], "\n";
>
> returns 1 as expected.

Like most Americans, I know less about character encoding than I should, but I think this has to do with perl -- nothing specific to this module. For example, try just printing the length of a string literal that's in UTF-8 in your source. Perl is treating it as a sequence of bytes. If you add "use utf8;" at the start of your script that should solve the problem by telling perl that your source is in UTF-8.
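The length check suggested above can be sketched without depending on how the script file itself is saved. This example assumes that a UTF-8 literal in a source file without "use utf8;" behaves like the output of Encode's encode(), which is the byte string Perl would see. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The word as a character string: six characters, two of them U+0111.
my $chars = "vuo\x{0111}\x{0111}u";

# The same word as the UTF-8 byte sequence stored in a source file.
# Assumption: without "use utf8;", a literal is seen as these raw bytes.
my $bytes = encode('UTF-8', $chars);

print length($chars), "\n";   # 6
print length($bytes), "\n";   # 8

# Decoding the bytes restores the character view, which is roughly what
# "use utf8;" arranges for literals in the source.
my $round_trip = decode('UTF-8', $bytes);
print length($round_trip), "\n";   # 6
```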

Tue May 23 15:23:44 2006 sjur.moshagen [...] samediggi.no - Correspondence added

Subject:	Re: [rt.cpan.org #19439] Text::Brew treats UTF-8 chars as a sequence of bytes
Date:	Tue, 23 May 2006 22:23:16 +0300
To:	bug-Text-Brew [...] rt.cpan.org
From:	Sjur Nørstebø Moshagen <sjur.moshagen [...] samediggi.no>

On 23 May 2006 at 21:26, Keith Ivey via RT wrote:

> Like most Americans, I know less about character encoding than I
> should, but I think this has to do with perl -- nothing specific to
> this module. For example, try just printing the length of a string
> literal that's in UTF-8 in your source. Perl is treating it as a
> sequence of bytes. If you add "use utf8;" at the start of your script
> that should solve the problem by telling perl that your source is in
> UTF-8.

You were right - the problem was with the locale, not with Text::Brew. You can close the bug.

One final note for others with similar problems: "use utf8;" doesn't completely do the job. It works fine for string literals, but if one is reading UTF-8 text from a file and expects perl to set the encoding correctly from a UTF-8 locale, the correct 'use' line is "use open ':locale';".

Thanks for the help, and sorry for bothering you!

Sjur
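The file-input case mentioned in the closing note can be sketched as follows. Because "use open ':locale';" depends on the environment the script runs in, this example instead passes an explicit ':encoding(UTF-8)' layer to open(), which should have the same effect under a UTF-8 locale. The scratch file and variable names are illustrative only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the word to a scratch file as UTF-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} "vuo\x{0111}\x{0111}u\n";
close $out or die "close: $!";

# Read it back with no decoding layer: Perl sees raw bytes.
open(my $raw, '<:raw', $path) or die "open: $!";
my $as_bytes = <$raw>;
close $raw;
chomp $as_bytes;

# Read it back with an explicit decoding layer: Perl sees characters.
# Under a UTF-8 locale, "use open ':locale';" would select a comparable
# layer by default instead of naming it on each open().
open(my $in, '<:encoding(UTF-8)', $path) or die "open: $!";
my $as_chars = <$in>;
close $in;
chomp $as_chars;

print length($as_bytes), "\n";   # 8
print length($as_chars), "\n";   # 6
```

Only the decoded string would give the expected character-level distance when passed on to a function such as Text::Brew's distance().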

Tue May 23 15:29:45 2006 keith [...] iveys.org - Status changed from 'open' to 'rejected'