Bug #64662 for Encode: Unicode encoding failure for i-macron + e-grave + i-macron

Mon Jan 10 17:47:57 2011 rlarson [...] unlnotes.unl.edu - Ticket created

Subject:	Unicode encoding failure for i-macron + e-grave + i-macron
Date:	Mon, 10 Jan 2011 16:47:07 -0600
To:	bug-Encode [...] rt.cpan.org
From:	Rory M Larson <rlarson [...] unlnotes.unl.edu>

Hello,

I believe I have run into a corruption bug in the Encode package, apparently 2.42 and more certainly in 2.39. I am running ActivePerl 5.12.2 on a Microsoft Windows Vista Ultimate SP2 platform.

I am working on a language dictionary, starting from Latin mark-up character data in an input file, which is to be converted to Unicode with accented vowels and raised n's in the output. I read in my typed files, and convert the foreign language material using s/[markup characters]/[Unicode output character]/g. This works very well for most words.

I have run into one particular sequence that causes output file corruption, however. Using e4 for e-grave and i1 for i-macron, and the input string i1e4i1 intended to convert to i-macron + e-grave + i-macron, I get corrupted output.

my $instring = 'i1e4i1' # input string

$instring =~ s/e4/$e_grave/g # converts e4 to e-grave
$instring =~ s/i1/$i_macron/g # converts i1's to i-macron; fails with file corruption

I can add other characters before, after, and in between the i1-e4-i1 characters, and get the same results.

But if I first substitute a higher Unicode character after the second i1, then it succeeds nicely. The problem only occurs when a word with i-macron + e-grave + i-macron doesn't legitimately have such another character.

I've been struggling with this for about a week trying to pin it down, and I have found some other combinations that cause this sort of corruption as well. If you need more information beyond the i1-e4-i1 sample, please let me know.

Thanks for your time. I'm including my Perl code illustrating this sample below. It produces the following three output files for me:

Rory

#!/usr/bin/perl
# outf8test2.plx
use warnings;
use strict;
use Encode;

#######################################################################
###
### outf8test2.plx
###
### This script illustrates an odd glitch in the process of encoding
### Unicode characters from marked up Latin characters. The character
### pair e4 should be converted to e-grave, i1 should be converted to
### i-macron, and capital N should be converted to raised-n.

my $e_grave = chr(0x00e8); # Unicode character for e-grave
my $i_macron = chr(0x012b); # Unicode character for i-macron
my $nasal = chr(0x207f); # Unicode character for raised-n

my $utf8; # UTF8 Output file handle

### The character sequence i-macron + e-grave + raised-n outputs to
### unrecognized Unicode. In the output file, e4 changes correctly
### to e-grave, but both i1 => i-macron characters appear as the pair
### A-umlaut + [a double left angle bracket (less-than sign)]

my $inlinex1 = 'i1e4i1'; # i1-e4-i1

$inlinex1 =~ s/e4/$e_grave/g; # Convert e4 to e-grave
$inlinex1 =~ s/i1/$i_macron/g; # Convert i1 to i-macron: fails

my $outfilex1 = 'outx1.utf8';
open($utf8, '>:utf8', $outfilex1);
print $utf8 "$inlinex1\n";
close($utf8);

### If I try to read the file just produced, I get an error message:
### utf8 "\xE8" does not map to Unicode at outf8test2.plx line 48.
### Wide character in print at outf8test2.plx line 52.
### [-1/2]\xE8[-1/2]
### ([-1/2] is a single special character.)

my $utf8in;
my $infile = $outfilex1;
my $inline;

open($utf8in, '<:encoding(utf8)', $infile) or die $!;
$inline = readline $utf8in;
$inline =~ s/\s+$//g;
close($utf8in);

print "$inline\n";

### Adding characters before, after, or between the -i1-e4-i1- pairs
### makes no difference. The unrecognized characters for i1 still
### appear instead.

my $inlinex2 = 'ni1qubee4dubai1snare'; # n-i1-qube-e4-duba-i1-snare

$inlinex2 =~ s/e4/$e_grave/g; # Convert e4 to e-grave
$inlinex2 =~ s/i1/$i_macron/g; # Convert i1 to i-macron: fails

my $outfilex2 = 'outx2.utf8';
open($utf8, '>:utf8', $outfilex2);
print $utf8 "$inlinex2\n";
close($utf8);

### But if I add a higher Unicode character after the second i1,
### and substitute it before I substitute the i1's, then
### everything comes out fine as expected. I indeed get
### i-macron + e-grave + i-macron + raised-n.

my $inlinex3 = 'i1e4i1N'; # i1-e4-i1-N

$inlinex3 =~ s/N/$nasal/g; # Convert N to raised-n
$inlinex3 =~ s/e4/$e_grave/g; # Convert e4 to e-grave
$inlinex3 =~ s/i1/$i_macron/g; # Convert i1 to i-macron: succeeds

my $outfilex3 = 'outx3.utf8';
open($utf8, '>:utf8', $outfilex3);
print $utf8 "$inlinex3\n";
close($utf8);

### I am running this on a Windows Vista Ultimate SP2 platform.
### I am using ActivePerl 5.12.2.
### Encode is either 2.42 (installed) or 2.39, which I can't seem to uninstall.
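The read-back error in the script comments (utf8 "\xE8" does not map to Unicode) suggests looking at the raw bytes of the output file. The following is a minimal diagnostic sketch, not part of the original report: it builds the intended string directly from code points, encodes it with Encode's encode(), and prints the octets in hex so they can be compared with what is on disk. The helper names hex_octets and hex_file are hypothetical, and the script assumes it is run in the directory holding outx1.utf8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of a character string's UTF-8 encoding.
sub hex_octets {
    my ($chars) = @_;
    my $octets = encode('UTF-8', $chars);
    return join ' ', unpack '(H2)*', $octets;
}

# Hypothetical helper: hex dump of a file's raw bytes, with no decoding layer.
sub hex_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                          # slurp the whole file
    my $bytes = <$fh>;
    close($fh);
    $bytes = '' unless defined $bytes; # empty file
    return join ' ', unpack '(H2)*', $bytes;
}

# Intended result, built straight from code points rather than via s///.
my $expected = join '', map { chr($_) } 0x012b, 0x00e8, 0x012b;

print "expected: ", hex_octets($expected), "\n";   # c4 ab c3 a8 c4 ab
print "on disk:  ", hex_file('outx1.utf8'), "\n" if -e 'outx1.utf8';
```

In well-formed UTF-8, e-grave is the two-byte sequence c3 a8. A lone e8 between the c4 ab pairs in the file would be consistent with the decoder's complaint about "\xE8".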

Attachments:
outx1.utf8 (application/octet-stream, 7 bytes)
outx2.utf8 (application/octet-stream, 21 bytes)
outx3.utf8 (application/octet-stream, 11 bytes)

Sat May 21 18:19:53 2011 DANKOGAI [...] cpan.org - Correspondence added

You forgot to say 'use utf8;'. Read perldoc perluniintro again.

Dan the Encode Maintainer

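For readers following the pointer to 'use utf8;', here is a minimal sketch of what that pragma changes, assuming the script file itself is saved as UTF-8. The pragma tells perl to decode non-ASCII literals typed into the source as characters rather than bytes; it applies to source text only, so values produced at run time by chr() are the same with or without it. The variable names are illustrative and not taken from the ticket.

```perl
use strict;
use warnings;

my ($with, $without);

{
    use utf8;            # literals in this block are decoded as characters
    $with = 'īèī';
}

{
    no utf8;             # the same literal is taken as raw UTF-8 bytes
    $without = 'īèī';
}

print length($with), "\n";      # 3 characters
print length($without), "\n";   # 6 bytes
```

The pragma is lexically scoped, which is why the two blocks can disagree about the same literal within one file.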

Sat May 21 18:19:54 2011 The RT System itself - Status changed from 'new' to 'open'

Sat May 21 18:19:54 2011 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'