Subject: Unicode encoding failure for i-macron + e-grave + i-macron
Date: Mon, 10 Jan 2011 16:47:07 -0600
To: bug-Encode [...] rt.cpan.org
From: Rory M Larson <rlarson [...] unlnotes.unl.edu>
Hello,
I believe I have run into a corruption bug in the Encode package,
apparently in version 2.42 and more certainly in 2.39. I am running
ActivePerl 5.12.2 on a Microsoft Windows Vista Ultimate SP2 platform.
I am working on a language dictionary, starting from Latin mark-up
character data in an input file, which is to be converted to Unicode with
accented vowels and raised n's in the output. I read in my typed files,
and convert the foreign language material using s/[markup
characters]/[Unicode output character]/g. This works very well for most
words.
I have run into one particular sequence that causes output file
corruption, however. Using e4 for e-grave and i1 for i-macron, and the
input string i1e4i1 intended to convert to i-macron + e-grave + i-macron,
I get corrupted output.
my $instring = 'i1e4i1';        # input string
$instring =~ s/e4/$e_grave/g;   # converts e4 to e-grave
$instring =~ s/i1/$i_macron/g;  # converts i1's to i-macron; fails with file corruption
I can add other characters before, after, and in between the i1-e4-i1
characters, and get the same results.
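For comparison with a hex dump of the corrupted output, here is a minimal sketch of the bytes a well-formed UTF-8 file should contain for i-macron + e-grave + i-macron. It builds the string directly from code points rather than through s///, and the variable names ($expected, $octets, $hex) are illustrative only, not taken from the script below.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build the intended string directly from code points, bypassing s///.
my $expected = chr(0x012b) . chr(0x00e8) . chr(0x012b);

# Encode to UTF-8 octets and render each octet as two hex digits.
my $octets = encode('UTF-8', $expected);
my $hex    = join ' ', unpack('(H2)*', $octets);

print "$hex\n";   # c4 ab c3 a8 c4 ab
```

An e-grave that shows up in the file as the single byte e8 instead of the pair c3 a8 would match the "\xE8" does not map to Unicode message seen when reading the file back.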
But if I first substitute a higher Unicode character after the second i1,
then the conversion succeeds. The problem only occurs when a word with
i-macron + e-grave + i-macron does not legitimately contain such a
character.
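One experiment that might help narrow this down is sketched below. It assumes, without any confirmation, that the failure depends on the string's internal representation changing partway through the substitutions, so it forces the internal UTF-8 form with the built-in utf8::upgrade before any s/// runs. The variable names $word and @codepoints are illustrative.

```perl
use strict;
use warnings;

my $e_grave  = chr(0x00e8);   # e-grave
my $i_macron = chr(0x012b);   # i-macron

my $word = 'i1e4i1';

# Assumption being tested: upgrading the string up front avoids a
# representation change in the middle of the substitutions.
utf8::upgrade($word);

$word =~ s/e4/$e_grave/g;
$word =~ s/i1/$i_macron/g;

# List the resulting code points so the result can be checked
# without relying on how a terminal or editor renders the text.
my @codepoints = map { sprintf 'U+%04X', ord($_) } split //, $word;
print "@codepoints\n";
```

On a Perl where the substitutions behave as intended, this should list U+012B U+00E8 U+012B. Whether it changes the outcome on the affected installation is exactly what the experiment would show.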
I've been struggling with this for about a week trying to pin it down, and
I have found some other combinations that cause this sort of corruption as
well. If you need more information beyond the i1-e4-i1 sample, please let
me know.
Thanks for your time. I'm including my Perl code illustrating this sample
below. It produces three output files for me: outx1.utf8, outx2.utf8, and
outx3.utf8.
Rory
#!/usr/bin/perl
# outf8test2.plx
use warnings;
use strict;
use Encode;
#######################################################################
###
### outf8test2.plx
###
### This script illustrates an odd glitch in the process of encoding
### Unicode characters from marked up Latin characters. The character
### pair e4 should be converted to e-grave, i1 should be converted to
### i-macron, and capital N should be converted to raised-n.
my $e_grave = chr(0x00e8); # Unicode character for e-grave
my $i_macron = chr(0x012b); # Unicode character for i-macron
my $nasal = chr(0x207f); # Unicode character for raised-n
my $utf8; # UTF8 Output file handle
### The character sequence i-macron + e-grave + i-macron outputs to
### unrecognized Unicode. In the output file, e4 changes correctly
### to e-grave, but both i1 => i-macron characters appear as the pair
### A-umlaut + [a double left angle bracket (less-than sign)]
my $inlinex1 = 'i1e4i1'; # i1-e4-i1
$inlinex1 =~ s/e4/$e_grave/g; # Convert e4 to e-grave
$inlinex1 =~ s/i1/$i_macron/g; # Convert i1 to i-macron: fails
my $outfilex1 = 'outx1.utf8';
open($utf8, '>:utf8', $outfilex1) or die $!;
print $utf8 "$inlinex1\n";
close($utf8);
### If I try to read the file just produced, I get an error message:
### utf8 "\xE8" does not map to Unicode at outf8test2.plx line 48.
### Wide character in print at outf8test2.plx line 52.
### [-1/2]\xE8[-1/2]
### ([-1/2] is a single special character.)
my $utf8in;
my $infile = $outfilex1;
my $inline;
open($utf8in, '<:encoding(utf8)', $infile) or die $!;
$inline = readline $utf8in;
$inline =~ s/\s+$//g;
close($utf8in);
print "$inline\n";
### Adding characters before, after, or between the -i1-e4-i1- pairs
### makes no difference. The unrecognized characters for i1 still
### appear instead.
my $inlinex2 = 'ni1qubee4dubai1snare'; # n-i1-qube-e4-duba-i1-snare
$inlinex2 =~ s/e4/$e_grave/g; # Convert e4 to e-grave
$inlinex2 =~ s/i1/$i_macron/g; # Convert i1 to i-macron: fails
my $outfilex2 = 'outx2.utf8';
open($utf8, '>:utf8', $outfilex2) or die $!;
print $utf8 "$inlinex2\n";
close($utf8);
### But if I add a higher Unicode character after the second i1,
### and substitute it before I substitute the i1's, then
### everything comes out fine as expected. I indeed get
### i-macron + e-grave + i-macron + raised-n.
my $inlinex3 = 'i1e4i1N'; # i1-e4-i1-N
$inlinex3 =~ s/N/$nasal/g; # Convert N to raised-n
$inlinex3 =~ s/e4/$e_grave/g; # Convert e4 to e-grave
$inlinex3 =~ s/i1/$i_macron/g; # Convert i1 to i-macron: succeeds
my $outfilex3 = 'outx3.utf8';
open($utf8, '>:utf8', $outfilex3) or die $!;
print $utf8 "$inlinex3\n";
close($utf8);
### I am running this on a Windows Vista Ultimate SP2 platform.
### I am using ActivePerl 5.12.2.
### Encode is either 2.42 (installed) or 2.39, which I can't seem to
### uninstall.
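
To see exactly which bytes a script like the one above wrote, independent of any decoding layer, the output file can be read back in raw mode and dumped as hex. The following is a minimal sketch: hex_dump_file is a hypothetical helper, and the temporary file stands in for outx1.utf8 so that the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return a file's raw bytes as space-separated hex.
sub hex_dump_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                      # slurp the whole file at once
    my $bytes = <$fh>;
    close($fh);
    return join ' ', unpack('(H2)*', $bytes);
}

# Stand-in for outx1.utf8: well-formed UTF-8 for
# i-macron + e-grave + i-macron, followed by a newline.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':raw');          # no newline translation on Windows
print {$tmp_fh} "\xc4\xab\xc3\xa8\xc4\xab\n";
close($tmp_fh);

my $dump = hex_dump_file($tmp_path);
print "$dump\n";   # c4 ab c3 a8 c4 ab 0a
```

Calling hex_dump_file('outx1.utf8') on the real output would show whether the e-grave was written as c3 a8 or as a lone e8 byte.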