Subject: | Wrong encode_utf8 result for utf8 string |
Hello.
Detailed info about the module, perl and OS versions is below.
Short description of the issue:
$a = "äöüéà"; # is the utf8 string with non English letters in source code
Encode::encode_utf8( $a ); # returns wrong result
Please take a look at the following info to confirm.
Full description of the issue:
I am using a perl script saved in utf8 encoding that contains predefined
non-English strings. I.e. all predefined non-English strings are stored
in the source code as utf8 bytes.
For example,
$a = "äöüéà";
i.e. it's the same as
$a = "\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}";
i.e. it's really a "äöüéà" in utf8 charset ;)
To make sure that this script works on a "non utf8" system (for
example, on latin1, i.e. ISO-8859-1) and sends correct data via the
network, I encode the predefined variable to utf8 with Encode::encode_utf8.
But the resulting string is not the expected utf8 string. I.e. the result
differs from the predefined string in the utf8 source code.
I.e. "äöüéà" is encoded to "Ã¤Ã¶Ã¼Ã©Ã " ;)
I.e. "\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}" (10
bytes) becomes
"\x{c3}\x{83}\x{c2}\x{a4}\x{c3}\x{83}\x{c2}\x{b6}\x{c3}\x{83}\x{c2}\x{bc}\x{c3}\x{83}\x{c2}\x{a9}\x{c3}\x{83}\x{c2}\x{a0}"
(20 bytes)
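The doubling from 10 to 20 bytes can be reproduced without any non-ASCII characters in the source. The following is a minimal sketch, assuming the literal reaches encode_utf8 as 10 separate bytes (which is what happens when the script does not declare `use utf8`). In that case encode_utf8 sees each byte as one character in the range U+0080..U+00FF and writes it out as two bytes. The helper name `hex_dump` is made up for this illustration.

```perl
use strict;
use warnings;
use Encode ();

# The 10 raw bytes from the report, written as escapes so that the
# encoding of this file does not matter.
my $bytes = "\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa9\xc3\xa0";

# Each of the 10 bytes is treated as one character (U+00C3, U+00A4, ...)
# and every such character needs two bytes in utf8.
my $double = Encode::encode_utf8($bytes);

# Hypothetical helper: show every byte of a byte string as \x{..}.
sub hex_dump {
    my ($s) = @_;
    return join '', map { sprintf('\\x{%02x}', ord($_)) } split //, $s;
}

print "input length:  ", length($bytes),  "\n";   # 10
print "output length: ", length($double), "\n";   # 20
print hex_dump($double), "\n";
```

The dump printed on the last line starts with \x{c3}\x{83}\x{c2}\x{a4}, i.e. the same sequence as in the 20 byte result above.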
But the result of Encode::encode_utf8 should be the same as the input.
I.e. "äöüéà" should be encoded to "äöüéà".
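For comparison, here is a sketch of the case where the output does equal the original 10 bytes. It rests on an assumption that is not part of the script in this report: the bytes are first turned into a 5 character string with Encode::decode_utf8 (declaring `use utf8` for the source literal should have a similar effect). Only then does encode_utf8 give back the same 10 bytes.

```perl
use strict;
use warnings;
use Encode ();

# The 10 utf8 bytes of the predefined string, written as escapes.
my $octets = "\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa9\xc3\xa0";

# Assumption for this sketch: decode the bytes to characters first.
my $chars = Encode::decode_utf8($octets);

# Encoding the 5 characters gives the original 10 bytes again.
my $encoded = Encode::encode_utf8($chars);

printf "characters: %d\n", length $chars;     # 5
printf "bytes:      %d\n", length $encoded;   # 10
print  "same as the original bytes\n" if $encoded eq $octets;
```

So whether encode_utf8 leaves the data "unchanged" depends on whether its input is already a character string or still a sequence of utf8 bytes.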
Please let me know if I should explain this (i.e. why a utf8 string should
not be changed after encoding to utf8).
Could you please correct this issue? Or please let me know if you need
more info about it.
Thank you
P.S. I am adding the perl script and its log to illustrate this issue.
P.P.S. Here is the detailed info for investigation:
Module version:
cpan> i Encode
Module id = Encode
... (missed)
CPAN_VERSION 2.43
... (missed)
INST_VERSION 2.39
Reproduced on OS (uname -a):
Oracle Enterprise Linux 5.4 x86_64
Linux ... 2.6.18-164.0.0.0.1.el5 #1 SMP Thu Sep 3 00:21:28 EDT 2009
x86_64 x86_64 x86_64 GNU/Linux
Fedora 14 Linux x86_64
Linux ... 2.6.35.13-92.fc14.x86_64 #1 SMP Sat May 21 17:26:25 UTC 2011
x86_64 x86_64 x86_64 GNU/Linux
Fedora 14 Linux i386
Linux ... 2.6.35.13-92.fc14.i686 #1 SMP Sat May 21 17:39:42 UTC 2011
i686 i686 i386 GNU/Linux
Perl version (i.e. perl -v):
on Oracle Enterprise Linux 5.4 x86_64:
This is perl, v5.8.8 built for x86_64-linux-thread-multi
on Fedora 14 Linux x86_64:
This is perl 5, version 12, subversion 3 (v5.12.3) built for
x86_64-linux-thread-multi
on Fedora 14 Linux i386:
This is perl 5, version 12, subversion 3 (v5.12.3) built for
i386-linux-thread-multi
Please note that locale reports utf8 too:
$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
Subject: | utf8_to_utf8.pl |
#!/usr/bin/perl
use strict;
use Encode;
use bytes; # to show length in bytes (i.e. not in characters)
# Print a string, its length in bytes and every byte as \x{..}
sub fcStringShow {
    my $v = shift;
    print "String: {{" . $v . "}}\n";
    print "\tlength = " . length( $v ) . "\n";
    print "\t";
    my $i = 0;
    while ( $i < length( $v ) ) {
        # take exactly one byte at position $i
        printf "\\x{%.2x}" , ord( substr( $v , $i , 1 ) );
        $i = $i + 1;
    }
    print "\n";
}

# Show a string before and after Encode::encode_utf8
sub fcTest {
    my $a = shift;
    print "Not encoded string is: " . $a . "\n";
    fcStringShow( $a );
    my $b = Encode::encode_utf8( $a );
    print "Encoded string is: " . $b . "\n";
    fcStringShow( $b );
    print "\n";
}
fcTest( "abcde" );
fcTest( "äöüéà " );
Subject: | utf8_to_utf8.log |
Message body not shown because it is not plain text.