Bug #69326 for Encode: Wrong encode_utf8 result for utf8 string

Wed Jul 06 22:41:29 2011 marbuga [...] gmail.com - Ticket created

Subject:

Wrong encode_utf8 result for utf8 string

Hello. Detailed information about the module, perl, and OS versions is below.

Short description of the issue:

    $a = "äöüéà"; # a UTF-8 string with non-English letters in the source code
    Encode::encode_utf8( $a ); # returns the wrong result

Please take a look at the following information to confirm.

Full description of the issue:

I am using a Perl script saved in UTF-8 encoding with predefined non-English strings, i.e. all predefined non-English strings hold their characters as UTF-8 bytes. For example,

    $a = "äöüéà";

is the same as

    $a = "\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}";

i.e. it really is "äöüéà" in the UTF-8 charset ;)

To make sure that this script works on a non-UTF-8 system (for example Latin-1, i.e. ISO-8859-1) and sends correct data over the network, I encode the predefined variable to UTF-8 with Encode::encode_utf8. The resulting string is not a UTF-8 string, i.e. the result differs from the predefined string in the UTF-8 source code: "äöüéà" is encoded to "Ã¤Ã¶Ã¼Ã©Ã " ;) I.e.

    "\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}" (10 bytes)

becomes

    "\x{c3}\x{83}\x{c2}\x{a4}\x{c3}\x{83}\x{c2}\x{b6}\x{c3}\x{83}\x{c2}\x{bc}\x{c3}\x{83}\x{c2}\x{a9}\x{c3}\x{83}\x{c2}\x{a0}" (20 bytes)

But the result of Encode::encode_utf8 should be the same, i.e. "äöüéà" should be encoded to "äöüéà". Please let me know if I should explain this (i.e. why a UTF-8 string should not change after encoding to UTF-8).

Could you please correct this issue, or let me know if you need more information about it.

Thank you

P.S. I am attaching the Perl script and its log to illustrate this issue.

P.P.S. Here is the detailed information for investigation:

Module version:

cpan> i Encode

Module id = Encode
... (missed)
CPAN_VERSION 2.43
... (missed)
INST_VERSION 2.39

Reproduced on OS (uname -a):

Oracle Enterprise Linux 5.4 x86_64
Linux ... 2.6.18-164.0.0.0.1.el5 #1 SMP Thu Sep 3 00:21:28 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Fedora 14 Linux x86_64
Linux ... 2.6.35.13-92.fc14.x86_64 #1 SMP Sat May 21 17:26:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Fedora 14 Linux i386
Linux ... 2.6.35.13-92.fc14.i686 #1 SMP Sat May 21 17:39:42 UTC 2011 i686 i686 i386 GNU/Linux

Perl version (i.e. perl -v):

on Oracle Enterprise Linux 5.4 x86_64:
This is perl, v5.8.8 built for x86_64-linux-thread-multi

on Fedora 14 Linux x86_64:
This is perl 5, version 12, subversion 3 (v5.12.3) built for x86_64-linux-thread-multi

on Fedora 14 Linux i386:
This is perl 5, version 12, subversion 3 (v5.12.3) built for i386-linux-thread-multi

Please note that locale reports utf8 too:

$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

Subject:

utf8_to_utf8.pl

#!/usr/bin/perl

use strict;
use Encode;
use bytes; # to show length in bytes (i.e. not in characters)

sub fcStringShow {
    my $v = shift;
    print "String: {{" . $v . "}}\n";
    print "\tlength = " . length( $v ) . "\n";
    print "\t";
    my $i = 0;
    while ( $i < length( $v ) ) {
        printf "\\x{%.2x}" , ord( substr( $v , $i ) );
        $i = $i + 1;
    }
    print "\n";
}

sub fcTest {
    my $a = shift;
    print "Not encoded string is: " . $a . "\n";
    fcStringShow( $a );
    my $b = Encode::encode_utf8( $a );
    print "Encoded string is: " . $b . "\n";
    fcStringShow( $b );
    print "\n";
}

fcTest( "abcde" );
fcTest( "äöüéà" );

Subject:

utf8_to_utf8.log

Wed Jul 06 22:57:18 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

I think that the following information can help you find the root of the issue.

I used iconv (from C++) to convert the "äöüéà" string. I think the reason for the wrong Encode::encode_utf8 conversion is this: Encode::encode_utf8 "thinks" that the UTF-8 string "äöüéà" is a Latin-1 (i.e. ISO-8859-1) string, and that is why it converts it to the wrong UTF-8 string (i.e. not to "äöüéà").

Please take a look at the following output to confirm:

Start conversion from UTF-8 charset to UTF-8 charset ...
String (which must be converted): äöüéà
String (which must be converted) - hex data:
0xC3 , 0xA4 , 0xC3 , 0xB6 , 0xC3 , 0xBC , 0xC3 , 0xA9 , 0xC3 , 0xA0 , (10 bytes)
i.e. in decimal:
195 , 164 , 195 , 182 , 195 , 188 , 195 , 169 , 195 , 160 , (10 bytes)
i.e. perl string:
\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0} (10 bytes)
This is a REVERSIBLE conversion !!!
Result string is: äöüéà
Result string - hex data:
0xC3 , 0xA4 , 0xC3 , 0xB6 , 0xC3 , 0xBC , 0xC3 , 0xA9 , 0xC3 , 0xA0 , (10 bytes)
i.e. in decimal:
195 , 164 , 195 , 182 , 195 , 188 , 195 , 169 , 195 , 160 , (10 bytes)
i.e. perl string:
\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0} (10 bytes)

Start conversion from UTF-8 charset to ISO-8859-1 charset ...
String (which must be converted): äöüéà
String (which must be converted) - hex data:
0xC3 , 0xA4 , 0xC3 , 0xB6 , 0xC3 , 0xBC , 0xC3 , 0xA9 , 0xC3 , 0xA0 , (10 bytes)
i.e. in decimal:
195 , 164 , 195 , 182 , 195 , 188 , 195 , 169 , 195 , 160 , (10 bytes)
i.e. perl string:
\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0} (10 bytes)
This is a REVERSIBLE conversion !!!
Result string is: ��
Result string - hex data:
0xE4 , 0xF6 , 0xFC , 0xE9 , 0xE0 , (5 bytes)
i.e. in decimal:
228 , 246 , 252 , 233 , 224 , (5 bytes)
i.e. perl string:
\x{e4}\x{f6}\x{fc}\x{e9}\x{e0} (5 bytes)

Start conversion from ISO-8859-1 charset to UTF-8 charset ...
String (which must be converted): äöüéà
String (which must be converted) - hex data:
0xC3 , 0xA4 , 0xC3 , 0xB6 , 0xC3 , 0xBC , 0xC3 , 0xA9 , 0xC3 , 0xA0 , (10 bytes)
i.e. in decimal:
195 , 164 , 195 , 182 , 195 , 188 , 195 , 169 , 195 , 160 , (10 bytes)
i.e. perl string:
\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0} (10 bytes)
This is a REVERSIBLE conversion !!!
Result string is: Ã¤Ã¶Ã¼Ã©Ã 
Result string - hex data:
0xC3 , 0x83 , 0xC2 , 0xA4 , 0xC3 , 0x83 , 0xC2 , 0xB6 , 0xC3 , 0x83 , 0xC2 , 0xBC , 0xC3 , 0x83 , 0xC2 , 0xA9 , 0xC3 , 0x83 , 0xC2 , 0xA0 , (20 bytes)
i.e. in decimal:
195 , 131 , 194 , 164 , 195 , 131 , 194 , 182 , 195 , 131 , 194 , 188 , 195 , 131 , 194 , 169 , 195 , 131 , 194 , 160 , (20 bytes)
i.e. perl string:
\x{c3}\x{83}\x{c2}\x{a4}\x{c3}\x{83}\x{c2}\x{b6}\x{c3}\x{83}\x{c2}\x{bc}\x{c3}\x{83}\x{c2}\x{a9}\x{c3}\x{83}\x{c2}\x{a0} (20 bytes)

Sincerely, Maryan Bahnyuk (AKA marbug)
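For comparison, the same three conversions can be reproduced with Encode's own decode and encode functions. This is only a minimal sketch and is not one of the attached scripts: the helper name convert is hypothetical, and the input is written with \x escapes so the result does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: read $octets as charset $from, then write them as $to.
sub convert {
    my ($from, $to, $octets) = @_;
    my $chars = Encode::decode($from, $octets);   # octets -> characters
    return Encode::encode($to, $chars);           # characters -> octets
}

# The UTF-8 encoding of "äöüéà" (10 octets).
my $input = "\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa9\xc3\xa0";

my $same   = convert('UTF-8',      'UTF-8',      $input);
my $latin1 = convert('UTF-8',      'ISO-8859-1', $input);
my $double = convert('ISO-8859-1', 'UTF-8',      $input);

printf "UTF-8 to UTF-8:      %d bytes\n", length $same;
printf "UTF-8 to ISO-8859-1: %d bytes\n", length $latin1;
printf "ISO-8859-1 to UTF-8: %d bytes\n", length $double;
```

The byte counts should match the iconv log (10, 5 and 20), and the third result is the same double-encoded sequence that encode_utf8 produced in the original report.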

Thu Jul 07 01:16:54 2011 DANKOGAI [...] cpan.org - Status changed from 'new' to 'resolved'

Thu Jul 07 02:28:43 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

Sorry, I forgot to note: reproduced on 2.43 too.

cpan> i Encode

... (missed)
CPAN_VERSION 2.43
... (missed)
INST_VERSION 2.43

Sincerely, Maryan Bahnyuk

Thu Jul 07 02:28:43 2011 The RT System itself - Status changed from 'resolved' to 'open'

Thu Jul 07 06:01:22 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

> I use the iconv (on c++) to convert "äöüéà" string.

Here is the same in Perl. The files are attached.

Sincerely, Maryan Bahnyuk (AKA marbug)

Subject:

iconv.log

Subject:

iconv.pl

#!/usr/bin/perl

use Text::Iconv;
use strict;

my $a = "äöüéà";

sub fcStringShow {
    my $v = shift;
    print "String: {{" . $v . "}}\n";
    print "\tlength = " . length( $v ) . "\n";
    print "\t";
    my $i = 0;
    while ( $i < length( $v ) ) {
        printf "\\x{%.2x}" , ord( substr( $v , $i ) );
        $i = $i + 1;
    }
    print "\n";
}

fcStringShow( $a );

# $converter and $v are declared once and reused for each conversion
print "\nUTF-8 to UTF8\n";
my $converter = Text::Iconv->new( "UTF-8", "UTF-8" );
my $v = $converter->convert( $a );
fcStringShow( $v );

print "\nUTF-8 to Latin1 (ISO-8859-1)\n";
$converter = Text::Iconv->new( "UTF-8" , "ISO-8859-1" );
$v = $converter->convert( $a );
fcStringShow( $v );

print "\nLatin1 (ISO-8859-1) to UTF-8\n";
$converter = Text::Iconv->new( "ISO-8859-1" , "UTF-8" );
$v = $converter->convert( $a );
fcStringShow( $v );

Thu Jul 07 06:06:24 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

> > I use the iconv (on c++) to convert "äöüéà" string.

> > Here is the same on perl. Files are attached.

Also I'd like to note that I am familiar with the docs:

CAVEAT: When you run $octets = encode("utf8", $string) , then $octets may not be equal to $string. Though they both contain the same data, the UTF8 flag for $octets is always off.

Sorry, but this statement looks very "strange" and "wrong", because the results are unexpected.

Sincerely, Maryan Bahnyuk (AKA marbug)
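What the quoted CAVEAT describes can be checked directly. The following is a minimal sketch with made-up sample strings, not something taken from the Encode documentation: for pure ASCII input the encoded octets compare equal to the original string, for anything else they do not, and in both cases the UTF8 flag on the result is off.

```perl
use strict;
use warnings;
use Encode ();

my @report;
for my $string ("abcde", "\x{e4}\x{f6}") {    # ASCII only, then U+00E4 U+00F6
    my $octets = Encode::encode('utf8', $string);
    push @report, {
        equal  => ($octets eq $string      ? 1 : 0),
        flag   => (utf8::is_utf8($octets)  ? 1 : 0),
        length => length $octets,
    };
    printf "equal=%d utf8_flag=%d length=%d\n",
        @{ $report[-1] }{qw(equal flag length)};
}
```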

Thu Jul 07 06:21:41 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

As I see, my files are corrupted after download. Here are archived copies of them.

Subject:

utf8_to_utf8.zip

Subject:

iconv.tar.gz

Subject:

iconv.zip

Subject:

utf8_to_utf8.tar.gz

Thu Jul 07 19:33:15 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

> CAVEAT: When you run $octets = encode("utf8", $string) , then $octets
> may not be equal to $string. Though they both contain the same data, the
> UTF8 flag for $octets is always off.
>
> Sorry, but this statement looks very "strange" and "wrong", because the
> results are unexpected.

Let's look at this in more detail.

When I read/write a perl variable from/to a file or send/receive it over the network without your Encode::encode method, all data is passed as it is, i.e. perl uses the data exactly as it appears in the source code. If the source code is in the utf8 charset, the variable contains utf8 bytes. If the source code is not utf8, the perl variable contains the corresponding data. I.e. the "internal" perl characters depend on the system locale. Please correct me if I am wrong.

In that case, if I understand your Encode::encode method correctly, it should check the "perl" charset first, and if the "perl" (i.e. variable) charset is utf8, Encode::encode_utf8 should return the same data. Am I wrong?

Thu Jul 07 19:43:26 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

> In such case, if I understand your Encode::encode method correctly, it
> should check "perl" charset first and if "perl" (i.e. variable) charset
> is utf8, Encode::encode_utf8 should return the same data.

Theoretically, your methods need not check the variable's encoding (if that is quite hard). They could use perl's locale methods (i.e. "man perllocale") to determine the current charset of variables, or offer something like Encode::set_default_charset( 'utf8' ) to define the current charset. In that case all your methods could use its value for conversion, i.e. if Encode::set_default_charset sets utf8, Encode::encode_utf8 would return the same variable.

Don't you think that in such a case (i.e. with Encode::set_default_charset or perllocale) it would be more correct to work with your methods?

Thu Jul 07 19:59:56 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

> I.e. if Encode::set_default_charset sets utf8, Encode::encode_utf8
> will return the same variable.

I think that in such a case anyone could use your decode method to convert predefined variables from the source code charset to perl's current charset, i.e. nobody would need to convert all the source code to the local charset. Source code written in the utf8 charset could then run without modification on a system with (for example) the latin1 charset.

Mon Sep 12 04:49:50 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

Hello. As I see this question is not interesting for you. Could you please close this ticket. Thanks and good luck.

Sat Nov 12 07:03:22 2011 chansen [...] cpan.org - Correspondence added

Hello Maryan,

I'm not sure what you are proposing, but the output of your utf8_to_utf8.pl script is expected. Perl has two different representations of strings:

    use Encode qw[];
    use Test::More tests => 3;

    my $string1 = "\xE5\xE4\xF6";
    my $string2 = Encode::decode('ISO-8859-1', $string1);

    ok(utf8::is_utf8($string1) ne utf8::is_utf8($string2),
       "strings use different internal representations");

    is($string1, $string2,
       "strings are equal regardless of internal character representation");

    my $octets1 = Encode::encode_utf8($string1);
    my $octets2 = Encode::encode_utf8($string2);

    is($octets1, $octets2,
       "encoded strings are equal regardless of internal character representation");

Perl's internal character representation is an internal matter. Either make your program character-aware or use octets; mixing characters with octets isn't a good idea. Decode early and encode late.

--
chansen

Tue Nov 22 20:50:00 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

Dear Chansen,

> I'm not sure what you are proposing, but the output of your
> utf8_to_utf8.pl script is expected.

Sorry. It looks like my explanation skills are poor.

Can we check my conclusions step by step, please? In my humble opinion this might help.

First, could you please take a look at the following simple examples and confirm whether the issue "is" or "is not" present.

vvvvvvvvvvvvvvv test1.pl

#!/usr/bin/perl

$a = "äöüéà";
$l = length( $a );
printf( "length of \$a is " . $l . "\n" );
$i = 0;
while ( $i < $l ) {
    printf "\\x{%.2x}" , ord( substr( $a , $i ) );
    $i++;
}
print "\n";

^^^^^^^^^^^^^^^ test1.pl

This source code is written in the UTF8 charset and its output is the following:

vvvvvvvvvvvvvvv test1.pl output

length of $a is 10
\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}

^^^^^^^^^^^^^^^ test1.pl output

I.e. the value of the $a variable is already present as octets. Am I right? Please take a look at test1.png to confirm.

Let's take a look at the same script, slightly modified to use encode_utf8:

vvvvvvvvvvvvvvv test2.pl

#!/usr/bin/perl

use Encode qw[];

sub show {
    my $t = shift;
    printf( "value: " . $t . "\n" );
    my $l = length( $t );
    printf( "length of $t is " . $l . "\n" );
    $i = 0;
    while ( $i < $l ) {
        printf "\\x{%.2x}" , ord( substr( $t , $i ) );
        $i++;
    }
    print "\n";
}

$a = "äöüéà";
show( $a );
$o = Encode::encode_utf8( $a );
show( $o );

^^^^^^^^^^^^^^^ test2.pl

The output is the following:

vvvvvvvvvvvvvvv test2.pl output

value: äöüéà
length of äöüéà is 10
\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}
value: Ã¤Ã¶Ã¼Ã©Ã 
length of Ã¤Ã¶Ã¼Ã©Ã  is 20
\x{c3}\x{83}\x{c2}\x{a4}\x{c3}\x{83}\x{c2}\x{b6}\x{c3}\x{83}\x{c2}\x{bc}\x{c3}\x{83}\x{c2}\x{a9}\x{c3}\x{83}\x{c2}\x{a0}

^^^^^^^^^^^^^^^ test2.pl output

I.e. encode_utf8 does something unexpected: it converts the octets that are already present into other, "strange" octets. Is this correct?

Let's take a look at the same script with the following additional line:

> Encode::_utf8_on( $a );

to confirm:

vvvvvvvvvvvvvvv test3.pl

#!/usr/bin/perl

use Encode qw[];

sub show {
    my $t = shift;
    printf( "value: " . $t . "\n" );
    my $l = length( $t );
    printf( "length of $t is " . $l . "\n" );
    $i = 0;
    while ( $i < $l ) {
        printf "\\x{%.2x}" , ord( substr( $t , $i ) );
        $i++;
    }
    print "\n";
}

$a = "äöüéà";
show( $a );
Encode::_utf8_on( $a );
$o = Encode::encode_utf8( $a );
show( $o );

^^^^^^^^^^^^^^^ test3.pl

Now the output is correct:

vvvvvvvvvvvvvvv test3.pl output

value: äöüéà
length of äöüéà is 10
\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}
value: äöüéà
length of äöüéà is 10
\x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}

^^^^^^^^^^^^^^^ test3.pl output

Could you please let me know:
1) whether the absence of the utf8 flag on a UTF8 STRING is correct behavior
2) whether I should set the utf8 flag on utf8 strings before calling encode_utf8?

I think that both answers are "NO", but I might be wrong. Please correct me.

> Perl's internal character representation is a internal matter.

Surely I agree with this.

> mixing characters with octets isn't a good idea.
> decode early and encode late

Do you mean that encode_utf8 returns internal data, which should be used only for passing to decode_utf8 and vice versa?

Right now I can't imagine how to use these wrong "octets" (i.e. the octets returned by encode_utf8) for any other purpose. Could you please give some examples if I am wrong.

> CAVEAT: When you run $octets = encode("utf8", $string) ,
> then $octets may not be equal to $string. Though they both
> contain the same data, the UTF8 flag for $octets is always off.

These words look like a misprint because:

> $octets may not be equal to $string

1) any utf8 string with non-English characters will ALWAYS be unequal to $string after conversion to $octets (see my examples)

> Though they both contain the same data

2) if $string has non-English characters, then $string and $octets will NOT contain the same data IN ANY CASE (because the utf8 flag is not set)

Am I wrong?

I hope that my explanation is clear and that I have not irritated you with such "robotized" language.

Sincerely, Maryan

Tue Nov 22 20:51:46 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

Dear Chansen,

My examples from the previous post are attached.

Sincerely, Maryan

Subject:

examples.tar.gz

Wed Nov 23 09:26:16 2011 chansen [...] cpan.org - Correspondence added

Hi Maryan,

On Tue, 22 Nov 2011 at 20:50:00, marbug wrote:

> vvvvvvvvvvvvvvv test1.pl output
>
> length of $a is 10
> \x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}
>
> ^^^^^^^^^^^^^^^ test1.pl output
>
> I.e. $a variable value is already present as octets. Am I right?

Yes, $a is an octet string encoded in UTF-8 encoding form.

> vvvvvvvvvvvvvvv test2.pl output
>
> value: äöüéà
> length of äöüéà is 10
> \x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}
> value: Ã¤Ã¶Ã¼Ã©Ã 
> length of Ã¤Ã¶Ã¼Ã©Ã  is 20
> \x{c3}\x{83}\x{c2}\x{a4}\x{c3}\x{83}\x{c2}\x{b6}\x{c3}\x{83}\x{c2}\x{bc}\x{c3}\x{83}\x{c2}\x{a9}\x{c3}\x{83}\x{c2}\x{a0}
>
> ^^^^^^^^^^^^^^^ test2.pl output

> I.e. encode_utf8 makes the unexpected action: it converts already
> present octets to another "strange" octets. Is this correct?

Yes, it's correct. In my previous reply I mentioned that Perl has two different internal representations for strings. SvUTF8 indicates which one is used; if it's off, ISO-8859-1 (aka Latin1) is assumed.

    $a = "äöüéà";
    $o = Encode::encode_utf8($a);

in this case it's equivalent to:

    $o = Encode::encode_utf8(Encode::decode('ISO-8859-1', $a));
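The equivalence stated above can be verified with a few lines. This is a minimal sketch; the octets are spelled out with \x escapes (an assumption made here so the check does not depend on how the script file is saved) and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

# Ten octets: the UTF-8 encoding of "äöüéà".
my $octets = "\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa9\xc3\xa0";

# No decode step: each octet is taken as a code point U+0000..U+00FF.
my $direct = Encode::encode_utf8($octets);

# The same thing with the Latin-1 interpretation made explicit.
my $explicit = Encode::encode_utf8(Encode::decode('ISO-8859-1', $octets));

printf "direct:   %d bytes\n", length $direct;
printf "explicit: %d bytes\n", length $explicit;
print  "identical results\n" if $direct eq $explicit;
```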

> Let's take a look on the same script with the following additional
> line:

> > Encode::_utf8_on( $a );

> to ensure:
>
> Now the output is correct:
>
> vvvvvvvvvvvvvvv test3.pl output
>
> value: äöüéà
> length of äöüéà is 10
> \x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}
> value: äöüéà
> length of äöüéà is 10
> \x{c3}\x{a4}\x{c3}\x{b6}\x{c3}\x{bc}\x{c3}\x{a9}\x{c3}\x{a0}
>
> ^^^^^^^^^^^^^^^ test3.pl output
>
> Could you please let me know:
> 1) if utf8 flag absence for UTF8 STRING is a correct behavior

Yes, encode_utf8() expects a character string and returns an octet string.

> 2) should I handle utf8 flag for utf8 strings before calling
> encode_utf8?

No, you should decode your data as early as possible and encode it as late as possible.

    $string = Encode::decode_utf8($input);
    # ↓
    # process your data
    # ↓
    $output = Encode::encode_utf8($string);
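A concrete run of that decode, process, encode pattern might look like the sketch below. The input octets and the choice of uc as the processing step are assumptions made for illustration only; the point is that the middle step sees five characters rather than ten octets.

```perl
use strict;
use warnings;
use Encode ();

# Input as it would arrive from a file or socket: UTF-8 octets for "äöüéà".
my $input = "\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa9\xc3\xa0";

my $string = Encode::decode_utf8($input);    # decode early: 5 characters
my $upper  = uc $string;                     # process characters, not octets
my $output = Encode::encode_utf8($upper);    # encode late: octets again

printf "characters: %d, output octets: %d\n", length $string, length $output;
printf "%s\n", join ' ', map { sprintf '%02x', ord } split //, $output;
```

Calling uc on the undecoded $input instead would leave the accented letters untouched on the reporter's perl versions, because each octet would be handled on its own.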

> > Perl's internal character representation is a internal matter.

> Surely I agree with this.

> > mixing characters with octets isn't a good idea.
> > decode early and encode late

> Do you mean that encode_utf8 returns the internal data, which should be
> used only for passing to decode_utf8 and vice versa?

No, encode_utf8() returns an octet string in UTF-8 encoding form and decode_utf8() decodes an octet string in UTF-8 encoding form.

> Now I can't imagine how to use these wrong "octets" (i.e. octets which
> are returned by encode_utf8) for another purpose. Could you please give
> some examples if I am wrong.

    while (<STDIN>) {
        $string = Encode::decode_utf8($_);     # input
        $string = lc $string;                  # process
        print Encode::encode_utf8($string);    # output
    }

> > CAVEAT: When you run $octets = encode("utf8", $string) ,
> > then $octets may not be equal to $string. Though they both
> > contain the same data, the UTF8 flag for $octets is always off.

> These words looks like a misprint because:

> > $octets may not be equal to $string

> 1) any utf8 string with non English characters after conversion to
> $octets will be ALWAYS not equal to $string (see my examples)

Correct.

> > Though they both contain the same data

> 2) if $string has the non English characters then $string and $octets
> will NOT contain the same data IN ANY CASE (because utf8 flag is not
> set)

Perhaps s/contain/represent/ would make it clearer, but I think it should be removed from the documentation.

> Am I wrong?

No.

--
chansen

Sat Dec 24 18:12:31 2011 marbuga [...] gmail.com - Correspondence added

From:

marbuga [...] gmail.com

Dear Chansen,

First of all, I'd like to thank you for your patience.

Second ...

> > I.e. encode_utf8 makes the unexpected action: it converts already
> > present octets to another "strange" octets. Is this correct?

> Yes it's correct. In my previous reply I mentioned that Perl has two
> different internal representations for strings. SvUTF8 indicates which
> one is used, if it's off ISO-8859-1 (aka Latin1) is assumed.

Thank you. I tried to explain this in one of my previous messages, but as I see, my attempt was unsuccessful because of my poor explanation skills.

> $a = "äöüéà";
> $o = Encode::encode_utf8($a);
>
> in this case it's equivalent to:
>
> $o = Encode::encode_utf8(Encode::decode('ISO-8859-1', $a));

I tried to explain that ISO-8859-1 should not be used here because the source code uses the UTF8 charset. I.e. this is the root of the issue: the Encode module assumes that ISO-8859-1 is the default charset, but that is not correct... IMHO

I tried to explain this, but without success.

In other words, this root cause leads to the following errors: some developers use this default 'ISO-8859-1' charset for double conversion in SOAP requests.

I understand that this is not a question for your module, but your module allows it because it assumes the 'ISO-8859-1' charset as the default.

Could you please explain why ISO-8859-1 is used when it has no relation to any object, setting, etc.?

Surely I do not expect an explanation of perl's internals. I just mean the main purpose of the Encode module.

Should the Encode module just convert data between charsets, or should it ignore the utf8 flag and assume that the string's charset is 'ISO-8859-1' when the string's charset is 'utf8'?

I just mean that, when the source code's charset is UTF-8, the code

    $a = "äöüéà";
    $o = Encode::encode_utf8($a);

should return $o = $a = "äöüéà";

I.e. could you please explain the relation of the 'ISO-8859-1' charset to this 'encoding'?

I hope that I have shown the main idea of my previous messages. If not, please close this ticket. I don't want to irritate you with my bad English and my bad explanation skills.

Thank you. And good luck.

Sat Dec 24 19:49:32 2011 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Sat Dec 24 18:12:31 2011, marbug wrote:

> Dear Chansen
>
> First of all I'd like to thank you for patience.
>
> Second ...

> > > I.e. encode_utf8 makes the unexpected action: it converts already
> > > present octets to another "strange" octets. Is this correct?

> > Yes it's correct. In my previous reply I mentioned that Perl has two
> > different internal representations for strings. SvUTF8 indicates which
> > one is used, if it's off ISO-8859-1 (aka Latin1) is assumed.

> Thank you. I tried to explain this in one of my previous messages. But
> as I see my try has been unsuccessful in cause of my low explanation skills.

> > $a = "äöüéà";
> > $o = Encode::encode_utf8($a);
> >
> > in this case it's equivalent to:
> >
> > $o = Encode::encode_utf8(Encode::decode('ISO-8859-1', $a));

> I tried to explain that ISO-8859-1 should not be used here because the
> source code uses UTF8 charset. I.e. it's the root of the issue: Encode
> module assumes that ISO-8859-1 is a default charset but it's not
> correct... IMHO
>
> I tried to explain this but there was no success.
>
> In other words this root of the issue causes the following errors: some
> developers users this default 'ISO-8859-1' charset for double conversion
> in SOAP requests.
>
> I understand that it's not a question to your module, but your module
> allows this because it assumes 'ISO-8859-1' charset as default.
>
> Could you please explain why ISO-8859-1 is used when it has no relation
> to any object or setting or etc?
>
> Surely I do not expect the explanation of perl's internals. I just main
> the main purpose of Encode module.
>
> Should Encode module just convert data between charsets or should it
> ignore the utf8 flag and assume that string's charset is 'ISO-8859-1'
> when the string's charset is 'utf8'?

The encode_utf8 and decode_utf8 functions don’t convert between octet encodings. They convert between raw utf8 byte sequences and Unicode strings.

By raw utf8 byte sequences, I mean that, if $a contains ā, length($a) will give 2, and ord($a) will give 0xc4, because $a contains two octets (i.e., "\xc4\x81") to represent that Unicode character.

By Unicode strings, I mean that length($a) will give 1 and ord($a) will return 257, because that character is one logical unit in the string.

encode_utf8 converts from the latter format to the former. So if you feed it "\xc4\x81", it can only assume you mean the Unicode characters U+00C4 and U+0081. So it produces ‘double’ encoding.

encode_utf8 appears to be assuming ISO-8859-1, only as a side effect of the first 256 Unicode characters being identical to ISO-8859-1.

So using encode_utf8 on a string that is already in an octet encoding (i.e., a sequence of octets) is almost always wrong.

A couple of examples (assume these scripts are in utf8):

    #!perl
    use Encode;
    $x = 'ā'; # equivalent to "\xc4\x81"
    # encode_utf8() and encode() *always* take Unicode
    # arguments, so "\xc4\x81" is treated as U+00C4 U+0081.
    $y = encode_utf8 $x; # wrong
    # $y is now "\xC3\x84\xC2\x81"

(I.e., $x is already encoded in utf8, so you don’t have to do anything to it.)

    #!perl
    use Encode;
    use utf8; # Now Perl decodes the script itself
    $x = 'ā'; # equivalent to "\x{101}"
    $y = encode_utf8 $x; # treated as U+0101
    # $y is now "\xc4\x81"

Does that make things clear?
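The length and ord figures quoted above can be checked without relying on the encoding of the script file. In this minimal sketch the two octets are written with escapes and decode_utf8 stands in for the effect that ‘use utf8’ has on a literal; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $raw   = "\xc4\x81";                    # raw utf8 byte sequence for ā
my $chars = Encode::decode_utf8($raw);     # Unicode string: one character

printf "raw:   length=%d ord=0x%x\n", length $raw,   ord $raw;
printf "chars: length=%d ord=%d\n",   length $chars, ord $chars;

my $double = Encode::encode_utf8($raw);    # treated as U+00C4 U+0081
my $single = Encode::encode_utf8($chars);  # treated as U+0101

printf "double: %s\n", join ' ', map { sprintf '%02x', ord } split //, $double;
printf "single: %s\n", join ' ', map { sprintf '%02x', ord } split //, $single;
```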

> I just mean that UTF-8 charset of sourse code should assume that
>
> $a = "äöüéà";
> $o = Encode::encode_utf8($a);
>
> code should return $o = $a = "äöüéà";
>
> when source code's charset is UTF-8.

Did you ‘use utf8’?