Bug #54369 for Digest: Broken handling of unicode strings

Sat Feb 06 13:44:15 2010 cr2005 [...] u-club.de - Ticket created

Subject:

Broken handling of unicode strings

This is a follow-up to "clears utf-8 flag": https://rt.cpan.org/Public/Bug/Display.html?id=17919

* I believe the bug is in 'Digest' itself; all algorithms are affected.
* Digest->add() handles utf8 strings incorrectly (it clears the utf8 flag).
* The calculated digest of a utf-8 string is wrong!

To understand the problem, I have to start with an example of why and when to 'use utf8'. I presume you have a utf8-enabled box. I've got Debian GNU/Linux 5.0 here:

    $ echo $LANG
    de_DE.UTF-8
    $ echo -n blödsinn | hexdump -C
    00000000 62 6c c3 b6 64 73 69 6e 6e |bl..dsinn|

I know, this is a pain in the arse. Since this web page does not define a charset, it might mess up my post, so I will attach this report as a utf8-encoded text file.

    #!/bin/perl -w
    $a = 'blödsinn';
    printf "len of '$a' is %s\n", length $a;
    __END__

Save this Perl script as utf8, e.g. with vi:

    :set fileencoding=utf-8
    :w test.pl
    :q

    $ perl test.pl

output:

    len of 'blödsinn' is 9

(wrong.)

There are two reasons to use utf8:

- assign utf8 strings within a Perl script
- import the is_utf8() function

So let's fix the example:

    #!/bin/perl -w
    use utf8;
    $a = 'blödsinn'; # utf8::is_utf8($a) will now return 1
    printf "len of '$a' is %s\n", length $a;
    __END__

    $ perl test.pl

output:

    len of 'bl?dsinn' is 8

So the length (in characters) is now correct, but how about the garbled output? By default Perl does not set stdin/stdout/stderr to utf8. Read about the -C flag in man perlrun:

    $ perl -CSDL test.pl

output:

    len of 'blödsinn' is 8

Terrific. Now we've got a working setup to look into Digest's problem. Let's trash it:

    #!/bin/perl -w
    use utf8;
    use Digest;

    # the next line contains a random asia character:
    $s = 'greetz from asia 業 !';
    printf "string is '$s' len is: %s flag: %s\n", length($s), utf8::is_utf8($s);

    # pick any digest u like
    $dgst = Digest->new('SHA-1');

    $dgst->add($s); # 'this is line 12'
    printf "Digest of '$s' is %s\n", $dgst->hexdigest;
    __END__

Carefully copy'n'paste the fancy double-byte character. (I dunno what it means ;)

    $ perl -CSDL test.pl
    string is 'greetz from asia 業 !' len is: 20 flag: 1
    Wide character in subroutine entry at test.pl line 12.

Bang! Bämm. Epic fail. We can't cheat around this problem:

    #!/bin/perl -w
    use utf8;
    use Digest;

    $s = "blödsinn";
    printf "string is '$s' len is: %s flag: %s\n", length($s), utf8::is_utf8($s);

    # pick any digest u like
    $dgst = Digest->new('SHA-1');
    $dgst->add($s);
    printf "Digest is %s\n", $dgst->hexdigest;
    printf "string is '$s' len is: %s flag: %s\n", length($s), utf8::is_utf8($s);
    __END__

    $ perl -CSDL test.pl
    string is 'blödsinn' len is: 8 flag: 1
    Digest is c77a16d028753a1ae761ad8eb33f5bc307364a24
    string is 'blödsinn' len is: 8 flag:

    $ echo -n blödsinn | openssl dgst -sha1
    bd0f217087566043ca73d9e9ce81f7c9a4311872

Well. Here everything went wrong. If the utf8 flag is set on a string:

* It looks like Digest tries to convert it to latin1, which fails for characters not defined in latin1. The digest above, 'c77a16d028753a1ae761ad8eb33f5bc307364a24', is correct for a latin1 string of 'blödsinn'.
* Digest resets the is_utf8 flag, but the scalar still contains the utf8 encoded string, so later my application runs into big trouble.
* => Digest calculates "wrong" or refuses to work at all.

So there is an unnecessary character conversion happening under the hood of Digest.
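
One way to sidestep the behaviour described above, without patching Digest, is to encode the string explicitly and hand the digest function bytes instead of characters. The following is only a minimal sketch under a few assumptions: it uses the Digest::SHA and Encode modules rather than the Digest front end from the report, and the helper name sha1_of_text is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # string literals below are UTF-8 source text
use Digest::SHA qw(sha1_hex);
use Encode qw(encode);

# Hypothetical helper: hash the bytes of $text in the named encoding.
# It works on a copy of the argument, so the caller's scalar is untouched.
sub sha1_of_text {
    my ($text, $encoding) = @_;
    return sha1_hex(encode($encoding, $text));
}

my $s = 'blödsinn';

my $as_utf8   = sha1_of_text($s, 'UTF-8');        # hashes 62 6c c3 b6 64 ...
my $as_latin1 = sha1_of_text($s, 'ISO-8859-1');   # hashes 62 6c f6 64 ...

printf "UTF-8 bytes:   %s\n", $as_utf8;
printf "Latin-1 bytes: %s\n", $as_latin1;
printf "still flagged: %s\n", utf8::is_utf8($s) ? 'yes' : 'no';
```

If the figures in the report are right, the first line should match the openssl value (bd0f...) and the second the value Digest returned (c77a...). Because the encoding is chosen explicitly, the result does not depend on the state of the utf8 flag, and a string such as 'greetz from asia 業 !' can be hashed as UTF-8 without the "Wide character" error.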

Subject:

bugreport.digest


Tue Feb 09 11:21:08 2010 cr2005 [...] u-club.de - Correspondence added

From:

cr2005 [...] u-club.de

On Sat Feb 06 2010, 13:44:15, chr wrote:

> So there is an unnecessary character conversion happening under the hood of Digest.

Well, I looked into MD5.xs and SHA1.xs, and I'm wondering about the SvPVbyte gotcha. I'm using perl v5.10.0, and using SvPV instead fixes the issue for me:

    #undef SvPVbyte
    #define SvPVbyte SvPV

Now ->add() doesn't reset the utf8 flag, it eats multi-byte chars, and the hashes are correct.
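
As far as perlapi describes it, SvPVbyte first downgrades the scalar from Perl's internal UTF-8 form to one byte per character and croaks when a character does not fit. The same operation is exposed to Perl code as utf8::downgrade, so its effect can be explored without touching the XS sources. This is a minimal sketch only; fits_in_bytes is a hypothetical helper name, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: true if every character fits into a single byte.
# It downgrades a copy, so the caller's scalar keeps its representation.
sub fits_in_bytes {
    my ($text) = @_;
    # Second argument true: return false on failure instead of dying.
    return utf8::downgrade($text, 1) ? 1 : 0;
}

my $latin = 'blödsinn';    # U+00F6 is inside the Latin-1 range
my $wide  = '廢話';         # both characters are above U+00FF

my $copy = $latin;
utf8::downgrade($copy);    # in place: the flag is cleared, the characters stay the same

printf "flag before: %d, after: %d\n",
    utf8::is_utf8($latin) ? 1 : 0, utf8::is_utf8($copy) ? 1 : 0;
printf "same characters: %s\n", $copy eq $latin ? 'yes' : 'no';
printf "latin fits: %d, wide fits: %d\n",
    fits_in_bytes($latin), fits_in_bytes($wide);
```

The failing case is the one that surfaces as "Wide character in subroutine entry" in the report. SvPV skips the downgrade and returns the scalar's internal bytes as they are.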

Wed Feb 10 11:39:37 2010 cr2005 [...] u-club.de - Correspondence added

From:

cr2005 [...] u-club.de

    $ perl test-digest-sha1-unicode.pl
    1..12
    ok 1 - 'nonsense' is utf8
    ok 2 - length 8
    ok 3 - hash cb1dc474e185777dad218b7d60f2781723d8190b
    not ok 4 - 'nonsense' is still utf8
    # Failed test ''nonsense' is still utf8'
    # at test-digest-sha1-unicode.pl line 26.
    ok 5 - 'blödsinn' is utf8
    ok 6 - length 8
    not ok 7 - hash bd0f217087566043ca73d9e9ce81f7c9a4311872
    # Failed test 'hash bd0f217087566043ca73d9e9ce81f7c9a4311872'
    # at test-digest-sha1-unicode.pl line 25.
    # got: 'c77a16d028753a1ae761ad8eb33f5bc307364a24'
    # expected: 'bd0f217087566043ca73d9e9ce81f7c9a4311872'
    not ok 8 - 'blödsinn' is still utf8
    # Failed test ''blödsinn' is still utf8'
    # at test-digest-sha1-unicode.pl line 26.
    ok 9 - '廢話' is utf8
    ok 10 - length 2
    Wide character in subroutine entry at test-digest-sha1-unicode.pl line 24, <DATA> line 3.
    # Looks like you planned 12 tests but ran 10.
    # Looks like you failed 3 tests of 10 run.
    # Looks like your test exited with 255 just after 10.

patched (using SvPV):

    $ LD_PRELOAD=~/.cpan/build/Digest-SHA1-2.12-amnPuM/blib/arch/auto/Digest/SHA1/SHA1.so perl test-digest-sha1-unicode.pl
    1..12
    ok 1 - 'nonsense' is utf8
    ok 2 - length 8
    ok 3 - hash cb1dc474e185777dad218b7d60f2781723d8190b
    ok 4 - 'nonsense' is still utf8
    ok 5 - 'blödsinn' is utf8
    ok 6 - length 8
    ok 7 - hash bd0f217087566043ca73d9e9ce81f7c9a4311872
    ok 8 - 'blödsinn' is still utf8
    ok 9 - '廢話' is utf8
    ok 10 - length 2
    ok 11 - hash aabc1a331f97ba4b4157abca134812992d22dccf
    ok 12 - '廢話' is still utf8

Subject:

test-digest-sha1-unicode.pl

#!/usr/bin/perl -w

use strict;
use utf8;
use Test::More;
use Digest::SHA1;

binmode Test::More->builder->output, ':utf8';
binmode Test::More->builder->failure_output, ':utf8';

my @tests = <DATA>;    # one test case per line, see __DATA__

plan tests => 4 * (scalar @tests);

my $dgst = Digest::SHA1->new;

foreach my $test (@tests) {
    chomp $test;
    my ($str, $len, $hash) = split(/\t+/, $test);    # fields are tab-separated

    ok(utf8::is_utf8($str), "'$str' is utf8");
    is(length($str), $len, "length $len");
    $dgst->reset;
    $dgst->add($str);
    is($dgst->hexdigest, $hash, "hash $hash");
    ok(utf8::is_utf8($str), "'$str' is still utf8");    # add() must not clear the flag
}

__DATA__
nonsense	8	cb1dc474e185777dad218b7d60f2781723d8190b
blödsinn	8	bd0f217087566043ca73d9e9ce81f7c9a4311872
廢話	2	aabc1a331f97ba4b4157abca134812992d22dccf