Subject: | Broken handling of unicode strings |
This is a follow-up to
"clears utf-8 flag"
https://rt.cpan.org/Public/Bug/Display.html?id=17919
* I believe the bug is in 'Digest' itself; all algorithms are affected.
* Digest->add() handles UTF-8 strings incorrectly (it clears the utf8 flag).
* The digest calculated for a UTF-8 string is wrong!
To explain the problem, I have to start with an example of why and
when to 'use utf8'.
I presume you have a UTF-8 enabled box. I am on Debian GNU/Linux 5.0:
$ echo $LANG
de_DE.UTF-8
$ echo -n blödsinn | hexdump -C
00000000 62 6c c3 b6 64 73 69 6e 6e |bl..dsinn|
I know, this is a pain in the arse.
Since this web page does not declare a charset, it might mess up my
post.
I will attach this report as a UTF-8 encoded text file.
#!/bin/perl -w
$a = 'blödsinn';
printf "len of '$a' is %s\n", length $a;
__END__
Save this Perl script as UTF-8, e.g. with vi:
:set fileencoding=utf-8
:w test.pl
:q
$ perl test.pl
output:
len of 'blödsinn' is 9
(Wrong: 9 is the number of bytes, as in the hexdump above, not the number of characters.)
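The difference between 9 and 8 can be reproduced without depending on how the script file is saved. The following is only a sketch, assuming the core Encode module; the variable names are made up for illustration. It writes the nine bytes from the hexdump as escapes and decodes them into characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The nine UTF-8 bytes from the hexdump, written as escapes so that the
# encoding of this file does not matter.
my $bytes = "bl\xc3\xb6dsinn";

# Decoding turns the two bytes c3 b6 into the single character U+00F6.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length $bytes, length $chars;
```

length() counts whatever the scalar holds: bytes before decoding, characters afterwards.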
There are two reasons to use utf8:
- assign utf8 strings within a perl script
- import the is_utf8() function
So let's fix the example:
#!/bin/perl -w
use utf8;
$a = 'blödsinn';
# utf8::is_utf8($a) will now return 1
printf "len of '$a' is %s\n", length $a;
__END__
$ perl test.pl
output:
len of 'bl?dsinn' is 8
So the length (in characters) is now correct, but what about the garbled
output? By default Perl does not treat STDIN/STDOUT/STDERR as UTF-8. Read
about the -C switch in 'man perlrun':
$ perl -CSDL test.pl
output:
len of 'blödsinn' is 8
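If passing -C on the command line is inconvenient, much the same effect on the output streams can be had from inside the script with binmode. This is a minimal sketch, not the only way to do it; the string is written with an escape so that the example does not depend on the file encoding:

```perl
use strict;
use warnings;

# Encode characters as UTF-8 on output, roughly what -CS does for
# STDOUT and STDERR.
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

my $word = "bl\x{f6}dsinn";    # \x{f6} is o-umlaut, U+00F6

printf "len of '%s' is %d\n", $word, length $word;
```

The advantage of doing it in the script is that the result no longer depends on how the script is invoked.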
Terrific. Now we have a working setup to look into Digest's problem.
Let's break it:
#!/bin/perl -w
use utf8;
use Digest;
# the next line contains a random Asian character:
$s = 'greetz from asia 業 !';
printf "string is '$s' len is: %s flag: %s\n", length($s), utf8::is_utf8($s);
# pick any digest you like
$dgst = Digest->new('SHA-1');
$dgst->add($s); # 'this is line 12'
printf "Digest of '$s' is %s\n", $dgst->hexdigest;
__END__
Carefully copy and paste the fancy multi-byte character. (I don't know
what it means ;)
$ perl -CSDL test.pl
string is 'greetz from asia 業 !' len is: 20 flag: 1
Wide character in subroutine entry at test.pl line 12.
Bang! Bämm. Epic fail.
We can't cheat our way around this problem by sticking to characters that exist in Latin-1:
#!/bin/perl -w
use utf8;
use Digest;
$s = "blödsinn";
printf "string is '$s' len is: %s flag: %s\n", length($s), utf8::is_utf8($s);
# pick any digest you like
$dgst = Digest->new('SHA-1');
$dgst->add($s);
printf "Digest is %s\n", $dgst->hexdigest;
printf "string is '$s' len is: %s flag: %s\n", length($s), utf8::is_utf8($s);
__END__
$ perl -CSDL test.pl
string is 'blödsinn' len is: 8 flag: 1
Digest is c77a16d028753a1ae761ad8eb33f5bc307364a24
string is 'blödsinn' len is: 8 flag:
$ echo -n blödsinn | openssl dgst -sha1
bd0f217087566043ca73d9e9ce81f7c9a4311872
Well, here everything went wrong. If the utf8 flag is set on a string:
* It looks like Digest tries to convert it to Latin-1, which fails for
characters not defined in Latin-1.
The digest above, 'c77a16d028753a1ae761ad8eb33f5bc307364a24', is
correct for the Latin-1 encoding of 'blödsinn', not for the UTF-8
encoding that openssl hashed.
* Digest clears the utf8 flag, but the scalar still contains the
utf8 encoded string,
so my application runs into big trouble later on.
* => Digest either calculates a "wrong" digest or refuses to work at all.
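Both symptoms can be imitated in plain Perl with utf8::downgrade(). Whether the Digest backends do exactly this internally is an assumption on my part; the sketch only shows that a downgrade to the native 8-bit encoding fails for a character outside Latin-1 and clears the flag for one inside it. The hash names are made up for illustration:

```perl
use strict;
use warnings;

my %input = (
    wide   => "greetz from asia \x{696d} !",   # U+696D has no Latin-1 equivalent
    narrow => "bl\x{f6}dsinn",                 # U+00F6 exists in Latin-1
);
# Put the narrow string into the same flagged state as a literal
# written under 'use utf8'.
utf8::upgrade($input{narrow});

my %result;
for my $name (sort keys %input) {
    my $copy = $input{$name};
    # Second argument true: return false on failure instead of dying.
    my $ok = utf8::downgrade($copy, 1) ? 1 : 0;
    $result{$name} = {
        downgraded => $ok,
        flag_after => utf8::is_utf8($copy) ? 1 : 0,
        length     => length $copy,
    };
    printf "%-6s downgraded: %d  flag afterwards: %d  length: %d\n",
        $name, @{ $result{$name} }{qw(downgraded flag_after length)};
}
```

Working on a copy keeps the caller's scalar untouched, which is what I would have expected from add() as well.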
So there is an unnecessary character conversion happening under the
hood of Digest.
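Until this is sorted out, a workaround that should sidestep both the "Wide character" error and the flag change is to choose the encoding explicitly and feed the resulting bytes to add(). This is only a sketch, assuming the core Encode module and that Digest->new('SHA-1') finds a backend such as Digest::SHA. If my analysis holds, the UTF-8 line should agree with the openssl value quoted above and the Latin-1 line with the value Digest reported:

```perl
use strict;
use warnings;
use Digest;
use Encode qw(encode);

my $s = "bl\x{f6}dsinn";    # 'bloedsinn' with o-umlaut, as characters

# Digests are defined over bytes, so encode a copy and hash that.
my %bytes = (
    'UTF-8'   => encode('UTF-8',      $s),
    'Latin-1' => encode('ISO-8859-1', $s),
);

my %digest;
for my $name (sort keys %bytes) {
    my $dgst = Digest->new('SHA-1');
    $dgst->add($bytes{$name});
    $digest{$name} = $dgst->hexdigest;
    printf "%-8s %d bytes  %s\n", $name, length $bytes{$name}, $digest{$name};
}
```

The original $s is never handed to add(), so its flag and contents stay as they were.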
Subject: | bugreport.digest |