Bug #54490 for Digest-SHA-PurePerl: unicode problems

Wed Feb 10 13:10:31 2010 cr2005 [...] u-club.de - Ticket created

Subject:

unicode problems

Hi!

The PurePerl version has problems when Unicode strings are added. It's similar to:

https://rt.cpan.org/Public/Bug/Display.html?id=54369

Funny, this is not an .xs issue. Fixing SHA/PurePerl.pm is easy:

    use bytes;

PS: Your Digest::SHA works correctly! ;)

Mon Jan 14 20:10:42 2013 mshelor [...] cpan.org - TimeEstimated changed from (no value) to '60'

Mon Jan 14 20:10:42 2013 mshelor [...] cpan.org - TimeWorked changed from (no value) to '60'

Mon Jan 14 20:10:42 2013 mshelor [...] cpan.org - Given to MSHELOR

Mon Jan 14 20:10:43 2013 mshelor [...] cpan.org - Severity Normal added

Mon Jan 14 20:10:43 2013 mshelor [...] cpan.org - Fixed in 5.81 added

Mon Jan 14 20:29:53 2013 mshelor [...] cpan.org - Correspondence added

On Wed Feb 10 13:10:31 2010, chr wrote:

> The PurePerl has got problems with adding unicode strings. It's
> similar to:
>
> https://rt.cpan.org/Public/Bug/Display.html?id=54369
>
> Fixing SHA/PurePerl.pm is easy:
>
> use bytes;
>
> Your Digest::SHA works correct!

Note that "adding Unicode strings" to digest objects is a meaningless concept: SHA algorithms operate on sequences of bytes, whereas Unicode strings contain wide characters.

By convention (ref. Digest::SHA1 and Digest::MD5), the appropriate response upon receiving a wide character is for the digest function to croak. So, this is what Digest::SHA and Digest::SHA::PurePerl now do as of 5.74 and 5.81, respectively.

Mark

Mon Jan 14 20:29:54 2013 The RT System itself - Status changed from 'new' to 'open'

Mon Jan 14 20:29:54 2013 mshelor [...] cpan.org - Status changed from 'open' to 'resolved'

Tue Jan 15 09:38:51 2013 cr2005 [...] u-club.de - Correspondence added

From:

cr2005 [...] u-club.de

On Mon Jan 14 2013, 20:29:53, MSHELOR wrote:

> Note that "adding Unicode strings" to digest objects is a meaningless

What is meaningless? Adding unicode|iso-8859|whatever strings, adding a sequence of image data? Adding any data?

> concept: SHA algorithms operate on sequences of bytes, whereas
> Unicode strings contain wide characters.

Yes, SHA should operate on bytes and not on chars. A digest should not care what those bytes are. It is Perl that must treat Unicode-encoded strings specially in string operations, e.g. length() and substr().

> By convention (ref. Digest::SHA1 and Digest::MD5), the appropriate
> response upon receiving a wide character is for the digest function
> to croak. So, this is what Digest::SHA and Digest::SHA::PurePerl
> now do as of 5.74 and 5.81, respectively.

I don't see a convention against using strings as a digest message. From "Digest::MD5":

> Since the MD5 algorithm is only defined for strings of bytes, it can
> not be used on strings that contains chars with ordinal number above
> 255.

That is a mistake! Perl has the "use bytes" pragma to force byte semantics rather than character semantics (see perldoc bytes).

Tue Jan 15 09:38:52 2013 The RT System itself - Status changed from 'resolved' to 'open'

Tue Jan 15 20:04:24 2013 mshelor [...] cpan.org - Correspondence added

On Tue Jan 15 09:38:51 2013, chr wrote:

> On Mon Jan 14 2013, 20:29:53, MSHELOR wrote:
>
> > Note that "adding Unicode strings" to digest objects is a meaningless
>
> What is meaningless? Adding unicode|iso-8859|whatever strings,
> adding a sequence of image data? Adding any data?

A Unicode string is a sequence of wide characters. This is not the same thing as a sequence of bytes, since the wide characters can have ordinal values larger than 255. There are many different ways to encode a Unicode string into a sequence of bytes. UTF-8 is only one of them. Therefore the digest of a Unicode string is ambiguous.

Image data, binary files, ordinary strings, etc., by contrast, are sequences of bytes (unless otherwise specified) and hence can be added to digest objects.
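The ambiguity can be shown with a minimal sketch, assuming a Perl with the core Encode and Digest::SHA modules. The sample string and the variable names are illustrative only. It encodes the same characters in two ways and digests each byte sequence.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Digest::SHA qw(sha1_hex);

# A string with one wide character (U+263A, ordinal 9786 > 255).
my $text = "smile \x{263A}";

# Two equally valid byte encodings of the same characters.
my $utf8_bytes  = encode('UTF-8',    $text);
my $utf16_bytes = encode('UTF-16LE', $text);

# The byte sequences differ in length and content ...
printf "UTF-8:    %d bytes\n", length $utf8_bytes;
printf "UTF-16LE: %d bytes\n", length $utf16_bytes;

# ... and so do the digests computed from them.
my $digest_utf8  = sha1_hex($utf8_bytes);
my $digest_utf16 = sha1_hex($utf16_bytes);
print "digests differ\n" if $digest_utf8 ne $digest_utf16;
```

Neither digest is "the" digest of the character string; each belongs to one particular encoding of it.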

> Perl got the "use bytes" pragma to force byte semantics
> rather than character semantics (see perldoc bytes)

Yes, but that's just a reflection of the way Perl handles Unicode strings internally, viz. by encoding them in UTF-8. That's why "use bytes" is generally discouraged: it assumes the rest of the world is handling Unicode strings exactly the way Perl does internally, which is not always true.

A Unicode string is a larger and more powerful abstraction than a byte sequence. If you choose to view it as a UTF-8 encoded sequence of bytes for the purpose of your application, that's certainly fine. But you'll have to be explicit by first saying "utf8::encode($string)" before sending it to the digest routines.
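A minimal sketch of that explicit step, assuming the caller wants the digest of the UTF-8 bytes. The sample string and variable names are made up for illustration. utf8::encode() works in place, so the sketch encodes a copy and leaves the character string untouched.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

my $string = "caf\x{e9} \x{20AC}";   # contains U+20AC, a wide character

# utf8::encode() modifies its argument, so work on a copy.
my $bytes = $string;
utf8::encode($bytes);

# $bytes now holds only ordinals 0..255 and can be digested.
my $digest = sha1_hex($bytes);
print "$digest\n";
```

After the call, $string still has 6 characters while $bytes has 9 bytes, since the two non-ASCII characters take 2 and 3 bytes in UTF-8.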

Tue Jan 15 20:04:25 2013 mshelor [...] cpan.org - Status changed from 'open' to 'resolved'

Sun Jan 20 12:27:06 2013 cr2005 [...] u-club.de - Correspondence added

From:

cr2005 [...] u-club.de

Sorry, one thing remains ...

When I filed this bug report, 3 years ago, the bug was different:

    Digest::SHA worked fine for me.
    Digest::SHA::PurePerl produced different results on the same data (if fed with utf8).
    Digest::MD5 calculated correctly but silently dropped the "is_utf8" flag from $value.

That screwed up further processing of $value, which caused the bug I was originally hunting for.

I was expecting the Digest:: modules would do the right thing regardless of the mess I feed them. And some seemed to do so. My mistake. I hoped that could be fixed.

From the source:

    $bitstr = substr($bitstr, $numbits >> 3, _BYTECNT($bitcnt));

substr() operates on chars, not on bytes, unless you "use bytes" in that scope, or use pack/unpack, or presume $bitstr is a byte string and pray that I used utf8::encode() ... otherwise croak.

I looked into that stuff again and there is one issue remaining: it does not croak as expected. I attach a test file.

1. It passes fine if you uncomment and use utf8::encode.
2. It fails as expected when not using utf8::encode. This is how you want it, isn't it? It used to work with Digest::SHA 3 years ago.
3. But it does not croak as expected:

    string #1: no croak, but just works
    string #3: croaks fine
    string #2: no croak, wrong result

And the string was "touched": is_utf8=0 length=8

Attachment: ttest-digest-sha-utf8.pl.gz (application/x-gzip, 634b)

Sun Jan 20 12:27:07 2013 The RT System itself - Status changed from 'resolved' to 'open'

Sun Jan 20 22:03:43 2013 mshelor [...] cpan.org - Correspondence added

On Sun Jan 20 12:27:06 2013, chr wrote:

> When I filed this bug report, 3 years ago, the bug was different:
>
> Digest::SHA worked fine for me
> Digest::SHA::PurePerl produced different results on the same data (if
> fed with utf8).
>
> I was expecting the Digest:: would do the right thing regardless of the
> mess I feed them. And some seemed to do so.

Thanks, Chris, for the comment and very readable test script.

My response, briefly summarized, is that both Digest::SHA and Digest::SHA::PurePerl now handle Unicode strings properly. This was not the case 3 years ago, when you felt that Digest::SHA was working fine.

For example, let's take the case of your second test vector where the input is "blödsinn". This string is the same as:

    "bl" . chr(246) . "dsinn"

which is a perfectly acceptable 8-character string, and can be digested straight away. Its expected SHA-1 digest value is

    c77a16d028753a1ae761ad8eb33f5bc307364a24

not

    bd0f217087566043ca73d9e9ce81f7c9a4311872

The latter is the digest value of the UTF-8 encoding of "blödsinn":

    "bl".chr(195).chr(182)."dsinn"

Note that the string "blödsinn" contains no wide characters (i.e. no chars with ordinal values greater than 255), so there's no reason to croak on such input.

Admittedly this is all rather confusing. I've added a section to the documentation of the next version explaining the rule for handling Unicode input. But I expect confusion will linger. That's understandable given that even Perl 5.6 was fatally confused when it came to Unicode. This makes using the "pack" function with C-C0-U-U0 templates highly unpredictable and inconsistent across Perl versions.

Right now I'm revising both Digest::SHA modules to function consistently across all Perls, from 5.6 onwards. This isn't straightforward because of the need to work around Perl 5.6 bugs in the SvPVbyte macro and other Unicode-related operations.

Mark
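The two cases from this message can be reproduced with a short sketch, assuming a current Digest::SHA. The variable names are illustrative. The o-umlaut is written as a code point, so the result does not depend on the encoding of the script file. According to the message above, the first line should print c77a16d0... and the second bd0f2170...; the sketch prints the digests without hard-coding them.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# The second test vector, built from code points.
my $chars = "bl" . chr(246) . "dsinn";   # 8 characters, none above 255

# Explicit UTF-8 encoding of a copy: chr(246) becomes chr(195).chr(182).
my $utf8 = $chars;
utf8::encode($utf8);                     # 9 bytes

printf "%-7s %s\n", 'as is:', sha1_hex($chars);
printf "%-7s %s\n", 'UTF-8:', sha1_hex($utf8);
```

Which of the two digests is "right" depends only on which byte sequence the application intends to store or transmit.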

Sun Jan 20 22:03:45 2013 mshelor [...] cpan.org - Status changed from 'open' to 'resolved'

Mon Jan 21 13:32:00 2013 cr2005 [...] u-club.de - Correspondence added

From:

cr2005 [...] u-club.de


> Right now I'm revising both Digest::SHA modules to function
> consistently across all Perls, from 5.6 onwards. This isn't
> straightforward because of the need to work around Perl 5.6 bugs in
> the SvPVbyte macro and other Unicode-related operations.

I think we do agree on most of this stuff. What to simply add to the documentation?

"This Digest operates on bytes not with strings. Strings must be presented as bytes within a desired Encoding."

(In contrast to this, music fingerprints come to mind, which operate on the sound data.)

As you are now reviewing the stuff, I'll talk about its side effects to keep in mind. But where to start? This stuff affects all Digest:: modules. If I recall the original bug using Digest::MD5, it had this side effect:

    my $s; # a string, may be is_utf8()
    my $t; # a string, may be is_utf8()

    $digest_md5->add($s); # my mistake: adds a string
    # now $s is wasted, it's no longer a valid string value
    # is_utf8() is now 0, length() is wrong
    # your mistake?
    # croak at least? I see, not easy: no types in script's values
    # die 'rtfm' if (is_utf8($data));

    $t .= "The value of s is $s";

Now $s has tainted $t ... somewhere else you _may_ get garbled output. In the case of Digest::SHA, $s is still a valid and usable string, but it's no longer is_utf8. I regard it as a bug if add() garbles my data. 1st workaround:

    $digest_md5->add("$s"); # still a mistake

So I have to wrap around this anyhow. Adding strings like:

    $digest->add_string($encoding, $string)

or use utf8::encode().

So, I just read the Digest documentation about add() and addfile() ... they could use some clarification about strings. And the $io_handle should be opened with ':raw', set to binary, shouldn't it?

About my test script: I'm expecting

    bd0f217087566043ca73d9e9ce81f7c9a4311872

as the correct result because I want a digest of its UTF-8 byte encoding, because I'm going to store my string data as utf8 in my database or filesystem, or output it to the web. I calculated the test values by storing the strings to utf8 encoded files, verified them with hexdump -C and ran openssl dgst -sha < nonsense ...

So what else can go wrong? Oh, yes, a utf8 file may start with a BOM ...

Confusion about 'use utf8'

The test script source is a utf8 encoded file. Run hexdump on it:

    000003f0 64 38 31 39 30 62 0a 62 6c c3 b6 64 73 69 6e 6e |d8190b.bl..dsinn|

Great headache if you write a test script about unicode etc.: perl's runtime configuration of STDOUT is usually not utf8, but most linux terminals are. So if it looks wrong, it may be ok, and if it looks ok, it may be wrong. This is about 'use utf8' and the binmode stuff ... or use the -CSDL switch. I do believe I did that the right way (assuming your terminal is utf8 too). This bug is for the packagers who deliver perl runtimes and terminals without setting $ENV{PERL_UNICODE} according to $LANG ?!

So I may rewrite the test so it reads the strings from files instead. This will outline the problem by using open() in a correct way. To add confusion, I can create the string test files with some other encodings.

However, I regard this bug report as solved by adding a section to the general Digest documentation.

chris
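The file-based check described above can be sketched as follows, assuming the core File::Temp, Encode and Digest::SHA modules. The temporary file and the variable names are illustrative and are not taken from the attached test script. The handle is opened with ':raw' so that the digest sees exactly the stored bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);
use Digest::SHA qw(sha1_hex);

my $text  = "bl\x{f6}dsinn";          # the thread's second test string
my $bytes = encode('UTF-8', $text);   # the bytes that will be stored

# Write the encoded bytes to a temporary file without any I/O layers.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $bytes;
close $out or die "close failed: $!";

# Read the file back in raw mode and digest it through the handle.
open my $in, '<:raw', $path or die "cannot open $path: $!";
my $file_digest = Digest::SHA->new(1)->addfile($in)->hexdigest;
close $in;

# The digest of the file equals the digest of the encoded string.
my $string_digest = sha1_hex($bytes);
print "file and string digests match\n" if $file_digest eq $string_digest;
```

Encoding once, at the boundary where the data is written, keeps the digest of the in-memory bytes and the digest of the file in agreement.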

Mon Jan 21 13:32:02 2013 The RT System itself - Status changed from 'resolved' to 'open'

Mon Jan 21 20:38:37 2013 mshelor [...] cpan.org - Correspondence added

On Mon Jan 21 13:32:00 2013, chr wrote:

> I think we do agree on most of this stuff. What to simply add to the
> documentation?
>
> "This Digest operates on bytes not with strings. Strings must be
> presented as bytes within a desired Encoding."
> ...
> However, I regard this bug report is solved by adding a section to
> the general Digest documentation.

OK. I too am bothered by the fact that the digest routines "alter" data that's fed to them. But that alteration is really only seen by Perl (and by inspection functions like utf8::is_utf8) and never alters the meaning of the data when using the default (viz. character) semantics.

Here's the new section on Unicode I'm adding to both of the Digest::SHA modules. Comments are welcomed:

    =head1 UNICODE

    Perl supports Unicode strings as of version 5.6. Such strings may
    contain wide characters, namely, characters whose ordinal values are
    greater than 255. This can cause problems for digest algorithms such
    as SHA that are specified to operate on sequences of bytes.

    The rule by which Digest::SHA handles a Unicode string is easy to
    state, but potentially confusing to grasp: the string is interpreted
    as a sequence of bytes, where each byte is equal to the ordinal value
    (viz. code point) of its corresponding Unicode character. That way,
    the Unicode version of the string 'abc' has exactly the same digest
    value as the ordinary string 'abc'.

    Since a wide character does not fit into a byte, the Digest::SHA
    routines croak if they encounter one. Whereas if a Unicode string
    contains no wide characters, the module accepts it quite happily.
    The following code illustrates the two cases:

        $str1 = pack('U*', (0..255));
        print sha1_hex($str1);    # ok

        $str2 = pack('U*', (0..256));
        print sha1_hex($str2);    # croaks

    =head1 NIST STATEMENT ON SHA-1

Mark
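A runnable variant of the two cases, offered as a sketch that assumes Digest::SHA 5.74 or later (the version this thread names for the croaking behaviour). The variable names and the eval wrapper are illustrative additions. The eval traps the croak so the script can report it instead of dying.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

my $ok_str   = pack('U*', 0 .. 255);   # Unicode string, no wide characters
my $wide_str = pack('U*', 0 .. 256);   # last character has ordinal 256

# A Unicode string without wide characters digests like the plain
# byte string that has the same ordinal values.
my $plain = join '', map { chr } 0 .. 255;
my $same  = sha1_hex($ok_str) eq sha1_hex($plain);
print "no wide chars: ", ($same ? "same digest as byte string" : "mismatch"), "\n";

# The wide character makes sha1_hex() croak; eval catches the error.
my $digest = eval { sha1_hex($wide_str) };
my $error  = $@;
print "wide char: ", (defined $digest ? "accepted\n" : "croaked: $error");
```

A caller that may receive wide characters can use the same eval pattern to decide whether to encode the input explicitly and retry.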

Mon Jan 21 20:38:38 2013 mshelor [...] cpan.org - Status changed from 'open' to 'resolved'