Bug #82378 for Digest-SHA: UTF-8 behaviour not documented

Thu Jan 03 08:32:19 2013 victor [...] vsespb.ru - Ticket created

Subject: UTF-8 behaviour not documented

Dies with error "Wide character in subroutine entry" in newer versions. Works fine in older. It's not documented (except some unclear (for end-user) record in ChangeLog). I think this should be documented.

Thu Jan 03 18:35:38 2013 mshelor [...] cpan.org - Correspondence added 15 min

On Thu Jan 03 08:32:19 2013, vsespb wrote:

> Dies with error "Wide character in subroutine entry" in newer versions.
> Works fine in older.
>
> It's not documented (except some unclear (for end-user) record in
> ChangeLog).
>
> I think this should be documented.

Your report is incomplete since it contains no test case.

Given the error message, it would appear as though you attempted to feed something like a Unicode character to Digest::SHA. Digest algorithms are defined to operate on sequences of bytes only, per specification.

Since version 5.8, Perl accepts wide characters in strings. It makes no sense to pass such strings to a digest algorithm until they've been byte-encoded through something like UTF-8. This is common knowledge: it's not up to Digest::SHA or any other digest module to explain or document.
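A minimal sketch of the byte-encoding step described above, assuming the input is a Perl character string and that UTF-8 is the intended encoding (the choice of encoding is an assumption; other byte encodings would give different digests). The variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # the source file itself is saved as UTF-8
use Encode qw(encode_utf8);
use Digest::SHA qw(sha256_hex);

my $text  = 'тест';                # character string: 4 characters
my $bytes = encode_utf8($text);    # byte string: 8 octets, safe to hash

printf "%d characters, %d bytes\n", length($text), length($bytes);
print sha256_hex($bytes), "\n";    # digest of the UTF-8 octets
```

The digest is computed over the encoded octets, so two programs only agree on the result if they agree on the encoding.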

Thu Jan 03 18:35:38 2013 The RT System itself - Status changed from 'new' to 'open'

Thu Jan 03 18:35:39 2013 mshelor [...] cpan.org - Status changed from 'open' to 'rejected'

Thu Jan 03 19:04:24 2013 victor [...] vsespb.ru - Correspondence added

I agree that this is correct behaviour, as SHA is defined only for byte strings.

However, I think this should be documented, especially if the behaviour differs between versions (i.e. I can tell that the older version contains a bug: no error message in this case).

Test case:

    #!/usr/bin/perl
    use utf8;
    use Encode;
    use Digest::SHA qw/sha256_hex/;
    print sha256_hex(encode_utf8('тест'));
    print "\n";
    print sha256_hex('тест');
    print "\n";

The old version prints the same SHA value twice. The new version prints one (the same) SHA and then dies with the error message.

So without documentation, a bug in the user's program will remain unnoticed and will cause a crash when the program is used with a new version of Digest::SHA (i.e. an incompatibility between different versions of Digest::SHA). (And sometimes a crash is worse than such a bug in the user's program: the bug does not affect anything, as the SHA is the same anyway, at least in my case.)

By the way, Digest::SHA::PurePerl does not die with that error.

On Fri Jan 04 03:35:38 2013, MSHELOR wrote:

> On Thu Jan 03 08:32:19 2013, vsespb wrote:
>
> > Dies with error "Wide character in subroutine entry" in newer versions.
> > Works fine in older.
> >
> > It's not documented (except some unclear (for end-user) record in
> > ChangeLog).
> >
> > I think this should be documented.
>
> Your report is incomplete since it contains no test case.
>
> Given the error message, it would appear as though you attempted to feed
> something like a Unicode character to Digest::SHA. Digest algorithms
> are defined to operate on sequences of bytes only, per specification.
>
> Since version 5.8, Perl accepts wide characters in strings. It makes no
> sense to pass such strings to a digest algorithm until they've been
> byte-encoded through something like UTF-8. This is common knowledge:
> it's not up to Digest::SHA or any other digest module to explain or
> document.
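One way to see which of the two behaviours from the test case an installed Digest::SHA exhibits is to wrap the unencoded call in `eval`. This is only a sketch: it assumes, per the report above, that newer versions die with "Wide character in subroutine entry" and older ones return a digest, and it does not claim which version numbers do which.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# The same four Cyrillic characters as in the test case, written as escapes
# so the script does not depend on the source file's encoding.
my $text = "\x{442}\x{435}\x{441}\x{442}";

# Trap a possible die instead of letting it end the program.
my $digest = eval { sha256_hex($text) };
my $error  = $@;

if (defined $digest) {
    print "no error raised: $digest\n";
} else {
    print "died: $error";
}
print "Digest::SHA version: ", Digest::SHA->VERSION, "\n";
```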

Thu Jan 03 19:04:25 2013 The RT System itself - Status changed from 'rejected' to 'open'

Fri Jan 04 05:20:20 2013 mshelor [...] cpan.org - Correspondence added

On Thu Jan 03 19:04:24 2013, vsespb wrote:

> I agree that this is correct behaviour, as SHA is defined only for byte
> strings.
>
> However, I think this should be documented, especially if the behaviour
> differs between versions (i.e. I can tell that the older version
> contains a bug: no error message in this case).
>
> Test case:
>
>     #!/usr/bin/perl
>     use utf8;
>     use Encode;
>     use Digest::SHA qw/sha256_hex/;
>     print sha256_hex(encode_utf8('тест'));
>     print "\n";
>     print sha256_hex('тест');
>     print "\n";
>
> The old version prints the same SHA value twice.
> The new version prints one (the same) SHA and then dies with the error
> message.

Thanks for supplying a test case. But note that the statement

    print sha256_hex('тест');

has no meaning, since SHA and other digest algorithms operate on sequences of bytes, not on Unicode and other wide character data. In general, when invalid data is fed to a program, the output is undefined: garbage in, garbage out. The fact that the output from invalid data changes from version to version is of no consequence.

But I do understand your frustration, and therefore may decide to very briefly warn against wide characters in future versions of the documentation. If so, I'll give you due acknowledgement. However, such practice is usually not recommended, since it clutters the documentation with extraneous material, distracts the reader, and makes it more difficult to find information directly pertinent to the module.
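Since the result for wide-character input is treated as undefined here, a caller who wants the same failure on every Digest::SHA version could check the input first. The sketch below is one possible guard, not part of Digest::SHA: `sha256_hex_bytes` is a hypothetical helper name, and it relies on the core function `utf8::downgrade` to detect characters above 0xFF.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Carp qw(croak);
use Digest::SHA qw(sha256_hex);

# Hypothetical helper: hash only strings that are representable as bytes.
sub sha256_hex_bytes {
    my ($data) = @_;
    my $copy = $data;    # utf8::downgrade modifies its argument in place
    # Second argument true: return false instead of dying when the string
    # contains a character above 0xFF.
    utf8::downgrade($copy, 1)
        or croak "sha256_hex_bytes: wide character in input, encode it first";
    return sha256_hex($copy);
}

print sha256_hex_bytes('abc'), "\n";

# Four Cyrillic characters, unencoded: rejected before Digest::SHA sees them.
my $ok = eval { sha256_hex_bytes("\x{442}\x{435}\x{441}\x{442}"); 1 };
print "rejected: $@" unless $ok;
```

The guard makes the failure explicit in the caller's own code, so it does not depend on how a particular Digest::SHA release reacts to such input.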

Fri Jan 04 05:20:22 2013 mshelor [...] cpan.org - Status changed from 'open' to 'rejected'