Bug #93139 for Digest-SHA: Digest::SHA / unicode: the use of SvPVbyte instead of SvPV, mangles the data of correct, UTF-8 enabled scalars

Tue Feb 18 11:47:49 2014 achim.adam [...] univie.ac.at - Ticket created

CC:	MARKOV Solutions <solutions [...] overmeer.net>
Subject:	Digest::SHA / unicode: the use of SvPVbyte instead of SvPV, mangles the data of correct, UTF-8 enabled scalars
Date:	Tue, 18 Feb 2014 17:47:33 +0100
To:	bug-Digest-SHA [...] rt.cpan.org
From:	Achim Adam <achim.adam [...] univie.ac.at>

hi, we are verifying xmldsig signatures using Digest::SHA (5.86), and have noticed that the unicode awareness that was added in 5.74 (i.e. the use of SvPVbyte instead of SvPV in SHA.xs's add() or sha1() functions), leads to the mangling of the input data for correct UTF-8-enabled scalars, and the subsequent generation of incorrect digests. consider for example a file named UTF8, with a content of 2 bytes: 0xC3 0xA9. (this is the correct UTF-8 encoding for the unicode character u+00E9). now consider the following test script, where the generated digests are compared with the one generated by the `sha256sum' command.

Download digest-sha.tgz
application/octet-stream 490b

-------------------------------------------------------------------------------------
use strict;
use warnings;
use Digest::SHA;
use Devel::Peek;

my $separator = ('-' x 85)."\n";
my ($fh, $utf8, $sha256_sys, $sha256_perl);

print STDERR $separator;
($sha256_sys) = split /\s+/, `sha256sum UTF8`;
printf STDERR "%-20s % 50s\n", "sha256sum command:", $sha256_sys;
print STDERR $separator;

open $fh, 'UTF8';
$utf8 = <$fh>;
close $fh;
print STDERR "perl raw read, before SHA:\n";
Dump $utf8;
$sha256_perl = Digest::SHA->new(256)->add($utf8)->hexdigest;
printf STDERR "%-20s % 50s\n", "perl raw read:", $sha256_perl;
print STDERR "perl raw read, after SHA:\n";
Dump $utf8;
print STDERR $separator;

open $fh, '<:encoding(UTF-8)', 'UTF8';
$utf8 = <$fh>;
close $fh;
print STDERR "perl :utf8 read before SHA:\n";
Dump $utf8;
$sha256_perl = Digest::SHA->new(256)->add($utf8)->hexdigest;
printf STDERR "%-20s % 50s\n", "perl :utf8 read:", $sha256_perl;
print STDERR "perl :utf8 read after SHA:\n";
Dump $utf8;
print STDERR $separator;
-------------------------------------------------------------------------------------

the output is:

-------------------------------------------------------------------------------------
sha256sum command:   4a99557e4033c3539de2eb65472017cad5f9557f7a0625a09f1c3f6e2ba69c4c
-------------------------------------------------------------------------------------
perl raw read, before SHA:
SV = PV(0x9c130e8) at 0x9c24ec0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x9c92788 "\303\251"\0
  CUR = 2
  LEN = 80
perl raw read:       4a99557e4033c3539de2eb65472017cad5f9557f7a0625a09f1c3f6e2ba69c4c
perl raw read, after SHA:
SV = PV(0x9c130e8) at 0x9c24ec0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x9c92788 "\303\251"\0
  CUR = 2
  LEN = 80
-------------------------------------------------------------------------------------
perl :utf8 read before SHA:
SV = PV(0x9c130e8) at 0x9c24ec0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x9c92788 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 80
perl :utf8 read:     de2e331d891ae267a7009cb45b4e8830f170e0c937288ea2731a1941c7a53b0d
perl :utf8 read after SHA:
SV = PV(0x9c130e8) at 0x9c24ec0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x9c92788 "\351"\0
  CUR = 1
  LEN = 80
-------------------------------------------------------------------------------------

note that the following scalar, read from the file from an ':utf8'-enabled filehandle:

SV = PV(0x9c130e8) at 0x9c24ec0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x9c92788 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 80

... is the correct, perl-internal representation of the input. (this is also what for example XML::LibXML correctly yields, when an UTF-8 encoded document is parsed.)

the generated digest however, is wrong -- and you can see why; sv_utf8_downgrade(), probably called by SvPVbyte, has mangled the input:

SV = PV(0x9c130e8) at 0x9c24ec0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x9c92788 "\351"\0
  CUR = 1
  LEN = 80

i think replacing SvPV by SvPVbyte was probably a mistake; a digest module should likely not have any unicode awareness, and in particular, should not modify its input. it should be using SvPV.

regards,
Achim

PS: equally on:

Linux acdev 2.6.32-5-686 #1 SMP Wed Jan 11 12:29:30 UTC 2012 i686 GNU/Linux

Summary of my perl5 (revision 5 version 14 subversion 2) configuration:

  Platform:
    osname=linux, osvers=2.6.32-5-686, archname=i686-linux
    uname='linux acdev 2.6.32-5-686 #1 smp wed jan 11 12:29:30 utc 2012 i686 gnulinux '
    config_args='-des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.5', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib/../lib /usr/lib/../lib /lib /usr/lib /usr/lib64
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.11.2.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.11.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
  Compile-time options: PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP PERL_PRESERVE_IVUV
                        USE_LARGE_FILES USE_PERLIO USE_PERL_ATOF
  Built under linux
  Compiled at Jan 27 2012 16:45:11
  @INC:
    /usr/local/lib/perl5/site_perl/5.14.2/i686-linux
    /usr/local/lib/perl5/site_perl/5.14.2
    /usr/local/lib/perl5/5.14.2/i686-linux
    /usr/local/lib/perl5/5.14.2
    /usr/local/lib/perl5/site_perl

and:

Linux devbaer 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux

Summary of my perl5 (revision 5 version 18 subversion 2) configuration:

  Platform:
    osname=linux, osvers=3.2.0-4-amd64, archname=x86_64-linux-gnu-thread-multi
    uname='linux perlbaer7 3.2.0-4-amd64 #1 smp debian 3.2.46-1 x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Dldflags=-Wl,-rpath=/opt/perl-5.18.2/lib/5.18.2/CORE -Wl,-z,relro -Dlddlflags=-shared -Wl,-rpath=/opt/perl-5.18.2/lib/5.18.2/CORE -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/opt/perl-5.18.2 -Dprivlib=/opt/perl-5.18.2/share/5.18.2 -Darchlib=/opt/perl-5.18.2/lib/5.18.2 -Dvendorprefix=/opt/perl-5.18.2 -Dvendorlib=/opt/perl-5.18.2/share/perl5 -Dvendorarch=/opt/perl-5.18.2/lib/perl5 -Dsiteprefix=/opt/perl-5.18.2 -Dsitelib=/opt/perl-5.18.2/share/5.18.2 -Dsitearch=/opt/perl-5.18.2/lib/5.18.2 -Dman1dir=/opt/perl-5.18.2/man/man1 -Dman3dir=/opt/perl-5.18.2/man/man3 -Dsiteman1dir=/opt/perl-5.18.2/man/man1 -Dsiteman3dir=/opt/perl-5.18.2/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.18.2 -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.7.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-rpath=/opt/perl-5.18.2/lib/5.18.2/CORE -Wl,-z,relro -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=, so=so, useshrplib=true, libperl=libperl.so.5.18.2
    gnulibc_version='2.13'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/opt/perl-5.18.2/lib/5.18.2/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -Wl,-rpath=/opt/perl-5.18.2/lib/5.18.2/CORE -Wl,-z,relro -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
  Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS PERL_DONT_CREATE_GVSV
                        PERL_HASH_FUNC_ONE_AT_A_TIME_HARD PERL_IMPLICIT_CONTEXT
                        PERL_MALLOC_WRAP PERL_PRESERVE_IVUV PERL_SAWAMPERSAND
                        USE_64_BIT_ALL USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
                        USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
                        USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF USE_REENTRANT_API
  Built under linux
  Compiled at Jan 9 2014 12:54:46
  @INC:
    /opt/perl-5.18.2/lib/5.18.2
    /opt/perl-5.18.2/share/5.18.2
    /opt/perl-5.18.2/lib/perl5
    /opt/perl-5.18.2/share/perl5
    /opt/perl-5.18.2/lib/5.18.2
    /opt/perl-5.18.2/share/5.18.2

Wed Feb 19 04:54:34 2014 mshelor [...] cpan.org - Correspondence added 30 min

RT-Send-CC:

solutions [...] overmeer.net

Thank you for taking the time to compose these thoughtful remarks, analyses, and code samples.

I very much agree with your notion that (in an ideal world, at least) module developers shouldn't have to be explicitly concerned with Unicode, I battled against including it, and relented only after convincing myself that it was necessary to do so in order to remain consistent with Perl's default character semantics as of version 5.6.

The reason it's necessary to use SvPVbyte is to ensure that the *same* digest will be calculated for a file containing, using your example, the single letter é, regardless of whether it's represented using the single byte value 0xe9, or using the two byte values 0xc3 0xa9 (with UTF8 flag set) corresponding to this letter's Unicode UTF-8 encoding.

This guarantees that all UTF-8 encoded Unicode files with code point values less than 256 will be semantically equivalent to their corresponding pure latin-1 versions as far as digest computation goes. And the only way to accommodate that is to use SvPVbyte, which employs utf8::downgrade to any data marked as UTF-8.

Note also that the 'shasum' command (which uses Digest::SHA underneath) computes exactly the same result as the GNU coreutils 'sha256sum' command, as expected:

$ sha256sum UTF8
4a99557e4033c3539de2eb65472017cad5f9557f7a0625a09f1c3f6e2ba69c4c  UTF8

$ shasum -a 256 UTF8
4a99557e4033c3539de2eb65472017cad5f9557f7a0625a09f1c3f6e2ba69c4c  UTF8

So it's not accurate to say that Digest::SHA mangles the data. It leaves the binary contents of files intact unless the programmer explicitly overrides this by opening the file with alternate IO layers (e.g. "<:encoding(UTF-8)").

What you're actually arguing is that Perl should have adopted byte semantics rather than character semantics back in 5.6 when that decision was made. But that decision is long past. And easy to accommodate once you understand and get used to it.
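The equivalence described in this reply can be checked directly. The following is a minimal sketch, assuming Digest::SHA 5.74 or later (the SvPVbyte behaviour discussed in this ticket); the variable names are illustrative only. It hashes the one-character string é in both of Perl's internal representations, and separately hashes the two-octet byte string 0xC3 0xA9.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# The same one-character string (U+00E9) in Perl's two internal forms.
my $downgraded = "\xe9";      # stored as the single byte 0xE9, UTF8 flag off
my $upgraded   = "\xe9";
utf8::upgrade($upgraded);     # stored as 0xC3 0xA9, UTF8 flag on

# A different, two-character byte string: the UTF-8 octets of U+00E9.
my $octets = "\xc3\xa9";      # UTF8 flag off

my $d_down   = sha256_hex($downgraded);
my $d_up     = sha256_hex($upgraded);
my $d_octets = sha256_hex($octets);

print "downgraded: $d_down\n";
print "upgraded:   $d_up\n";
print "octets:     $d_octets\n";
print "both forms of the string agree: ", ($d_down eq $d_up   ? "yes" : "no"), "\n";
print "string matches raw octets:      ", ($d_up   eq $d_octets ? "yes" : "no"), "\n";
```

The first two digests agree because both scalars hold the same character string; the third differs because, under character semantics, it is a different string. An external tool such as sha256sum hashes the octets on disk, which corresponds to the third case.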

Wed Feb 19 04:54:34 2014 The RT System itself - Status changed from 'new' to 'open'

Wed Feb 19 04:54:35 2014 mshelor [...] cpan.org - Status changed from 'open' to 'rejected'

Wed Feb 19 04:54:35 2014 mshelor [...] cpan.org - Taken

Wed Feb 19 05:31:53 2014 solutions [...] overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #93139] Digest::SHA / unicode: the use of SvPVbyte instead of SvPV, mangles the data of correct, UTF-8 enabled scalars
Date:	Wed, 19 Feb 2014 11:31:34 +0100
To:	Mark Shelor via RT <bug-Digest-SHA [...] rt.cpan.org>
From:	Mark Overmeer <solutions [...] overmeer.net>

* Mark Shelor via RT (bug-Digest-SHA@rt.cpan.org) [140219 09:54]:

> The reason it's necessary to use SvPVbyte is to ensure that the *same*
> digest will be calculated for a file containing, using your example,
> the single letter é, regardless of whether it's represented using the
> single byte value 0xe9, or using the two byte values 0xc3 0xa9 (with
> UTF8 flag set) corresponding to this letter's Unicode UTF-8 encoding.

And this is exactly the misconception that this report is about.

In my application, I need to change the SHA in SOAP-XML messages. This SHA is calculated outside Perl, XML is utf8. My application got broken because of this "smart downgrade": even the tiniest change will change the outcome!

IMHO, SHA is a bit-wise operator: has nothing to do with strings. The parameter should be interpreted as bytes and croak when it sees utf8: it is up to the user to encode() before calling ::SHA, if the user thinks it is acceptable. Maybe add a new add_bytes() which behaves that way?

--
Regards, MarkOv
------------------------------------------------------------------------
Mark Overmeer MSc                                       MARKOV Solutions
Mark@Overmeer.net                                 solutions@overmeer.net
http://Mark.Overmeer.net                   http://solutions.overmeer.net
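The "bytes only, croak on utf8" behaviour proposed here can be approximated in user code. The sketch below is an illustration under assumptions, not an existing Digest::SHA API: sha256_hex_bytes is a hypothetical helper name, and the UTF8-flag test is the criterion suggested in the message above.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Digest::SHA qw(sha256_hex);
use Encode qw(encode);

# Hypothetical helper (not part of Digest::SHA): hash octets only and
# refuse any scalar that still carries the UTF8 flag.
sub sha256_hex_bytes {
    my ($octets) = @_;
    croak "sha256_hex_bytes: character string passed, encode() it first"
        if utf8::is_utf8($octets);
    return sha256_hex($octets);
}

# A character string (UTF8 flag on, contains a wide character).
my $text = "caf\x{e9} \x{263A}";

# Passing the character string croaks, so $ok stays undefined.
my $ok = eval { sha256_hex_bytes($text); 1 };
print "character string rejected\n" unless $ok;

# Encoding explicitly yields octets, which the helper accepts.
my $digest = sha256_hex_bytes(encode('UTF-8', $text));
print "$digest\n";
```

One caveat of this design: the UTF8 flag describes storage, not content, so an ASCII-only scalar that happens to be flagged would also be rejected.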

Wed Feb 19 06:41:09 2014 mshelor [...] cpan.org - Correspondence added

RT-Send-CC:

solutions [...] overmeer.net

On Wed Feb 19 05:31:53 2014, solutions@overmeer.net wrote:

> * Mark Shelor via RT (bug-Digest-SHA@rt.cpan.org) [140219 09:54]:
> > The reason it's necessary to use SvPVbyte is to ensure that the *same*
> > digest will be calculated for a file containing, using your example,
> > the single letter é, regardless of whether it's represented using the
> > single byte value 0xe9, or using the two byte values 0xc3 0xa9 (with
> > UTF8 flag set) corresponding to this letter's Unicode UTF-8 encoding.
>
> And this is exactly the misconception that this report is about.
>
> In my application, I need to change the SHA in SOAP-XML messages. This
> SHA is calculated outside Perl, XML is utf8. My application got broken
> because of this "smart downgrade": even the tiniest change will change
> the outcome!
>
> IMHO, SHA is a bit-wise operator: has nothing to do with strings.
> The parameter should be interpreted as bytes and croak when it sees
> utf8: it is up to the user to encode() before calling ::SHA, if the
> user thinks it is acceptable. Maybe add a new add_bytes() which
> behaves that way?

SHA is indeed a bit-wise specification. And full support for bitwise operations is available in Digest::SHA through the add_bits method if necessary.

However, the CPAN "Digest" family standardizes on the abstraction of string input (via the "add" method) to which Digest::SHA and all other Digest modules must conform. And strings, in turn, became an expanded concept (to include Unicode) starting with Perl 5.6.

This "character vs. byte semantics" issue has a long and contentious history if you research the bug archives for this and other Digest modules. However the resolution of the issue (via adoption of SvPVbyte) is now long established and accepted.

And this resolution does not limit the user in any way. Programmers are free to cook (or leave raw) their data in any way desired before feeding it to Digest::SHA, which will always do the expected thing in the Perl framework of character semantics. And there's always the "use bytes" pragma if you don't like Perl's default character semantics.

The desired data processing you mention is easily accomplished at the programmer end. Digest::SHA is a core module and already quite large ... new functionality is introduced only when it can't be done easily at the user end, or when it requires low-level access (such as the recent getstate/putstate methods).

Mark
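For readers unfamiliar with add_bits, a small sketch of what it accepts may help when comparing it with the add_bytes() idea raised earlier. It assumes the two calling forms given in the Digest::SHA documentation (a string of "0"/"1" characters, or packed data plus a bit count); the variable names are illustrative.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

my $octets = "ab";                          # the bytes 0x61 0x62

# One-argument form: a string made of "0" and "1" characters.
my $bits     = unpack("B*", $octets);       # "0110000101100010"
my $via_bits = Digest::SHA->new(256)->add_bits($bits)->hexdigest;

# Two-argument form: packed data plus the number of bits to consume.
my $via_packed = Digest::SHA->new(256)->add_bits($octets, 16)->hexdigest;

# Ordinary byte-oriented hashing of the same two bytes.
my $via_add = sha256_hex($octets);

print "bit string:  $via_bits\n";
print "packed data: $via_packed\n";
print "add():       $via_add\n";
```

For whole bytes all three routes produce the same digest. add_bits exists for messages whose length is not a multiple of eight bits, which is a separate concern from the character-versus-byte question in this ticket.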

Wed Feb 19 11:39:03 2014 achim.adam [...] univie.ac.at - Correspondence added

Subject:	Re: [rt.cpan.org #93139] Digest::SHA / unicode: the use of SvPVbyte instead of SvPV, mangles the data of correct, UTF-8 enabled scalars
Date:	Wed, 19 Feb 2014 17:38:51 +0100
To:	bug-Digest-SHA [...] rt.cpan.org
From:	Achim Adam <achim.adam [...] univie.ac.at>

On Feb 19, 2014, at 10:54 34, Mark Shelor via RT wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=93139 >
>
> Thank you for taking the time to compose these thoughtful remarks, analyses, and code samples.
>
> I very much agree with your notion that (in an ideal world, at least) module developers shouldn't have to be explicitly concerned with Unicode, I battled against including it, and relented only after convincing myself that it was necessary to do so in order to remain consistent with Perl's default character semantics as of version 5.6.
>
> The reason it's necessary to use SvPVbyte is to ensure that the *same* digest will be calculated for a file containing, using your example, the single letter é, regardless of whether it's represented using the single byte value 0xe9, or using the two byte values 0xc3 0xa9 (with UTF8 flag set) corresponding to this letter's Unicode UTF-8 encoding.
>
> This guarantees that all UTF-8 encoded Unicode files with code point values less than 256 will be semantically equivalent to their corresponding pure latin-1 versions as far as digest computation goes. And the only way to accommodate that is to use SvPVbyte, which employs utf8::downgrade to any data marked as UTF-8.
>
> Note also that the 'shasum' command (which uses Digest::SHA underneath) computes exactly the same result as the GNU coreutils 'sha256sum' command, as expected:
>
> $ sha256sum UTF8
> 4a99557e4033c3539de2eb65472017cad5f9557f7a0625a09f1c3f6e2ba69c4c  UTF8
>
> $ shasum -a 256 UTF8
> 4a99557e4033c3539de2eb65472017cad5f9557f7a0625a09f1c3f6e2ba69c4c  UTF8
>
> So it's not accurate to say that Digest::SHA mangles the data. It leaves the binary contents of files intact unless the programmer explicitly overrides this by opening the file with alternate IO layers (e.g. "<:encoding(UTF-8)").
>
> What you're actually arguing is that Perl should have adopted byte semantics rather than character semantics back in 5.6 when that decision was made. But that decision is long past. And easy to accommodate once you understand and get used to it.

that is indeed my position... as futile as that is now.

what you're effectively saying in my eyes, is that any unwitting user that inputs an UTF-8-marked scalar into Digest::SHA, will not only get the wrong digest for the "right" byte sequence, but have his scalar "broken" afterwards, too.

this seems highly impractical; and arguing that the user should be aware of such a behaviour; that it is self-evidently implied by perl's character semantics, does not convince me at all. SvPV would yield, overall, a vastly more predictable behaviour -- that, i'd regard as beyond discussion.

to sum up, i'm a 100% with Mark Overmeer on this: a digest module should operate on bytes, and perhaps croak when it sees the UTF-8 flag.

as a side note: you probably should at least upgrade the scalar back after you're done, like for example MIME::Base64's encode() function.

regards,
Achim

Wed Feb 19 20:49:57 2014 mshelor [...] cpan.org - Correspondence added

On Wed Feb 19 11:39:03 2014, achim.adam@univie.ac.at wrote:

> that is indeed my position... as futile as that is now.
> what you're effectively saying in my eyes, is that any unwitting user
> that inputs an UTF-8-marked scalar into Digest::SHA, will not only get
> the wrong digest for the "right" byte sequence, but have his scalar
> "broken" afterwards, too.
> this seems highly impractical; and arguing that the user should be
> aware of such a behaviour; that it is self-evidently implied by perl's
> character semantics, does not convince me at all.
> SvPV would yield, overall, a vastly more predictable behaviour --
> that, i'd regard as beyond discussion.
>
> to sum up, i'm a 100% with Mark Overmeer on this: a digest module
> should operate on bytes, and perhaps croak when it sees the UTF-8 flag.
>
> as a side note: you probably should at least upgrade the scalar back
> after you're done, like for example MIME::Base64's encode() function.
>
> regards,
> Achim

I certainly understand your frustrations, having shared them initially before coming to fuller grips with Unicode. The merging of Unicode into Perl was a vital step, but not a seamless one ... it has, does, and will continue to cause confusions and difficulties.

The single most important point to grasp is that UTF8-marked data is NOT to be regarded as a sequence of bytes; rather it's a Unicode string. If you prefer to regard it as a byte sequence, that's what the "use bytes" pragma is for.

Your suggestion to croak on all UTF8-marked data would amount to a deliberate exclusion of Unicode, which would upset pretty much everyone outside of the American sphere. Even the English wouldn't get their £'s worth :) Bear in mind though that all code points outside of Latin-1 Supplement WILL in fact cause croaking. But at least most of Europe with its umlauts, graves, and acutes is safe.

I do take your point, however, that Digest::SHA's 'add' causes Perl to represent its internal data differently, even though absolutely no change to the data's meaning occurs whatsoever. This is what bothered me most about the initial uses of SvPVbyte in Digest modules ... routines should NOT modify input data unless explicitly designed to do so.

Even though no real modification of meaning and use occurs in the case of SHA's 'add', it IS true that programs relying on details of the way Perl stores data internally could be affected. However, in most cases, it's questionable whether programs should ever be depending on Perl's internal representation details in the first place.

Nonetheless I'll consider adding code to 'upgrade' any input data that passed through 'downgrade' before digest processing. That solution has a pleasing symmetry, and so makes this discussion very worthwhile. I appreciate your comments.

Mark
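Two of the points in this reply can be observed from user code: code points above U+00FF cause the digest call to croak, and a caller who cares about the stored form of a scalar can hash a copy. This is a minimal sketch assuming the SvPVbyte-based behaviour discussed in this ticket; whether a given Digest::SHA release alters its argument is deliberately not assumed, which is why the example hashes a copy.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Code points up to U+00FF can be downgraded, so this is accepted
# even when the scalar is stored in UTF8-flagged form.
my $latin1 = "\xa3" . "5";              # pound sign followed by "5"
utf8::upgrade($latin1);
my $digest = sha256_hex($latin1);

# A code point above U+00FF cannot be downgraded, so the call croaks.
my $wide  = "\x{263A}";
my $ok    = eval { sha256_hex($wide); 1 };
my $error = $ok ? '' : $@;
print "wide character rejected: $error" unless $ok;

# Hashing a copy leaves the caller's scalar in whatever form it had.
my $original = "\xe9";
utf8::upgrade($original);
my $copy = $original;
sha256_hex($copy);
print "original still UTF8-flagged: ",
    (utf8::is_utf8($original) ? "yes" : "no"), "\n";
```

The copy costs one extra string allocation per call, which is usually negligible next to the hashing itself.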

Wed Feb 19 20:59:22 2014 mshelor [...] cpan.org - Forwarded Transaction #1328619 to MARKOV Solutions <solutions@overmeer.net>

Thu Feb 20 03:34:24 2014 MARKOV [...] cpan.org - Correspondence added

> I certainly understand your frustrations, having shared them initially
> before coming to fuller grips with Unicode. The merging of Unicode
> into Perl was a vital step, but not a seamless one ... it has, does,
> and will continue to cause confusions and difficulties.

Hi Mark,

No, no, the idea that SHA computation has anything to do with characters is deeply flawed. SHA computation is about check-summing bits. I exactly know how Perl works with strings, and my many modules prove that... there is no need to explain the difficulties of it.

The Unicode concept (or even: the character concept) has no place in ::SHA at all.

> I do take your point, however, that Digest::SHA's 'add' causes Perl
> to represent its internal data differently, even though absolutely no
> change to the data's meaning occurs whatsoever.

What is "the data's meaning"? How do you know how to interpret the data? Bitwise equivalence is the only sensible operation on data in the SHA context.

We lost about 3 man-days of work, when valid SHA's produced by external applications did not match the SHA produced by Digest::SHA anymore, because newer version of your module thinks to understand "the meaning" of our bits.

So, our current work-around is to turn off the utf8 flag before calling add(). Then, gladly, your routine does not touch the bytes. That the utf8 flag was 'on' in the data we pass in, was a bug in another library, which is gladly also being fixed.

The documentation of Digest::SHA agrees with us:

   $sha->add($data);              # feed data into stream

Because $data != $string your implementation disagrees with your SYNOPSIS. A possible doc fix:

   $sha->add($data);              # Digest::SHA <  5.74
   $sha->add($string);            # Digest::SHA >= 5.74

And then add:

   $sha->add_bytes($data);        # feed raw data
   $sha->add_string($string);     # feed text

The difference between add_bytes() and add_bits() is major. The add_bytes() should croak on utf8, or simply ignore that flag as pre 5.74.

In add_string()/add() put a big warning that this only works if both signer and validator are written in Perl and have Digest::SHA > 5.74. For all applications which will ever look at that sha's outcome, this must be true.

If you are not applying this change, then please leave this bug-report open so other victims can add their experiences to this thread.

[ I very much appreciate your work on such a complex core infrastructure for Perl. This is intended purely as a technical discussion. ]

--
Regards, MarkOv
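The workaround described above (clearing the utf8 flag before add()) might look roughly like the following. This is a sketch under assumptions: add_internal_octets is a hypothetical helper name, and the flagged scalar is built by hand to stand in for what an XML parser might return. It operates on a copy so the caller's data is left alone.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: feed the scalar's UTF-8 octets to add() when it is
# UTF8-flagged, which is in effect what the pre-5.74 SvPV-based code hashed.
sub add_internal_octets {
    my ($sha, $data) = @_;                         # $data is a private copy
    utf8::encode($data) if utf8::is_utf8($data);   # same buffer bytes, flag cleared
    return $sha->add($data);
}

my $from_file = "\xc3\xa9";     # raw octets, as read from disk without layers
my $from_xml  = "\xe9";
utf8::upgrade($from_xml);       # flagged character string, stored as 0xC3 0xA9

my $d_file = Digest::SHA->new(256)->add($from_file)->hexdigest;
my $d_xml  = add_internal_octets(Digest::SHA->new(256), $from_xml)->hexdigest;

print "file octets:    $d_file\n";
print "flagged scalar: $d_xml\n";
print "digests agree:  ", ($d_file eq $d_xml ? "yes" : "no"), "\n";
```

Because the helper branches on the UTF8 flag, its result depends on how Perl happens to store the scalar. Where the intended encoding is known, calling Encode::encode('UTF-8', ...) explicitly before add() expresses the same intent without that dependency.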

Thu Feb 20 04:49:14 2014 mshelor [...] cpan.org - Correspondence added

RT-Send-CC:

solutions [...] overmeer.net

> What is "the data's meaning"? How do you know how to interpret the
> data? Bitwise equivalence is the only sensible operation on data in
> the SHA context.
>
> We lost about 3 man-days of work, when valid SHA's produced by external
> applications did not match the SHA produced by Digest::SHA anymore,
> because newer version of your module thinks to understand "the meaning"
> of our bits.

The loss of work effort is certainly regrettable. There's a cost to adopting Unicode, and we all have to pay it ... often in ways that we don't anticipate.

I first became aware of this issue when helping a Swedish developer debug an issue he had with Digest::SHA1. A quick session with Devel::Peek identified the source of his problem: Digest::SHA1 was performing an internal 'downgrade' of data and causing the UTF8 flag on his data to be cleared. I was rather shocked that Gisle's SHA1 module was doing this, but came to understand the reason why. I then adapted my Digest::SHA module to follow suit.

I am quite serious and deliberate when using such a pompous phrase as "the data's meaning." Because this is EXACTLY what is meant by Perl's default adoption of character semantics. Choosing such semantics as the default is precisely the way by which Perl accomplishes integration of Unicode.

I've seen your impressive list of registered CPAN modules and appreciate your substantial contribution and programming expertise. I too have been at this game for many decades, and still experience the occasional frustration and loss of programming time. I regard this particular case as the price we all must pay for something very important to Perl and programming in general: namely, the adoption and support of Unicode. And I don't make this statement easily or lightly, given the fact that I'm an American rather than a European.

The fact that SHA is a bitwise specification is only of secondary importance; the primary factor is that it must fit in to the Digest hierarchy of modules, whose standard "add" method is designed to act on general data (ref. "Digest" documentation).

My awareness of the problems that can result from the use of Unicode in Digest::SHA was what prompted me earlier to add a special Unicode section to the module's documentation. If you refer to that section, you'll see the following remark:

   "Be aware that the digest routines silently convert UTF-8
   input into its equivalent byte sequence in the native encoding
   (cf. utf8::downgrade). This side effect influences only the way Perl
   stores the data internally, but otherwise leaves the actual value of
   the data intact."

I feel that this remark suffices to clarify the issue. Any further remarks are only likely to confuse rather than to enlighten. This is why I'm highly reluctant to add the extra documentation you propose, which could be quite intimidating and confusing to the majority of programmers who needn't be concerned with this issue.

And I'm very insistent about resolving and closing bug reports as they arise. Leaving them open is not at all reassuring to users, creates clutter, and possibly signals that the developer isn't serious about fixing problems.

Thu Feb 20 05:33:34 2014 Mark [...] Overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #93139] Digest::SHA / unicode: the use of SvPVbyte instead of SvPV, mangles the data of correct, UTF-8 enabled scalars
Date:	Thu, 20 Feb 2014 11:33:02 +0100
To:	Mark Shelor via RT <bug-Digest-SHA [...] rt.cpan.org>
From:	Mark Overmeer <mark [...] overmeer.net>

* Mark Shelor via RT (bug-Digest-SHA@rt.cpan.org) [140220 09:49]:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=93139 >
>
> I am quite serious and deliberate when using such a pompous phrase as
> "the data's meaning." Because this is EXACTLY what is meant by Perl's
> default adoption of character semantics. Choosing such semantics as
> the default is precisely the way by which Perl accomplishes integration
> of Unicode.

... but checksums are unrelated to characters... there is really no relation. So, you cannot defend this buggy behavior with Perl's concept of strings.

> "Be aware that the digest routines silently convert UTF-8
> input into its equivalent byte sequence in the native encoding
> (cf. utf8::downgrade). This side effect influences only the way Perl
> stores the data internally, but otherwise leaves the actual value of
> the data intact."

This documentation is incomplete. "This side effect also makes that SHA values you calculate in Perl incompatible with SHA values calculated with other programming languages and libraries." An other example (besides my SOAP-XML problem): some front-end Java application saves the SHA of passwords in the database (in an attempt to be secure). The back-end Perl program checks the password (which is utf8) and because of its "smart" magic produces a different SHA. So, validation fails. Same for the SHAs in the /etc/shadow. A last attempt to get an agreement on this issue. When you do not want to fix or extend the code, maybe you could fix the documentation... to start, change the title: Digest::SHA - Perl extension for SHA-1/224/256/384/512 --> Digest::SHA - calculate SHA on Perl strings And please change the use of $data into $string everywhere in the docs, to be consistent with other code modules, which careful differentiates between byte and string parameters. Show quoted text

> And I'm very insistent about resolving and closing bug reports as > they arise.

This bug is not resolved: the documentation and the functionality do not match. One should not close a ticket for an persisting bug. We have reach a stale-mate, I am afraid... so maybe we should suspend this thread for some time ;-) -- Regards, MarkOv ------------------------------------------------------------------------ Mark Overmeer MSc MARKOV Solutions Mark@Overmeer.net solutions@overmeer.net http://Mark.Overmeer.net http://solutions.overmeer.net

Thu Feb 20 10:10:39 2014 mshelor [...] cpan.org - Correspondence added

RT-Send-CC:

achim.adam [...] univie.ac.at

Having noticed the affiliation of the requester with the Universität Wien, I've made every effort to give this matter my time and respectful attention. However, there are so many factual errors and incautious remarks in your last message that it's becoming difficult to assess whether you're truly serious. For example,

> ... but checksums are unrelated to characters... there is really no
> relation. So, you cannot defend this buggy behavior with Perl's
> concept of strings.

First of all, you've failed to demonstrate any buggy behavior in the module. The behavior might not be convenient for your particular use, but it's documented and has a demonstrated long-term track record of reliability and broad usage throughout the world. So much so that it was adopted into the Perl core.

Secondly, the hash values produced by SHA are related to all and any type of digital data one can imagine, even partial-byte data. It specifies the most general type of input possible assuming the bit as the atomic (indivisible) data unit of computation. And Digest::SHA implements the NIST SHA standard in its full generality while complying with all usability standards set by the CPAN parent Digest module.

> This documentation is incomplete. "This side effect also makes
> the SHA values you calculate in Perl incompatible with SHA values
> calculated with other programming languages and libraries."

Again, you've failed to demonstrate any case for the documentation's alleged incompleteness. Moreover, the Digest::SHA implementation successfully passes every single one of the tens-of-thousands of test vectors comprising the NIST SHA Validation System (SHAVS), including the ones for bit-oriented data. But again, you've yet to show any case where the module computes an incorrect hash value.

> A last attempt to get an agreement on this issue. When you do not
> want to fix or extend the code, maybe you could fix the documentation...
> to start, change the title:
>
>     Digest::SHA - Perl extension for SHA-1/224/256/384/512
>     --> Digest::SHA - calculate SHA on Perl strings
>
> And please change the use of $data into $string everywhere in the docs,
> to be consistent with other code modules, which carefully differentiate
> between byte and string parameters.

I'm beginning to suspect you're not even familiar with the CPAN hierarchy of Digest modules, and how these modules conform to certain interface and naming standards for the sake of greater usability. If you care to check the documentation of the "Digest" module (which is the parent of all CPAN digest modules), you'll see that my use of "$data" corresponds EXACTLY to the documentation for that module, both in form and meaning.

> This bug is not resolved: the documentation and the functionality do
> not match. One should not close a ticket for a persisting bug.

A "bug" has not even been identified in this case, so there's little point in worrying over a resolution. I'm perfectly willing to consider all serious remarks and criticisms, but it's unproductive for me to respond further without more research and reflection on your part. For example, you could have spared yourself much effort by simply researching the past tickets related to this issue. Also, more carefully reviewing your remarks with your colleagues at UW is highly advisable, especially before dispatching them for the general public to see and judge.

Thu Feb 20 11:27:02 2014 achim.adam [...] univie.ac.at - Correspondence added

Subject:	Re: [rt.cpan.org #93139] Digest::SHA / unicode: the use of SvPVbyte instead of SvPV, mangles the data of correct, UTF-8 enabled scalars
Date:	Thu, 20 Feb 2014 17:26:47 +0100
To:	bug-Digest-SHA [...] rt.cpan.org
From:	Achim Adam <achim.adam [...] univie.ac.at>

> Having noticed the affiliation of the requester with the Universität Wien, I've made every effort to give this matter my time and respectful attention.

and i truly thank you for this, and for the effort and the work you are dedicating to the maintenance of Digest::SHA.

the remarks you are responding to here, are Mark Overmeer's, who worked with us on a project involving xmldsig -- not mine.

but never mind, since i happen to share his views; in our opinion, you have adopted a lofty, insular paradigm that *in practice* will lead to problems for 99% of end users, while catering to 1% of edge cases.
many modules that deal with unicode nowadays enable perl's utf-8 magic by default -- which to some extent is purported to be "transparent" to the end user, and supposed to just add to perl's internal knowledge about the meaning of a scalar's bytes.
in my opinion, your implementation violates the "Practical" part in perl's name.

regards,
-A


Thu Feb 20 14:53:39 2014 victor [...] vsespb.ru - Correspondence added

@all, MSHELOR is right, and you simply understand Perl strings the wrong way. Try reading the perl docs and this article: http://blogs.perl.org/users/aristotle/2011/08/utf8-flag.html

If strings are equal (perl "eq" operator), Digest::SHA will return the same digest. That's it.

The strings "\xE9" (without the utf8 flag) and "\xC3\xA9" (with the utf8 flag) are equal. If you see a string stored as "\xC3\xA9" (with the utf8 flag), that means it's the byte "\xE9" in binary context. If your code treats it as the bytes "\xC3\xA9", your code is wrong.
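
The "eq implies same digest" rule stated above can be checked directly. The following is a small sketch with illustrative variable names; it builds the same one-character string in both internal representations, plus the two-byte string that a raw (undecoded) read of a UTF-8 file would produce.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

my $downgraded = "\xE9";       # one character, stored as the single byte 0xE9
my $upgraded   = "\xE9";
utf8::upgrade($upgraded);      # same character, stored internally as 0xC3 0xA9

my $raw_utf8 = "\xC3\xA9";     # two characters (two bytes), no UTF8 flag

my $same_string = ($downgraded eq $upgraded) ? 1 : 0;
my $same_digest = (sha256_hex($downgraded) eq sha256_hex($upgraded)) ? 1 : 0;
my $raw_differs = ($upgraded ne $raw_utf8) ? 1 : 0;

print "downgraded eq upgraded: $same_string\n";
print "digests identical:      $same_digest\n";
print "raw UTF-8 bytes differ: $raw_differs\n";
```

The third line is the case from the original report: a file read without a decoding layer yields the two-byte string, which is a different Perl string and therefore gets a different digest.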

Thu Feb 20 22:25:53 2014 mshelor [...] cpan.org - Correspondence added

RT-Send-CC:

victor [...] vsespb.ru

On Thu Feb 20 11:27:02 2014, achim.adam@univie.ac.at wrote:

> and i truly thank you for this, and for the effort and the work you are
> dedicating to the maintenance of Digest::SHA.
>
> the remarks you are responding to here, are Mark Overmeer's, who worked
> with us on a project involving xmldsig -- not mine.
>
> but never mind, since i happen to share his views; in our opinion, you
> have adopted a lofty, insular paradigm that *in practice* will lead to
> problems for 99% of end users, while catering to 1% of edge cases.
> many modules that deal with unicode nowadays enable perl's utf-8 magic
> by default -- which to some extent is purported to be "transparent" to
> the end user, and supposed to just add to perl's internal knowledge
> about the meaning of a scalar's bytes.
> in my opinion, your implementation violates the "Practical" part in
> perl's name.
>
> regards,
> -A

Again, I understand your frustrations. Note that Digest::SHA isn't the source of the lofty, insular paradigm you speak about. Rather, that paradigm comes from Perl itself in its support for Unicode. My package is simply consistent with that paradigm.

What you're actually experiencing is the aggravation of the Perl/Unicode/UTF8 learning curve. No need to feel bad about it ... most of us belong to that club.

The article Victor points to is one of the best I've seen on the Perl/UTF8 issue. But as simple and short as the article is, most programmers won't absorb its full meaning until they gain practical (often painful) experience with Perl's integration of Unicode/UTF8.

Fri Feb 21 06:15:53 2014 mshelor [...] cpan.org - Correspondence added

On Wed Feb 19 11:39:03 2014, achim.adam@univie.ac.at wrote:

> as a side note: you probably should at least upgrade the scalar back
> after you're done, like for example MIME::Base64's encode() function.

I very much agree with the spirit of this idea, but was shocked at the performance penalty. Using Gisle's 'digest-bench' as a test (with the data changed to be UTF-8 instead of the original "a" .. "z"), the performance went from this

    $ perl /tmp/digest-bench Digest::SHA
    f26dc3da8e2b830fd4e8e6380f04286fcd4144a5
    33554432/0.159559965133667
    Digest::SHA 5.87 200.55 MB/s

to this

    $ perl -Mblib /tmp/digest-bench Digest::SHA
    f26dc3da8e2b830fd4e8e6380f04286fcd4144a5
    33554432/1.62478184700012
    Digest::SHA 5.88 19.69 MB/s

In other words, upgrading the data back to its original form slows the SHA processing down by a factor of 10.

Digest::SHA is a workhorse, and many users (as relayed through emails) in the security, financial, and archival worlds simply couldn't (and wouldn't) tolerate such a performance hit.

So this change will NOT be introduced.

Fri Feb 21 06:38:38 2014 victor [...] vsespb.ru - Correspondence added

On Wed Feb 19 20:39:03 2014, achim.adam@univie.ac.at wrote:

> as a side note: you probably should at least upgrade the scalar back
> after you're done, like for example MIME::Base64's encode() function.

if you do care about the upgraded/downgraded string form (normally, you should not), you can stringify the arguments passed to Digest::SHA:

    Digest::SHA::sha256_hex( "$bytes" )

instead of

    Digest::SHA::sha256_hex( $bytes )

that (possibly) leads to some performance penalty (and additional memory usage).

but anyway, if you care about performance (i.e. $bytes is a huge scalar), and this is a performance bug for your case, you should not have $bytes in upgraded form in the first place (because it takes more memory), thus you should utf8::downgrade your $bytes as early as possible (if you suspect it can be in upgraded form).

stringifying function arguments is an often-used technique. for example somefunc($1), somefunc($!), somefunc($@) can lead to weird side effects (depending on the somefunc() implementation), and when those effects are discovered, the caller is often considered responsible for the bug, and it's advised to use somefunc("$1"), somefunc("$!") or somefunc($!+0) etc. instead.

Digest::SHA is often used with huge amounts of data (I would say, more often than MIME::Base64), thus copying its arguments (to make sure they are unmodified) or upgrading them back for all users will lead to performance penalties.
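
The stringification workaround suggested above can be sketched as follows. The variable names are illustrative; the point is only that interpolating the scalar hands a temporary copy to the digest function, so any in-place downgrade affects the copy rather than the caller's variable.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

my $bytes = "r\x{e9}sum\x{e9}";
utf8::upgrade($bytes);               # caller's scalar is in upgraded form

# "$bytes" creates a temporary copy; the original scalar is not passed.
my $digest = sha256_hex("$bytes");

my $still_upgraded = utf8::is_utf8($bytes) ? 1 : 0;
print "digest:         $digest\n";
print "still upgraded: $still_upgraded\n";
```

As noted in the message, the copy costs time and memory proportional to the size of the data, so this is a convenience for small inputs rather than something to apply to large buffers.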

Sat Feb 22 22:52:39 2014 achim.adam [...] univie.ac.at - Correspondence added

Subject:	Re: [rt.cpan.org #93139] Digest::SHA / unicode: the use of SvPVbyte instead of SvPV, mangles the data of correct, UTF-8 enabled scalars
Date:	Sun, 23 Feb 2014 04:52:25 +0100
To:	bug-Digest-SHA [...] rt.cpan.org
From:	Achim Adam <achim.adam [...] univie.ac.at>

Hi Mark and Victor, thank you again for sharing your time and extensive knowledge.

i understand; thank you for the time you took to explore this. the particular matter, in any case, is of relatively little consequence.

on the actual issue: once one understands that perl's character-semantics-enabled scalar is merely an "abstract entity", it is true that your current implementation can be regarded as "consistent" with the concept.
but just as you argue that the responsibility of verifying the properties of a scalar with regard to character semantics before submitting it to Digest::SHA (or any algorithm that is customarily byte-based, for that matter) lies with the programmer, one could have argued that the programmer must expressly encode a scalar prior to submitting it to a strictly byte-based Digest::SHA, which otherwise croaks with an "ambiguous input" error.
the latter seems equally "consistent" to me; the path followed was thus a matter of choice.
a choice that i personally regret, since i believe it will ultimately turn out to be a steady source of pitfalls for programmers.

thanks & regards,
Achim

Sun Feb 23 06:07:35 2014 mshelor [...] cpan.org - Correspondence added

RT-Send-CC:

victor [...] vsespb.ru

On Sat Feb 22 22:52:39 2014, achim.adam@univie.ac.at wrote:

> it is true that your current implementation can be regarded as
> "consistent" with the concept.
> but just as you argue that the responsibility of verifying the
> properties of a scalar with regard to character semantics before
> submitting it to Digest::SHA (or any algorithm that is customarily
> byte-based, for that matter) lies with the programmer, one could
> have argued that the programmer must expressly encode a scalar prior
> to submitting it to a strictly byte-based Digest::SHA, which otherwise
> croaks with an "ambiguous input" error.
> the latter seems equally "consistent" to me; the path followed was
> thus a matter of choice.
> a choice that i personally regret, since i believe it will ultimately
> turn out to be a steady source of pitfalls for programmers.

Let's look into the details of what you're saying.

I can't agree at all that the path followed was a matter of choice. Victor stated it concisely: in Perl, if two pieces of data are equal under the "eq" operator, then their digests MUST be equal as well.

Otherwise the use of message digests for fingerprinting and authenticating data would be completely undermined. This is why it's mandatory (i.e. not a matter of choice) to normalize the data representation before calculating digests. And in Perl, the consistent way to normalize this representation is by performing utf8::downgrade on the data.

Now it IS true (as you say) that the SHA specification operates on byte buffers: 64-byte buffers for SHA-1/224/256, and 128-byte buffers for the rest. And since byte values are constrained to be in the range 0..255, the Digest::SHA routines will croak if you attempt to feed in a wide character whose ordinal value is greater than 255. Such a character makes no sense in the context of SHA which, as you correctly say, is byte-oriented. So croaking under such circumstances is the appropriate thing to do, i.e. not a matter of choice.

To be clear, the error message is actually "Wide character in subroutine entry" rather than the one you mention: the word "Ambiguous" shows up only in the high-level shasum script when two or more incompatible file modes are invoked. If you're getting an "ambiguous input" error, then it's coming from somewhere else.

All of this shows that a programmer needs to be aware of the data that's being fed into SHA. It's always possible to render any Unicode document into a stream of bytes via Unicode/UTF-8 encoding, so it's possible to compute the SHA digest of any Unicode document, even those containing so-called wide characters with ordinal values greater than 255.

Regards, Mark
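
Both halves of the last point can be illustrated with a short sketch: a string containing a code point above 255 is rejected when passed directly, and the same string can be digested once it has been explicitly encoded to bytes. The variable names are illustrative, and the error text is matched loosely because the exact wording comes from Perl itself.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);
use Encode qw(encode);

my $heart = "I \x{2665} Unicode";    # U+2665 has an ordinal value above 255

# Passing the character string directly is expected to croak.
my $direct = eval { sha1_hex($heart) };
my $error  = $@;
print "direct call failed: $error" unless defined $direct;

# Encoding first turns the characters into bytes, which SHA can consume.
my $utf8_digest = sha1_hex(encode('UTF-8', $heart));
print "digest of UTF-8 bytes: $utf8_digest\n";
```

Note that the second digest is the digest of the UTF-8 byte sequence, which is what byte-oriented tools such as sha1sum would compute for a file holding that encoding.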

Sun Feb 23 19:20:41 2014 achim.adam [...] univie.ac.at - Correspondence added

Subject:	Re: [rt.cpan.org #93139] Digest::SHA / unicode: the use of SvPVbyte instead of SvPV, mangles the data of correct, UTF-8 enabled scalars
Date:	Mon, 24 Feb 2014 01:20:26 +0100
To:	bug-Digest-SHA [...] rt.cpan.org
From:	Achim Adam <achim.adam [...] univie.ac.at>


i'm afraid i bungled the grammar on my last statement to such an extent that it became unintelligible. i must apologize for that.

i meant to outline an alternative paradigm for byte-oriented modules, suggesting that they should plainly reject character-semantics-enabled scalars on the grounds that their byte content is ambiguous in their specific context.
you would certainly argue that it isn't ambiguous at all, and you would be right within perl's unicode concept.

the practical problem i see lies in the extremely widespread use of the UTF-8 encoding (web, XML etc.) -- and the fact that in perl, to take our previous example, "\xc3\xa9" (plain) is not equal to "\xc3\xa9" (utf-8 flag).
when working with the UTF-8 encoding, your current paradigm effectively compels the programmer to inspect every scalar prior to submitting it, to see if it has been character-semantics-enabled somewhere along the way (which seems to be a common practice -- probably encouraged by the fact that coincidentally, a UTF-8-encoded source's bytes will be identical to perl's internal representation of its unicode characters).

this "effect" leads me to think that rejecting character semantics altogether in byte-oriented modules (by croaking), thereby alerting programmers and forcing them to explicitly encode any character-semantics-enabled scalars they meant to submit, would be a "safer", more "practical" approach.
but you helped me understand why such a proposition is probably a lost cause.

regards & thanks
-A
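
The alternative paradigm outlined above could be approximated in user code with a small wrapper. The sketch below is purely hypothetical: strict_sha256_hex is not part of Digest::SHA, and using the UTF8 flag as the test is an assumption made for illustration (the flag describes internal storage, so an ASCII-only string that has been decoded would be rejected as well).

```perl
use strict;
use warnings;
use Carp qw(croak);
use Digest::SHA qw(sha256_hex);
use Encode qw(encode);

# Hypothetical helper, not part of Digest::SHA: refuse UTF8-flagged scalars
# so that the caller has to encode character strings explicitly.
sub strict_sha256_hex {
    my ($octets) = @_;
    croak "character string passed where octets were expected; encode it first"
        if utf8::is_utf8($octets);
    return sha256_hex($octets);
}

my $chars = "\x{e9}";
utf8::upgrade($chars);

my $rejected = eval { strict_sha256_hex($chars) };
my $error    = $@;
print "rejected: $error" unless defined $rejected;

# After explicit encoding the helper accepts the data.
my $digest = strict_sha256_hex(encode('UTF-8', $chars));
print "digest of UTF-8 bytes: $digest\n";
```

A caller who wants the digest of the Latin-1 form instead would pass encode('ISO-8859-1', $chars); the wrapper only forces that decision to be made explicitly.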

Mon Feb 24 06:09:10 2014 mshelor [...] cpan.org - Correspondence added

On Sun Feb 23 19:20:41 2014, achim.adam@univie.ac.at wrote:

> this "effect" leads me to think that rejecting character semantics
> altogether in byte-oriented modules (by croaking), thereby alerting
> programmers and forcing them to explicitly encode any
> character-semantics-enabled scalars they meant to submit, would be a
> "safer", more "practical" approach.

This is the approach taken by Python. It is indeed safer and forces the programmer to be more explicit, thereby catching potential errors earlier. Python's hashlib is in fact much stricter in that it won't allow any Unicode strings containing code points above 127, viz.

    >>> import hashlib
    >>> s = u'abc'
    >>> hashlib.sha1(s).hexdigest()
    'a9993e364706816aba3e25717850c26c9cd0d89d'
    >>> s = u'abc' + unichr(128)
    >>> hashlib.sha1(s).hexdigest()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 3: ordinal not in range(128)

whereas CPAN Digest modules are a bit more permissive by allowing code points up to 255 before croaking. This is typical: Perl tends to be more permissive than Python.

The Python approach avoids precisely the confusion that arose in your case. But the Perl approach offers greater power and convenience despite the dangers. For example, the overwhelming number of documents in the myriad number of Western languages can be represented in either plain latin-1 encoding or Unicode/UTF-8, and the Digest modules will compute the same hash values for each pair, provided they're opened in the mode appropriate to their encoding (e.g. "<:encoding(utf8)").

Since it's often unpredictable in advance which encoding (i.e. straight latin-1 or Unicode/UTF-8) will be used when constructing and transferring a document, this kind of flexibility is extremely useful. Perl simply does the right thing automatically.
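
The encoding-independence claim in the last two paragraphs can be sketched without touching the filesystem by decoding two byte strings in place of opening two files with different layers. The sample text and variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);
use Encode qw(decode);

# The same text as it would sit on disk in two different encodings.
my $latin1_bytes = "na\xEFve caf\xE9";
my $utf8_bytes   = "na\xC3\xAFve caf\xC3\xA9";

# Decoding plays the role of opening each file with the matching layer.
my $from_latin1 = decode('ISO-8859-1', $latin1_bytes);
my $from_utf8   = decode('UTF-8',      $utf8_bytes);

my $digest_latin1 = sha1_hex($from_latin1);
my $digest_utf8   = sha1_hex($from_utf8);
my $digest_raw    = sha1_hex($utf8_bytes);   # undecoded UTF-8 bytes

print "decoded digests match:      ", ($digest_latin1 eq $digest_utf8 ? 1 : 0), "\n";
print "raw UTF-8 digest differs:   ", ($digest_raw ne $digest_utf8 ? 1 : 0), "\n";
```

The second line shows the flip side: the digest of the decoded text is not the digest of the UTF-8 file's bytes, which is the mismatch the original report ran into when comparing against sha256sum.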

Tue Sep 30 12:48:12 2014 dmuey [...] cpan.org - Correspondence added

This was very interesting to me (I’m the “I ♥ Unicode” guy ;p).

I understand some folks don’t see a bug so I looked into it and found a nice way to demonstrate it.

The attached script demonstrates how Unicode strings with characters > 127 and < 256 are indeed mangled:

- the data is corrupt after (one byte goes missing)
- it is no longer a Unicode string
- the hash is wrong

… while the rest (bytes or Unicode w/ > 255 character) work like you’d expect.

multivac:~ dmuey$ perl tmp/d_sha

Digest::SHA v5.92 on perl v5.016000
SHA in hex of test string is: 278ac637d0cb8707ce95d2f21c5c2ad60b2db354

Unicode (latin1) String before:
    dumper : $VAR1 = "X \x{e4} Z";
    chr count: 5
    byte size: 6
    is unicode: 1
[sha1_hex] f34e2a11492d60a613b271b2f75de906bf38ba93
Unicode (latin1) String after:
    dumper : $VAR1 = 'X ? Z';
    chr count: 5
    byte size: 5
    is unicode: 0

Bytes String before:
    dumper : $VAR1 = 'X ä Z';
    chr count: 6
    byte size: 6
    is unicode: 0
[sha1_hex]: 278ac637d0cb8707ce95d2f21c5c2ad60b2db354
Bytes String after:
    dumper : $VAR1 = 'X ä Z';
    chr count: 6
    byte size: 6
    is unicode: 0

Uncode (>latin1) String before:
    dumper : $VAR1 = "X \x{2665} Z";
    chr count: 5
    byte size: 7
    is unicode: 1
[sha1_hex] died (as expected):
    Wide character in subroutine entry at tmp/d_sha line 32.
Uncode (>latin1) String after:
    dumper : $VAR1 = "X \x{2665} Z";
    chr count: 5
    byte size: 7
    is unicode: 1
multivac:~ dmuey$

HTH!

Subject:

d_sha

Download d_sha
application/octet-stream 1.5k

Message body not shown because it is not plain text.
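Since the attached d_sha script is not shown inline, here is a rough sketch of the kind of before/after inspection it reports. This is not the attached script: the describe() helper and the variable names are hypothetical, and whether the scalar changes after the call may depend on the Digest::SHA version installed.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Hypothetical helper: report character count, internal byte size and
# the state of the UTF-8 flag for a scalar.
sub describe {
    my ($label, $str) = @_;
    my $bytes = do { use bytes; length($str) };
    printf "%-7s chr count: %d  byte size: %d  is unicode: %d\n",
        $label, length($str), $bytes, utf8::is_utf8($str) ? 1 : 0;
    return;
}

# A 5-character string whose middle character lies between 128 and 255,
# forced into Perl's internal UTF-8 representation.
my $string = "X \x{e4} Z";
utf8::upgrade($string);

my $chars_before = length($string);
my $flag_before  = utf8::is_utf8($string) ? 1 : 0;

describe('before', $string);
my $digest = sha1_hex($string);    # the call under discussion
describe('after', $string);

print "sha1_hex: $digest\n";
```

Comparing the two describe() lines shows whether the call altered the caller's scalar on a given installation.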

Tue Sep 30 15:01:22 2014 mshelor [...] cpan.org - Correspondence added

RT-Send-CC:

victor [...] vsespb.ru

On Tue Sep 30 12:48:12 2014, DMUEY wrote:

> This was very interesting to me (I’m the “I ♥ Unicode” guy ;p).
>
> I understand some folks don’t see a bug so I looked into it and found
> a nice way to demonstrate it.
>
> The attached script demonstrates how a Unicode string with >127 < 256
> are indeed mangled:
>
> - the data is corrupt after (one byte goes missing)
> - it is no longer a Unicode string
> - the hash is wrong
>
> … while the rest (bytes or Unicode w/ > 255 character) work like you’d
> expect.
>
> multivac:~ dmuey$ perl tmp/d_sha
>
> Digest::SHA v5.92 on perl v5.016000
> SHA in hex of test string is:
> 278ac637d0cb8707ce95d2f21c5c2ad60b2db354
>
> Unicode (latin1) String before:
> dumper : $VAR1 = "X \x{e4} Z";
> chr count: 5
> byte size: 6
> is unicode: 1
> [sha1_hex] f34e2a11492d60a613b271b2f75de906bf38ba93
> Unicode (latin1) String after:
> dumper : $VAR1 = 'X ? Z';
> chr count: 5
> byte size: 5
> is unicode: 0
>
> Bytes String before:
> dumper : $VAR1 = 'X ä Z';
> chr count: 6
> byte size: 6
> is unicode: 0
> [sha1_hex]: 278ac637d0cb8707ce95d2f21c5c2ad60b2db354
> Bytes String after:
> dumper : $VAR1 = 'X ä Z';
> chr count: 6
> byte size: 6
> is unicode: 0
>
> Uncode (>latin1) String before:
> dumper : $VAR1 = "X \x{2665} Z";
> chr count: 5
> byte size: 7
> is unicode: 1
> [sha1_hex] died (as expected):
> Wide character in subroutine entry at tmp/d_sha line 32.
> Uncode (>latin1) String after:
> dumper : $VAR1 = "X \x{2665} Z";
> chr count: 5
> byte size: 7
> is unicode: 1
> multivac:~ dmuey$
>
> HTH!

There's no bug that I can see. The SHA-1 hash value for the Unicode string "X \x{e4} Z" is indeed the value computed by the Digest::SHA module: viz. f34e2a11492d60a613b271b2f75de906bf38ba93.

But you do help to highlight an important and subtle point. The point you're overlooking is how Unicode strings MUST be handled by the SHA module to remain consistent with Perl's default character semantics. The relevant paragraph from the Digest::SHA documentation is:

"The rule by which Digest::SHA handles a Unicode string is easy to state, but potentially confusing to grasp: the string is interpreted as a sequence of byte values, where each byte value is equal to the ordinal value (viz. code point) of its corresponding Unicode character. That way, the Unicode string 'abc' has exactly the same digest value as the ordinary string 'abc'."

So, the 5-character Unicode string "X \x{e4} Z" is seen by Digest::SHA as a sequence of 5 byte values, each equal to the code point (or ordinal value) of the corresponding character.

The hash value you give (278ac637d0cb8707ce95d2f21c5c2ad60b2db354) is incorrect for this string; it is instead the hash value of the UTF-8 encoded version of that string, seen as a sequence of 6 bytes. And that precisely amounts to an explicit use of 'byte' semantics, which is NOT the Perl default.

Regards, Mark
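The rule quoted from the documentation can be spelled out in a few lines. This is only an illustrative sketch with made-up variable names: it builds a byte string with one byte per code point, hashes it alongside the Unicode string, and contrasts both with the explicitly UTF-8 encoded octets.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);
use Encode qw(encode);

# The 5-character Unicode string from the discussion.
my $chars = "X \x{e4} Z";
utf8::upgrade($chars);

# One byte per character, each byte equal to that character's code point.
my $ordinal_bytes = pack 'C*', map { ord } split //, $chars;

# Explicit 'byte' semantics: the 6 octets of the UTF-8 encoding.
my $utf8_octets = encode('UTF-8', $chars);

my $digest_chars   = sha1_hex($chars);
my $digest_ordinal = sha1_hex($ordinal_bytes);
my $digest_utf8    = sha1_hex($utf8_octets);

print "Unicode string    : $digest_chars\n";
print "ordinal-value bytes: $digest_ordinal\n";
print "UTF-8 octets      : $digest_utf8\n";

# A code point above 255 has no single-byte value, so the call croaks.
my $wide_ok    = eval { sha1_hex("X \x{2665} Z"); 1 };
my $wide_error = $wide_ok ? '' : $@;
print "wide string       : ", ($wide_ok ? "hashed" : "died: $wide_error");
```

The first two digests should match each other, and the third should be the different value discussed above. Callers who want the digest of the UTF-8 representation, as in the xmldsig case, would encode explicitly before hashing.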