Skip Menu |

This queue is for tickets about the Digest-MD5 CPAN distribution.

Report information
The Basics
Id: 44927
Status: resolved
Priority: 0/
Queue: Digest-MD5

People
Owner: Nobody in particular
Requestors: jesse [...] bestpractical.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



CC: Jesse Vincent <jesse [...] bestpractical.com>
Subject: [PATCH] Whatever your feeling on whether I'm misusing encode, the two new tests _should_ have the same output.
Date: Fri, 10 Apr 2009 13:12:51 -0400
To: bug-Digest-MD5 [...] rt.cpan.org
From: Jesse Vincent <jesse [...] bestpractical.com>
Right now, #4 passes for me on 5.8.9 and #5 fails. --- t/utf8.t | 17 ++++++++++++++++- 1 files changed, 16 insertions(+), 1 deletions(-) diff --git a/t/utf8.t b/t/utf8.t index 6cf68b7..a0cdb1f 100644 --- a/t/utf8.t +++ b/t/utf8.t @@ -7,7 +7,7 @@ BEGIN { } } -print "1..3\n"; +print "1..5\n"; use strict; use Digest::MD5 qw(md5_hex); @@ -33,3 +33,18 @@ print "ok 2\n"; # reference print "not " unless md5_hex("foo\xFF") eq $exp; print "ok 3\n"; + +# autopromotion + +my $unistring = "Oslo.pm har sosialt medlemsmøte onsdag 1. April 2008, klokken 18:30. Vi treffes på Marhaba Café, Keysersgate 1."; +use Devel::Peek; +use Encode; +$unistring = Encode::decode_utf8($unistring); +print "not " if ( not utf8::is_utf8($unistring)); +print "ok 4\n"; + +md5_hex($unistring); +print "not " if ( not utf8::is_utf8($unistring)); +print "ok 5\n" + + -- 1.6.2.2
I disagree that it's a bug for strings to be utf8::downgraded() as a side effect of trying to take their digest. You should pass encoded strings (bytes) to md5() so it's not a bug that the function treat the string as bytes.
CC: jesse [...] bestpractical.com
Subject: Re: [rt.cpan.org #44927] [PATCH] Whatever your feeling on whether I'm misusing encode, the two new tests _should_ have the same output.
Date: Tue, 9 Jun 2009 16:16:08 -0400
To: Gisle_Aas via RT <bug-Digest-MD5 [...] rt.cpan.org>
From: Jesse Vincent <jesse [...] bestpractical.com>
On Tue, Jun 09, 2009 at 04:12:24PM -0400, Gisle_Aas via RT wrote: Show quoted text
> I disagree that it's a bug for strings to be utf8::downgraded() as a side effect of trying to take > their digest. You should pass encoded strings (bytes) to md5() so it's not a bug that the > function treat the string as bytes.
So users should expect that calling md5() on a string will modify the string? That feels like scary action at a distance. --
On Tue Jun 09 16:16:13 2009, jesse@bestpractical.com wrote: Show quoted text
> So users should expect that calling md5() on a string will modify the > string? That feels like scary action at a distance.
The string isn't really modified as it contains exactly the same chars as it did before and if you compare it with itself before and after using 'eq' it's still the same string. I'm sorry the perl Unicode model is so messy. I subscribe to the belief that the UTF8 flag isn't about string semantics.
CC: jesse [...] bestpractical.com
Subject: Re: [rt.cpan.org #44927] [PATCH] Whatever your feeling on whether I'm misusing encode, the two new tests _should_ have the same output.
Date: Tue, 9 Jun 2009 16:53:26 -0400
To: Gisle_Aas via RT <bug-Digest-MD5 [...] rt.cpan.org>
From: Jesse Vincent <jesse [...] bestpractical.com>
On Tue, Jun 09, 2009 at 04:43:32PM -0400, Gisle_Aas via RT wrote: Show quoted text
> <URL: https://rt.cpan.org/Ticket/Display.html?id=44927 > > > On Tue Jun 09 16:16:13 2009, jesse@bestpractical.com wrote: >
> > So users should expect that calling md5() on a string will modify the > > string? That feels like scary action at a distance.
> > The string isn't really modified as it contains exactly the same chars as it did before and if you > compare it with itself before and after using 'eq' it's still the same string.
The string is modified. The octets in it aren't modified. If I pass in a string that has a utf8 flag set on it, Digest::MD5 shouldn't remove it. In my production code, this meant that adding an md5 call in a logging routine actually changed what got printed to the user afterward. That's wrong. Show quoted text
> I'm sorry the perl Unicode model is so messy.
So am I, believe me. And I'm well aware that it's certainly not your fault. We all know to beat up on Simon Cozens for it ;) Is your stance: * That you should be able to freely manipulate the UTF8 flag on scalars passed into md5() and calling code should only pass in scalars it doesn't expect to use again. * That the side-effect is a bug, but would be annoying to fix and isn't worth the effort. * Something else? If it's the latter, is this behaviour something you'd take a patch for? Best, Jesse --
My stance is: * there is something wrong with your print routine if it prints different bytes depending on if you have a utf8::upgraded() or utf8::downgraded() string. * the side-effect is hardly a bug and I really like to assume the "correct" Unicode model. * the SvPVbyte API in the core is there to be used for stuff that operates on bytes * the code fix you want is for md5() to make a mortal copy of any UTF8-flagged string passed in. I would find this annoying and against my own "good taste", but not hard to do. * if you insist enough to produce a patch for this I would probably apply it * I blame the EBCDIC guys for this mess * There is lots of other places with similar behaviour. If this is important (which I don't believe) we should make SvPVbyte not affect the UTF8 flag on the scalar.
CC: jesse [...] bestpractical.com
Subject: Re: [rt.cpan.org #44927] [PATCH] Whatever your feeling on whether I'm misusing encode, the two new tests _should_ have the same output.
Date: Tue, 9 Jun 2009 17:56:17 -0400
To: Gisle_Aas via RT <bug-Digest-MD5 [...] rt.cpan.org>
From: Jesse Vincent <jesse [...] bestpractical.com>
On Tue, Jun 09, 2009 at 05:51:48PM -0400, Gisle_Aas via RT wrote: Show quoted text
> <URL: https://rt.cpan.org/Ticket/Display.html?id=44927 > > > My stance is: > > * there is something wrong with your print routine if it prints different bytes depending on if > you have a utf8::upgraded() or utf8::downgraded() string.
If there's, say, a join involved, that can trigger the Latin-1 autopromote. I agree it's wrong, but it's not really the calling code. Show quoted text
> * the code fix you want is for md5() to make a mortal copy of any UTF8-flagged string > passed in. I would find this annoying and against my own "good taste", but not hard to do.
I agree that making a mortal copy of the string would be ugly and distasteful. I'm beginning to wonder if it would be workable to check the utf8 flag as a string comes in and to restore the original value of that flag before returning. Show quoted text
> * if you insist enough to produce a patch for this I would probably apply it
Fair enough. Show quoted text
> * I blame the EBCDIC guys for this mess
*grin*
On Tue Jun 09 17:56:23 2009, jesse@bestpractical.com wrote: Show quoted text
> I agree that making a mortal copy of the string would be ugly and > distasteful. I'm beginning to wonder if it would be workable to check > the utf8 flag as a string comes in and to restore the original value > of that flag before returning.
Yeah, something like this would work as well: bool had_utf8 = SvUTF8(sv) sv_utf8_downgrade(sv, 0); # will croak for wide chars # ... do stuff if (had_utf8) sv_utf8_upgrade(sv) # ... return
Subject: Re: [rt.cpan.org #44927] [PATCH] Whatever your feeling on whether I'm misusing encode, the two new tests _should_ have the same output.
Date: Fri, 12 Jun 2009 18:46:09 -0400
To: Gisle_Aas via RT <bug-Digest-MD5 [...] rt.cpan.org>
From: Jesse Vincent <jesse [...] bestpractical.com>
Here's a proposed patch, with updated testfile so it can still run on 5.6. Tested on 5.8.9 and 5.6.2 on OSX, which are what I have handy. I'm certainly no XS guru, so please feel free to ask me to redo this if the intent looks reasonable but the code makes you cringe. Best, Jesse

Message body is not shown because sender requested not to inline it.

A tweaked version of this patch was applied as <http://github.com/gisle/digest- md5/commit/55e2592f1ce834e50568ea68b31bff3f3e775bdc>