Bug #88230 for Encode: COW breakage with _utf8_on()

Thu Aug 29 06:00:50 2013 zefram [...] fysh.org - Ticket created

Subject:	COW breakage with _utf8_on()
Date:	Thu, 29 Aug 2013 11:00:31 +0100
To:	bug-Encode [...] rt.cpan.org
From:	Zefram <zefram [...] fysh.org>

Functions that side-effect a scalar, such as Encode::_utf8_on(), need to de-COW the operand. See [perl #79824] for the origins of this bug report; the bug has been fixed for core functions such as utf8::decode().

Recipe to reproduce problem:

$ perl -MEncode -lwe '%a=("L\x{c3}\x{a9}on"=>"acme"); ($k)=(keys %a); Encode::_utf8_on($k); %h = ($k => "acme"); print $h{"L\x{e9}on"}'
Use of uninitialized value in print at -e line 1.

For the purposes of this bug report, the string being _utf8_on-ed is always well-formed UTF-8, so the big documented caveat about _utf8_on doesn't apply.

What happens here is that $k, having come from keys(), shares its PV buffer with the HEK in %a, the _utf8_on doesn't touch the PV, and when $k is later used as a hash key the hash value already computed for that PV is reused. But _utf8_on has changed the hash value of the scalar, by changing which character sequence it represents. So %h ends up with its hash key stored under the wrong hash value, hence in the wrong bucket, hence looking up by an independent copy of the key fails.

-zefram
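The mechanism described in this report can be illustrated with a small self-contained model in plain C. This is only a sketch under stated assumptions: ToyScalar, toy_hash, toy_from_key, toy_key_hash and toy_utf8_on_unfixed are hypothetical names invented for the illustration, the hash is a simple multiply-by-31 stand-in rather than Perl's hash function, and none of this is Perl's actual internal layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in hash (not Perl's): hashes the character sequence the buffer
 * represents. With utf8 set, 2-byte UTF-8 sequences are decoded first;
 * longer sequences are out of scope for this sketch. */
static uint32_t toy_hash(const unsigned char *p, size_t len, bool utf8)
{
    uint32_t h = 0;
    size_t i = 0;
    assert(p != NULL || len == 0);
    while (i < len) {
        uint32_t cp = p[i++];
        if (utf8 && (cp & 0xE0) == 0xC0 && i < len)
            cp = ((cp & 0x1F) << 6) | (p[i++] & 0x3F);
        h = h * 31 + cp;
    }
    return h;
}

/* Hypothetical scalar: while it shares its buffer with a hash key, it
 * also carries the hash value that was computed for that key. */
typedef struct {
    const unsigned char *buf;
    size_t len;
    bool utf8;
    bool shares_key;      /* models the shared (COW) state */
    uint32_t shared_hash; /* meaningful only while shares_key is true */
} ToyScalar;

/* Models a scalar returned by keys(): it borrows the key's buffer and
 * the hash precomputed for the byte (non-UTF-8) interpretation. */
static ToyScalar toy_from_key(const unsigned char *buf, size_t len)
{
    ToyScalar s = { buf, len, false, true, toy_hash(buf, len, false) };
    return s;
}

/* Hash used when the scalar is stored as a hash key: the precomputed
 * value is reused whenever the buffer is still shared. */
static uint32_t toy_key_hash(const ToyScalar *s)
{
    return s->shares_key ? s->shared_hash
                         : toy_hash(s->buf, s->len, s->utf8);
}

/* Models the unfixed behaviour: only the flag changes, so the borrowed
 * hash silently goes stale. */
static void toy_utf8_on_unfixed(ToyScalar *s)
{
    s->utf8 = true;
}
```

In this model, calling toy_utf8_on_unfixed() on a scalar built by toy_from_key() from the bytes "L\xc3\xa9on" leaves toy_key_hash() returning the value computed for the five-byte string, while an independent "L\xe9on" key hashes to a different value, which corresponds to the failed lookup in the recipe.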

Thu Aug 29 09:20:50 2013 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

To fix this, you would need something like this:

#ifndef SvIsCOW
if (SvIsCOW(sv))
    sv_force_normal(sv);
#endif

(I didn’t look at Encode’s source when I wrote that.)

Thu Aug 29 09:20:51 2013 The RT System itself - Status changed from 'new' to 'open'

Thu Aug 29 11:16:03 2013 DANKOGAI [...] cpan.org - Correspondence added

I don't think it is a bug since $bytes ne $utf8 where $utf8 = decode_utf8($bytes) stands for hash keys, too.

% perl -MData::Dumper -MDevel::Peek -MEncode -lwe '%a=("L\x{c3}\x{a9}on"=>"acme"); ($k)=(keys %a); Encode::_utf8_on($k); %h = %a; $h{$k}="acme";Dump\%h;print Dumper(\%h)'
SV = IV(0x7fb83b02e718) at 0x7fb83b02e728
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x7fb83b1a7ed8
  SV = PVHV(0x7fb83b181cb8) at 0x7fb83b1a7ed8
    REFCNT = 2
    FLAGS = (SHAREKEYS,HASKFLAGS)
    ARRAY = 0x7fb83ac72478 (0:7, 2:1)
    hash quality = 62.5%
    KEYS = 2
    FILL = 1
    MAX = 7
    Elt "L\303\251on" [UTF8 "L\x{e9}on"] HASH = 0x159f39b9
    SV = PV(0x7fb83b007198) at 0x7fb83b006358
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x7fb83ac23038 "acme"\0
      CUR = 4
      LEN = 24
    Elt "L\303\251on" HASH = 0x159f39b9
    SV = PV(0x7fb83b007228) at 0x7fb83b02e710
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x7fb83ac0f238 "acme"\0
      CUR = 4
      LEN = 24
$VAR1 = {
          "L\x{e9}on" => 'acme',
          'Léon' => 'acme'
        };

Same hash value, but considered different keys.

Dan the Encode Maintainer

Thu Aug 29 11:21:20 2013 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Thu Aug 29 09:20:50 2013, SPROUT wrote:
> To fix this, you would need something like this:
>
> #ifndef SvIsCOW
> if (SvIsCOW(sv))
>     sv_force_normal(sv);
> #endif

Sorry, I was half asleep when I wrote that.

#ifndef SvIsCOW
# define SvIsCOW (SvREADONLY(sv) && SvFAKE(sv))
#endif
if (SvIsCOW(sv))
    sv_force_normal(sv);
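Why de-COWing before the flag change helps can be sketched with a self-contained model in plain C. It is an illustration only, under stated assumptions: ModelSv, model_hash, model_force_normal, model_utf8_on and model_hash_for_store are hypothetical names, the hash is a multiply-by-31 stand-in rather than Perl's hash function, and model_force_normal only loosely imitates what sv_force_normal() is asked to do here (give the scalar a private buffer so that nothing computed for the shared key is reused).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in hash over the represented characters; with utf8 set, 2-byte
 * UTF-8 sequences are decoded first (longer ones are out of scope). */
static uint32_t model_hash(const unsigned char *p, size_t len, bool utf8)
{
    uint32_t h = 0;
    size_t i = 0;
    while (i < len) {
        uint32_t cp = p[i++];
        if (utf8 && (cp & 0xE0) == 0xC0 && i < len)
            cp = ((cp & 0x1F) << 6) | (p[i++] & 0x3F);
        h = h * 31 + cp;
    }
    return h;
}

/* Hypothetical scalar that may borrow a hash key's buffer and hash. */
typedef struct {
    unsigned char *buf;
    size_t len;
    bool utf8;
    bool is_cow;       /* buffer and hash are borrowed from a hash key */
    uint32_t key_hash; /* meaningful only while is_cow is true */
} ModelSv;

/* Loose analogue of de-COWing: take a private copy of the buffer and
 * stop using the borrowed hash. The caller owns (and frees) the copy. */
static void model_force_normal(ModelSv *sv)
{
    unsigned char *copy = malloc(sv->len ? sv->len : 1);
    assert(copy != NULL);
    memcpy(copy, sv->buf, sv->len);
    sv->buf = copy;
    sv->is_cow = false;
}

/* Flag change with the guard suggested in this thread: de-COW first. */
static void model_utf8_on(ModelSv *sv)
{
    if (sv->is_cow)
        model_force_normal(sv);
    sv->utf8 = true;
}

/* Hash used when the scalar is stored as a key: the borrowed value is
 * reused only while the buffer is still shared. */
static uint32_t model_hash_for_store(const ModelSv *sv)
{
    return sv->is_cow ? sv->key_hash
                      : model_hash(sv->buf, sv->len, sv->utf8);
}
```

With the guard in place, a scalar that started out sharing the bytes "L\xc3\xa9on" is hashed afresh after model_utf8_on(), so in this model it matches the hash of an independent "L\xe9on" key.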

Thu Aug 29 11:24:23 2013 victor [...] vsespb.ru - Correspondence added

From:	victor [...] vsespb.ru

If we append 'print $k eq "L\x{e9}on"' to the original example, it proves that this is a bug:

perl -MEncode -lwe '%a=("L\x{c3}\x{a9}on"=>"acme"); ($k)=(keys %a); Encode::_utf8_on($k); %h = ($k => "acme"); print $h{"L\x{e9}on"}; print $k eq "L\x{e9}on"'
Use of uninitialized value in print at -e line 1.
1

$k compares equal to "L\x{e9}on", yet looking up $h{"L\x{e9}on"} still fails.

Thu Aug 29 12:47:00 2013 DANKOGAI [...] cpan.org - Correspondence added

Fixed accordingly:

% git diff Encode.xs
diff --git a/Encode.xs b/Encode.xs
index d088d25..c0d0591 100644
--- a/Encode.xs
+++ b/Encode.xs
@@ -837,6 +837,10 @@ CODE:
 OUTPUT:
     RETVAL
 
+#ifndef SvIsCOW
+# define SvIsCOW (SvREADONLY(sv) && SvFAKE(sv))
+#endif
+
 SV *
 _utf8_on(sv)
 SV * sv
@@ -845,6 +849,7 @@ CODE:
     if (SvPOK(sv)) {
         SV *rsv = newSViv(SvUTF8(sv));
         RETVAL = rsv;
+        if (SvIsCOW(sv)) sv_force_normal(sv);
         SvUTF8_on(sv);
     } else {
         RETVAL = &PL_sv_undef;
@@ -861,6 +866,7 @@ CODE:
     if (SvPOK(sv)) {
         SV *rsv = newSViv(SvUTF8(sv));
         RETVAL = rsv;
+        if (SvIsCOW(sv)) sv_force_normal(sv);
         SvUTF8_off(sv);
     } else {
         RETVAL = &PL_sv_undef;

Dan the Encode Maintainer

Thu Aug 29 12:47:00 2013 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'
