Bug #115168 for Encode: [PATCH] Passing regex globals to decode() results in wrong result

Wed Jun 08 10:18:27 2016 dcollinsn [...] gmail.com - Ticket created

Subject:

[PATCH] Passing regex globals to decode() results in wrong result

Hello, In [perl #128143][0], a user reports inconsistent behavior between the following two ways to call decode: $x =~ s/(.)/$latin1->decode($1)/ge $x =~ s/(.)/Encode::decode('latin1', $1)/ge The issue seems to be that Encode.xs Method_decode() performs some operations on src without checking to see if it is readonly or magical. Since $1 is magical, this results in incorrect data. From stepping through with GDB, it looks like the corruption doesn't happen until this line, early in encode_method(): U8 *s = (U8 *) SvPV(src, slen); The attached patch will correct this issue by making a mortal copy in these cases. Submitting to get a ticket number, patch to follow. [0]: https://rt.perl.org/Ticket/Display.html?id=128143

Wed Jun 08 10:24:08 2016 dcollinsn [...] gmail.com - Correspondence added

From:

dcollinsn [...] gmail.com

Patches attached. I can't easily find a similar issue in encode(), but I might just be using the wrong testcase.

Subject:

0002-cpan-115168-Test-for-decode-1.patch

From 1104ffa0fae19c1da34a000a0196c3cac5a70763 Mon Sep 17 00:00:00 2001 From: Dan Collins <dcollinsn@gmail.com> Date: Wed, 8 Jun 2016 10:15:29 -0400 Subject: [PATCH 2/2] [cpan #115168] Test for ->decode($1) Adds a testcase to decode.t for the above issue. --- t/decode.t | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/t/decode.t b/t/decode.t index 77cdaba..dbbc190 100644 --- a/t/decode.t +++ b/t/decode.t @@ -2,8 +2,8 @@ # $Id: decode.t,v 1.1 2013/08/29 16:47:39 dankogai Exp $ # use strict; -use Encode qw(decode_utf8 FB_CROAK); -use Test::More tests => 3; +use Encode qw(decode_utf8 FB_CROAK find_encoding decode); +use Test::More tests => 4; sub croak_ok(&) { my $code = shift; @@ -23,3 +23,8 @@ croak_ok { Encode::decode('utf-8', $orig2, FB_CROAK) }; chop(my $new = $bytes . $pad); croak_ok { Encode::decode_utf8($new, FB_CROAK) }; +my $latin1 = find_encoding('latin1'); +$orig = "\N{U+0080}"; +$orig =~ /(.)/; +is($latin1->decode($1), $orig, '[cpan #115168] passing magic regex globals to decode'); + -- 2.8.1

Subject:

0001-cpan-115168-Decode-returns-wrong-data-with-magical-i.patch

From 8ccfe4df06b47883a8fe01229137c7a1c67e165c Mon Sep 17 00:00:00 2001 From: Dan Collins <dcollinsn@gmail.com> Date: Wed, 8 Jun 2016 09:52:17 -0400 Subject: [PATCH 1/2] [cpan #115168] Decode returns wrong data with magical inputs In [perl #128143], a user reports inconsistent behavior between the following two ways to call decode: $x =~ s/(.)/$latin1->decode($1)/ge $x =~ s/(.)/Encode::decode('latin1', $1)/ge The issue seems to be that Encode.xs Method_decode() performs some operations on src without checking to see if it is readonly or magical. Since $1 is magical, this results in incorrect data. From stepping through with GDB, it looks like the corruption doesn't happen until this line, early in encode_method(): U8 *s = (U8 *) SvPV(src, slen); The patch creates a mortal copy of src if it is readonly or magical. --- Encode.xs | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/Encode.xs b/Encode.xs index cd7f7d1..0b11e80 100644 --- a/Encode.xs +++ b/Encode.xs @@ -647,8 +647,11 @@ CODE: int check; SV *fallback_cb = &PL_sv_undef; encode_t *enc = INT2PTR(encode_t *, SvIV(SvRV(obj))); + if (SvREADONLY(src) || SvSMAGICAL(src)) { + src = sv_mortalcopy(src); + } if (SvUTF8(src)) { - sv_utf8_downgrade(src, FALSE); + sv_utf8_downgrade(src, FALSE); } if (SvROK(check_sv)){ fallback_cb = check_sv; -- 2.8.1

Wed Jun 08 16:22:33 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Wed Jun 08 10:24:08 2016, dcollinsn@gmail.com wrote: Show quoted text

> Patches attached. > > I can't easily find a similar issue in encode(), but I might just be > using the wrong testcase.

I see a slight problem with the patch, or, rather, a bug that it fails to fix. I see you are using SvSMAGICAL, not SvGMAGICAL. If a GMAGICAL but not SMAGICAL scalar is passed, it will get the SvUTF8 flag checked too soon, because you can’t check the flag until after stringification. In fact, that applies to typeglobs and references as well. So maybe we just need a void SvPV before SvUTF8 check (after the ‘if’ block your patch adds), to make sure the flag has the right value.

Wed Jun 08 16:22:34 2016 The RT System itself - Status changed from 'new' to 'open'

Wed Jun 08 19:05:51 2016 dcollinsn [...] gmail.com - Correspondence added

From:

dcollinsn [...] gmail.com

On Wed Jun 08 16:22:33 2016, SPROUT wrote: Show quoted text

> I see a slight problem with the patch, or, rather, a bug that it fails > to fix.

Unsurprising. I'm not really at the point where I can anticipate all the things that this might be passed, so I'm really just plugging holes. Show quoted text

> I see you are using SvSMAGICAL, not SvGMAGICAL. If a > GMAGICAL but not SMAGICAL scalar is passed, it will get the SvUTF8 > flag checked too soon, because you can’t check the flag until after > stringification.

Show quoted text

> > In fact, that applies to typeglobs and references as well. So maybe > we just need a void SvPV before SvUTF8 check (after the ‘if’ block > your patch adds), to make sure the flag has the right value.

Sounds like a plan. Some creative fuzzing also found a testcase that exposes a similar problem in encode, so I'm going to investigate that some more and look for other related failures before updating this ticket with a more complete patch.

Wed Jun 08 20:32:01 2016 dcollinsn [...] gmail.com - Correspondence added

From:

dcollinsn [...] gmail.com

On Wed Jun 08 16:22:33 2016, SPROUT wrote: Show quoted text

> In fact, that applies to typeglobs and references as well. So maybe > we just need a void SvPV before SvUTF8 check (after the ‘if’ block > your patch adds), to make sure the flag has the right value.

Turns out, you need quite a bit more than that. If you pass a typeglob in (why are you doing this), SvPV will give you a string representation and set SvUTF8, but doesn't actually set svu_pv - so we need a temporary copy anyway or there's no point in calling sv_utf8_downgrade. 656 SvPV(src, len); (gdb) p *src $2 = {sv_any = 0xbcaae0, sv_refcnt = 5, sv_flags = 32777, sv_u = { svu_pv = 0xaadbb0 "", svu_iv = 11197360, svu_uv = 11197360, svu_rv = 0xaadbb0, svu_rx = 0xaadbb0, svu_array = 0xaadbb0, svu_hash = 0xaadbb0, svu_gp = 0xaadbb0, svu_fp = 0xaadbb0}} (gdb) n 657 is_utf8 = SvUTF8(src); (gdb) p *src $3 = {sv_any = 0xbcaae0, sv_refcnt = 5, sv_flags = 536903689, sv_u = { svu_pv = 0xaadbb0 "", svu_iv = 11197360, svu_uv = 11197360, svu_rv = 0xaadbb0, svu_rx = 0xaadbb0, svu_array = 0xaadbb0, svu_hash = 0xaadbb0, svu_gp = 0xaadbb0, svu_fp = 0xaadbb0}} But tracking this down leads me to the question: Is $latin1->decode(*b) something that is supported? Because, I /can/ submit a patch that fixes that, but I don't know if it would be accepted.

Thu Jun 09 01:20:01 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Wed Jun 08 20:32:01 2016, dcollinsn@gmail.com wrote: Show quoted text

> But tracking this down leads me to the question: Is $latin1-

> >decode(*b) something that is supported? Because, I /can/ submit a

> patch that fixes that, but I don't know if it would be accepted.

I generally follow the view that *every* scalar in Perl is a string (the type depends on the use), and, since substr(*b,...) is supported, and any decoder written in pure-Perl would be able to handle it, ->decode(*b) should be supported. I think you actually need to copy and SvPV_force every input that is not SvPOK already, since sv_utf8_downgrade (I just looked in sv.c) is almost a no-op if SvPOK is not true. That simple approach would Just Work for objects with string overloading too (and those should definitely work as strings).

Thu Jun 09 21:18:43 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Thu Jun 09 01:20:01 2016, SPROUT wrote: Show quoted text

> On Wed Jun 08 20:32:01 2016, dcollinsn@gmail.com wrote:

> > But tracking this down leads me to the question: Is $latin1-

> > > decode(*b) something that is supported? Because, I /can/ submit a

> > patch that fixes that, but I don't know if it would be accepted.

> > I generally follow the view that *every* scalar in Perl is a string > (the type depends on the use), and, since substr(*b,...) is supported, > and any decoder written in pure-Perl would be able to handle it, > ->decode(*b) should be supported. > > I think you actually need to copy and SvPV_force every input that is > not SvPOK already, since sv_utf8_downgrade (I just looked in sv.c) is > almost a no-op if SvPOK is not true.

Gah! I’m not thinking! sv_copypv if !SvPOK would be a better approach. Show quoted text

> That simple approach would Just Work for objects with string > overloading too (and those should definitely work as strings).

Thu Jun 09 21:47:13 2016 dcollinsn [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #115168] [PATCH] Passing regex globals to decode() results in wrong result
Date:	Thu, 9 Jun 2016 21:46:41 -0400
To:	bug-Encode [...] rt.cpan.org
From:	Dan Collins <dcollinsn [...] gmail.com>

Either works. Is SV *tmp; tmp = sv_newmortal(); sv_copypv(tmp, src); src = tmp; really better than src = sv_mortalcopy(src); SvPV_force_nolen(src); ? On Thu, Jun 9, 2016 at 9:18 PM, Father Chrysostomos via RT < bug-Encode@rt.cpan.org> wrote: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=115168 > > > On Thu Jun 09 01:20:01 2016, SPROUT wrote:

> > On Wed Jun 08 20:32:01 2016, dcollinsn@gmail.com wrote:

> > > But tracking this down leads me to the question: Is $latin1-

> > > > decode(*b) something that is supported? Because, I /can/ submit a

> > > patch that fixes that, but I don't know if it would be accepted.

> > > > I generally follow the view that *every* scalar in Perl is a string > > (the type depends on the use), and, since substr(*b,...) is supported, > > and any decoder written in pure-Perl would be able to handle it, > > ->decode(*b) should be supported. > > > > I think you actually need to copy and SvPV_force every input that is > > not SvPOK already, since sv_utf8_downgrade (I just looked in sv.c) is > > almost a no-op if SvPOK is not true.

> > Gah! I’m not thinking! sv_copypv if !SvPOK would be a better approach. >

> > That simple approach would Just Work for objects with string > > overloading too (and those should definitely work as strings).

> > > >

Thu Jun 09 22:09:26 2016 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Thu Jun 09 21:47:13 2016, dcollinsn@gmail.com wrote: Show quoted text

> Either works. Is > > SV *tmp; > tmp = sv_newmortal(); > sv_copypv(tmp, src); > src = tmp; > > really better than > > src = sv_mortalcopy(src); > SvPV_force_nolen(src); > > ?

SvPV_force is more complicated, so it’s slower. In fact, for a scalar that is more that just a PV (e.g., PVNV, PVIV, REGEXP, or PVGV), the new scalar ends up being heavier. For a REGEXP or PVGV you do end up copying a more complicated data structure unnecessarily. Also sv_copypv will stringify the original, causing uninitialized warnings to report the source SV. I know I’m being a bit pedantic. I’ve spent too much time looking at the perl source. :-)

Mon Jun 13 13:31:21 2016 dcollinsn [...] gmail.com - Correspondence added

From:

dcollinsn [...] gmail.com

Sorry for the delay, just getting back from a weekend trip. Attachment 3 replaces attachments 1 and 2, and fixes the additional issues noted by Sprout and discovered while investigating his suggestions. It creates a temporary, mortal, PV copy of any input SV to either encode or decode which is read-only, which has get or set magic, or which is not POK. Tests were added for the changed behavior, and no existing tests were impacted.

Subject:

0003-cpan-115168-Correct-magical-and-non-POK-inputs.patch

From 253098cb2d33d8b10fb377f776715e807c5cbb67 Mon Sep 17 00:00:00 2001 From: Dan Collins <dcollinsn@gmail.com> Date: Wed, 8 Jun 2016 10:15:29 -0400 Subject: [PATCH] [cpan #115168] Correct magical and non-POK inputs In [perl #128143], a user reports inconsistent behavior between the following two ways to call decode: $x =~ s/(.)/$latin1->decode($1)/ge $x =~ s/(.)/Encode::decode('latin1', $1)/ge The issue seems to be that Encode.xs Method_decode() performs some operations on src without checking to see if it is readonly or magical. Since $1 is magical, this results in incorrect data. From stepping through with GDB, it looks like the corruption doesn't happen until this line, early in encode_method(): U8 *s = (U8 *) SvPV(src, slen); For inputs that are readonly, magical, or not POK, we first create a mortal SV, then use sv_copypv to stringify and transfer. Encode needs to do the same thing. Currently, encode will not accept being passed, for example, a typeglob, but decode will. This patch brings the functions in line, and will stringify any input that sv_pvcopy is able to. Tests are added for both functions for both magic and typeglob inputs. --- Encode.xs | 14 +++++++++++++- t/Encode.t | 10 +++++++++- t/decode.t | 11 +++++++++-- 3 files changed, 31 insertions(+), 4 deletions(-) diff --git a/Encode.xs b/Encode.xs index cd7f7d1..b945eb4 100644 --- a/Encode.xs +++ b/Encode.xs @@ -647,8 +647,14 @@ CODE: int check; SV *fallback_cb = &PL_sv_undef; encode_t *enc = INT2PTR(encode_t *, SvIV(SvRV(obj))); + if (SvREADONLY(src) || SvSMAGICAL(src) || SvGMAGICAL(src) || !SvPOK(src)) { + SV *tmp; + tmp = sv_newmortal(); + sv_copypv(tmp, src); + src = tmp; + } if (SvUTF8(src)) { - sv_utf8_downgrade(src, FALSE); + sv_utf8_downgrade(src, FALSE); } if (SvROK(check_sv)){ fallback_cb = check_sv; @@ -672,6 +678,12 @@ CODE: int check; SV *fallback_cb = &PL_sv_undef; encode_t *enc = INT2PTR(encode_t *, SvIV(SvRV(obj))); + if (SvREADONLY(src) || SvSMAGICAL(src) || SvGMAGICAL(src) || !SvPOK(src)) { + SV *tmp; + tmp = sv_newmortal(); + sv_copypv(tmp, src); + src = tmp; + } sv_utf8_upgrade(src); if (SvROK(check_sv)){ fallback_cb = check_sv; diff --git a/t/Encode.t b/t/Encode.t index d490255..55592f0 100644 --- a/t/Encode.t +++ b/t/Encode.t @@ -25,7 +25,7 @@ my @character_set = ('0'..'9', 'A'..'Z', 'a'..'z'); my @source = qw(ascii iso8859-1 cp1250); my @destiny = qw(cp1047 cp37 posix-bc); my @ebcdic_sets = qw(cp1047 cp37 posix-bc); -plan test => 38+$n*@encodings + 2*@source*@destiny*@character_set + 2*@ebcdic_sets*256 + 6 + 5; +plan test => 38+$n*@encodings + 2*@source*@destiny*@character_set + 2*@ebcdic_sets*256 + 6 + 5 + 2; my $str = join('',map(chr($_),0x20..0x7E)); my $cpy = $str; ok(length($str),from_to($cpy,'iso8859-1','Unicode'),"Length Wrong"); @@ -164,3 +164,11 @@ $key = (keys %{{ "whatever" => '' }})[0]; $kopy = $key; decode("UTF-16LE", $kopy, Encode::FB_CROAK); ok $key, "whatever", 'decode with shared hash key scalars'; + +my $latin1 = find_encoding('latin1'); +my $orig = "\316"; +$orig =~ /(.)/; +ok $latin1->encode($1), $orig, '[cpan #115168] passing magic regex globals to encode'; + +*a = $orig; +ok $latin1->encode(*a), '*main::'.$orig, '[cpan #115168] passing typeglobs to encode'; diff --git a/t/decode.t b/t/decode.t index 77cdaba..8f7b64e 100644 --- a/t/decode.t +++ b/t/decode.t @@ -2,8 +2,8 @@ # $Id: decode.t,v 1.1 2013/08/29 16:47:39 dankogai Exp $ # use strict; -use Encode qw(decode_utf8 FB_CROAK); -use Test::More tests => 3; +use Encode qw(decode_utf8 FB_CROAK find_encoding decode); +use Test::More tests => 5; sub croak_ok(&) { my $code = shift; @@ -23,3 +23,10 @@ croak_ok { Encode::decode('utf-8', $orig2, FB_CROAK) }; chop(my $new = $bytes . $pad); croak_ok { Encode::decode_utf8($new, FB_CROAK) }; +my $latin1 = find_encoding('latin1'); +$orig = "\N{U+0080}"; +$orig =~ /(.)/; +is($latin1->decode($1), $orig, '[cpan #115168] passing magic regex globals to decode'); + +*a = $orig; +is($latin1->decode(*a), '*main::'.$orig, '[cpan #115168] passing typeglobs to decode'); -- 2.8.1

Tue Jun 14 04:02:08 2016 DANKOGAI [...] cpan.org - Correspondence added

Your patch is in: https://github.com/dankogai/p5-encode/commit/f37539b74f73327e6d1c85b245a064f6831edca0 Dan the Maintainer Thereof On Mon Jun 13 13:31:21 2016, dcollinsn@gmail.com wrote: Show quoted text

> Sorry for the delay, just getting back from a weekend trip. Attachment > 3 replaces attachments 1 and 2, and fixes the additional issues noted > by Sprout and discovered while investigating his suggestions. It > creates a temporary, mortal, PV copy of any input SV to either encode > or decode which is read-only, which has get or set magic, or which is > not POK. Tests were added for the changed behavior, and no existing > tests were impacted.

Tue Jun 14 04:02:30 2016 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'