Bug #115540 for Encode: from_to affecting COW strings

Wed Jun 22 13:31:35 2016 TEAM [...] cpan.org - Ticket created

Subject:

from_to affecting COW strings

The from_to() function is documented as modifying strings in place. This leads to unusual behaviour when COW is involved: $ perl -MTest::More -e'use utf8; use Encode; my $x = "täst"; is($x, "täst"); my $y = $x; Encode::from_to($y, "UTF-8", "iso-8859-1"); is($x, "täst"); done_testing' ok 1 not ok 2 and if I'm reading the output correctly here, it's left the original scalar in an inconsistent state: $ perl -MDevel::Peek -e'use utf8; use Encode; my $x = "täst"; Dump($x); my $y = $x; Encode::from_to($y, "UTF-8", "iso-8859-1"); Dump($x);' SV = PV(0x222ed70) at 0x224e580 REFCNT = 1 FLAGS = (POK,IsCOW,pPOK,UTF8) PV = 0x22523b0 "t\303\244st"\0 [UTF8 "t\x{e4}st"] CUR = 5 LEN = 10 COW_REFCNT = 1 SV = PV(0x222ed70) at 0x224e580 REFCNT = 1 FLAGS = (POK,IsCOW,pPOK,UTF8) PV = 0x22523b0 "t\344st\0"\0Malformed UTF-8 character (unexpected non-continuation byte 0x73, immediately after start byte 0xe4) in Dump at -e line 1. [UTF8 "t\x{0}\x{0}"] CUR = 5 LEN = 10 COW_REFCNT = 1 I think making a copy of the string first if COW_REFCNT > 1 would be less surprising. However, if the current behaviour is intentional, would it be possible to include a note in the documentation to highlight this? cheers, Tom

Wed Jun 22 13:53:55 2016 DANKOGAI [...] cpan.org - Correspondence added

That's because you "use utf8;", making utf8::is_utf8($x) true. And perldoc Encode explicitly states: Show quoted text

> The data in $octets must be encoded as octets and not as characters in Perl's internal format.

and % perl -MTest::More -e'use Encode; my $x = "täst"; is($x, "täst"); my $y = $x; Encode::from_to($y, "UTF-8", "iso-8859-1"); is($x, "täst"); done_testing' ok 1 ok 2 1..2 Dan the Maintainer Thereof On Wed Jun 22 13:31:35 2016, TEAM wrote: Show quoted text

> The from_to() function is documented as modifying strings in place. > This leads to unusual behaviour when COW is involved: > > $ perl -MTest::More -e'use utf8; use Encode; my $x = "täst"; is($x, > "täst"); my $y = $x; Encode::from_to($y, "UTF-8", "iso-8859-1"); > is($x, "täst"); done_testing' > ok 1 > not ok 2 > > and if I'm reading the output correctly here, it's left the original > scalar in an inconsistent state: > > $ perl -MDevel::Peek -e'use utf8; use Encode; my $x = "täst"; > Dump($x); my $y = $x; Encode::from_to($y, "UTF-8", "iso-8859-1"); > Dump($x);' > SV = PV(0x222ed70) at 0x224e580 > REFCNT = 1 > FLAGS = (POK,IsCOW,pPOK,UTF8) > PV = 0x22523b0 "t\303\244st"\0 [UTF8 "t\x{e4}st"] > CUR = 5 > LEN = 10 > COW_REFCNT = 1 > SV = PV(0x222ed70) at 0x224e580 > REFCNT = 1 > FLAGS = (POK,IsCOW,pPOK,UTF8) > PV = 0x22523b0 "t\344st\0"\0Malformed UTF-8 character (unexpected > non-continuation byte 0x73, immediately after start byte 0xe4) in Dump > at -e line 1. > [UTF8 "t\x{0}\x{0}"] > CUR = 5 > LEN = 10 > COW_REFCNT = 1 > > I think making a copy of the string first if COW_REFCNT > 1 would be > less surprising. However, if the current behaviour is intentional, > would it be possible to include a note in the documentation to > highlight this? > > cheers, > > Tom

Wed Jun 22 13:53:55 2016 The RT System itself - Status changed from 'new' to 'open'

Wed Jun 22 13:54:00 2016 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Jun 22 20:40:14 2016 TEAM [...] cpan.org - Correspondence added

Hi Dan, On 2016-06-22 18:53:55, DANKOGAI wrote: Show quoted text

> That's because you "use utf8;", making utf8::is_utf8($x) true. And > perldoc Encode explicitly states: >

> > The data in $octets must be encoded as octets and not as characters > > in Perl's internal format.

Sure; that part is fine, and I agree that the original code is incorrect. The documentation is clear on the expected inputs - what it does not mention is "if you pass bad data, this may affect any other scalar that shares the string data via COW". This means that if (when!) the user does pass an incorrectly-encoded scalar, the behaviour can be very hard to debug. What appears to be an *unrelated* scalar is being modified. This means that this innocent-looking code: my $x = "some string with Unïcode characters"; my $copy = $x; other_code($copy); # this (incorrectly) calls from_to() passing characters rather than bytes could modify the contents of $x. It'd require knowledge of the Perl internals to trace that. Patching this seems simple enough (attached is my attempt), although I have not run benchmarks to verify impact. Also I don't want to spam the RT queue so please let me know if there's a better place to discuss this (IRC, etc.). best regards, Tom

Subject:

RT115540.patch

commit 99961be505333ab348af3819fafef7afc0dd0ccf Author: tom <TEAM@cpan.org> Date: Thu Jun 23 01:28:39 2016 +0100 RT115540 Guard against unintentional COW changes diff --git a/Encode.xs b/Encode.xs index cd7f7d1..b9529c1 100644 --- a/Encode.xs +++ b/Encode.xs @@ -455,9 +455,10 @@ CODE: { dSP; ENTER; SAVETMPS; if (src == &PL_sv_undef || SvROK(src)) src = sv_2mortal(newSV(0)); + check = SvROK(check_sv) ? ENCODE_PERLQQ|ENCODE_LEAVE_SRC : SvIV(check_sv); + if (!(check & ENCODE_LEAVE_SRC) && SvIsCOW(src)) sv_force_normal(src); // disassociate from any other scalars before doing in-place modifications s = (U8 *) SvPV(src, slen); e = (U8 *) SvEND(src); - check = SvROK(check_sv) ? ENCODE_PERLQQ|ENCODE_LEAVE_SRC : SvIV(check_sv); /* * PerlIO check -- we assume the object is of PerlIO if renewed */ diff --git a/t/cow.t b/t/cow.t index 985e268..079d451 100644 --- a/t/cow.t +++ b/t/cow.t @@ -18,3 +18,10 @@ is $h{"L\x{e9}on"} => 'acme'; # use Devel::Peek; # Dump(\%h); +{ # invalid input to encode/decode/from_to should not affect COW-shared scalars + my $x = Encode::decode('UTF-8', "\303\244" x 4); + my $orig = "$x"; # non-COW copy + is($x, $orig, "copy of original string matches"); + { my $y = $x; Encode::from_to($y, "UTF-8", "iso-8859-1"); } + is($x, $orig, "original scalar unmodified after from_to() call"); +}

Thu Jun 23 02:40:43 2016 DANKOGAI [...] cpan.org - Correspondence added

Your patch is in. https://github.com/dankogai/p5-encode/commit/9aa5286c1ea323ca67ba9a769cb2ab4e10f1d5fe Dan the Maintainer Thereof On Wed Jun 22 20:40:14 2016, TEAM wrote: Show quoted text

> Hi Dan, > > On 2016-06-22 18:53:55, DANKOGAI wrote:

> > That's because you "use utf8;", making utf8::is_utf8($x) true. And > > perldoc Encode explicitly states: > >

> > > The data in $octets must be encoded as octets and not as characters > > > in Perl's internal format.

> > Sure; that part is fine, and I agree that the original code is > incorrect. The documentation is clear on the expected inputs - what it > does not mention is "if you pass bad data, this may affect any other > scalar that shares the string data via COW". > > This means that if (when!) the user does pass an incorrectly-encoded > scalar, the behaviour can be very hard to debug. > > What appears to be an *unrelated* scalar is being modified. This means > that this innocent-looking code: > > my $x = "some string with Unïcode characters"; > my $copy = $x; > other_code($copy); # this (incorrectly) calls from_to() passing > characters rather than bytes > > could modify the contents of $x. It'd require knowledge of the Perl > internals to trace that. > > Patching this seems simple enough (attached is my attempt), although I > have not run benchmarks to verify impact. Also I don't want to spam > the RT queue so please let me know if there's a better place to > discuss this (IRC, etc.). > > best regards, > > Tom