Bug #14078 for IPC-Run: run() miscalculates length of UTF-8 strings

Tue Aug 09 11:24:37 2005 domq [...] cpan.org - Ticket created

Subject:

run() miscalculates length of UTF-8 strings

IPC::Run::finish() appears to rely on length() to compute the number of bytes to send to the outbound pipe(s). Unfortunately this won't add up if the string being sent has the UTF-8 flag set (see L<perlunicode>) because length() returns the number of characters, not bytes, when applied to an UTF-8 string. Attached is a test case that demonstrates this by invoking "cat". Due to the miscalculation, the string sent to "cat" is truncated by two bytes. Using IPC::Run 0.79 from CPAN, Linux kernel version 2.6.11.7, Debian Woody. The test fails in the same way (subtests 1 to 3 pass, 4 fails) on both the distribution's stock Perl 5.6.1 and a home-built Perl 5.8.5.

use Test; BEGIN { plan tests => 4 }; use IPC::Run qw(run); use strict; require utf8; my $utf_in = do { use utf8; "rÃ©sumÃ©..." }; my $bytes_in = pack("C*", unpack("C*", $utf_in)); # $utf_in and $bytes_in now contain the same bytes in Perl's memory # (as per the pack/unpack above), and differ only by the UTF-8 flag. # Let's make sure about the latter ($utf_in should have the flag set, # and $bytes_in should have it cleared): skip ( (! utf8->can("is_utf8") ? "Skipped - Perl 5.8 required for this sub-test" : undef), sub { utf8::is_utf8($utf_in) && ! utf8::is_utf8($bytes_in) }); # Same purpose but less precise and also works under Perl 5.6: ok(length($bytes_in), 2 + length($utf_in), "UTF-8 flags and lengths don't add up"); my $bytes_out; # Test using the byte string: "cat" should be transparent. run ["cat"], \$bytes_in, \$bytes_out; ok($bytes_out, $bytes_in, "non-transparent IPC::Run (or runaway cat ?)"); # Same test using the UTF-8 string run ["cat"], \$utf_in, \$bytes_out; ok($bytes_out, $bytes_in, "non-transparent IPC::Run when using UTF-8");

Wed May 10 15:56:28 2006 RSOD [...] cpan.org - Status changed from 'new' to 'open'

Wed May 10 15:56:29 2006 RSOD [...] cpan.org - Given to RSOD

Wed May 10 15:56:29 2006 RSOD [...] cpan.org - Severity Important changed to Critical

Tue Feb 02 05:25:01 2010 james2vegas [...] aim.com - Correspondence added

This bug keeps GraphViz from working with UTF-8 strings, as the extra bytes not found by length() are usually the end of an edge definition, usually causing it to point to an invalid nodename, and the terminating }.

Mon Mar 22 20:07:24 2010 TODDR [...] cpan.org - Stolen from RSOD

Mon Mar 22 20:12:26 2010 TODDR [...] cpan.org - Correspondence added

Patches welcome. I'm not a UTF 8 expert. I think the big problem here is that any solution would have to be backwards compatible to 5.006. Alternativley we could simply document that you need to have perl 5.X in order for UTF-8 to function.

Tue Jan 04 20:21:15 2011 cpan.org [...] mavit.org.uk - Correspondence added

On Mon Mar 22 20:12:26 2010, TODDR wrote: Show quoted text

> Patches welcome. I'm not a UTF 8 expert. I think the big problem here is > that any solution would have to be backwards compatible to 5.006. > > Alternativley we could simply document that you need to have perl 5.X in > order for UTF-8 to function.

Assuming that the entire package is written expecting byte rather than character semantics, this simple patch should do the trick. The bytes pragma is available in Perl 5.6, so I don't see any reason why this shouldn't work there.

Subject:

ipc-run-utf8.diff

Index: t/utf8.t =================================================================== --- t/utf8.t (revision 0) +++ t/utf8.t (revision 0) @@ -0,0 +1,30 @@ +use Test; +BEGIN { plan tests => 3 }; + +use IPC::Run qw(run); +use strict; +require utf8; +use Encode; + +my $utf_in = do { use utf8; "rÃ©sumÃ©..." }; +my $bytes_in = pack("C*", unpack("C*", $utf_in)); + +# $utf_in and $bytes_in now contain the same bytes in Perl's memory +# (as per the pack/unpack above), and differ only by the UTF-8 flag. +# Let's make sure about the latter ($utf_in should have the flag set, +# and $bytes_in should have it cleared): +skip ( (! utf8->can("is_utf8") ? + "Skipped - Perl 5.8 required for this sub-test" : undef), + sub { utf8::is_utf8($utf_in) && ! utf8::is_utf8($bytes_in) }); + +my $bytes_out; + +# Test using the byte string: "cat" should be transparent. +run ["cat"], \$bytes_in, \$bytes_out; +ok($bytes_out, $bytes_in, + "non-transparent IPC::Run (or runaway cat ?)"); + +# Same test using the UTF-8 string +run ["cat"], \$utf_in, \$bytes_out; +ok(decode_utf8($bytes_out), $utf_in, + "non-transparent IPC::Run when using UTF-8"); Property changes on: t/utf8.t ___________________________________________________________________ Added: svn:eol-style + native Index: lib/IPC/Run.pm =================================================================== --- lib/IPC/Run.pm (revision 13566) +++ lib/IPC/Run.pm (working copy) @@ -1,4 +1,5 @@ package IPC::Run; +use bytes; =pod

Thu Apr 28 14:24:10 2011 kwilliams [...] cpan.org - Correspondence added

I'm getting bitten by this bug too, any timeline for a release fixing it? Until it's fixed, the input to run(@cmd, '<', \$input) is chopped off somewhere before the end. -Ken

Fri Apr 29 08:44:34 2011 TODDR [...] cpan.org - Correspondence added

On Thu Apr 28 14:24:10 2011, KWILLIAMS wrote: Show quoted text

> I'm getting bitten by this bug too, any timeline for a release fixing > it? Until it's fixed, the input > to run(@cmd, '<', \$input) is chopped off somewhere before the end. > > -Ken

Sorry all for the delay. The patch looks pretty good. I'll try to get a dev release out today.

Wed Jun 29 10:47:10 2011 dmuey [...] cpan.org - Correspondence added

Hey Todd, You'd asked me about this and basically the current solution of using bytes::length() is probably the way to go here***. So the solution 'use bytes;' should cover it. FWIW, my typical stance w/ bytes is to require it in to make its functions available and then explicitly use bytes:: or CORE:: so there is less confusion as to what was intended. As far as the test goes: I think the test could be simplified and more robust if it didn't use utf8 functions since it varies greatly between perl versions and usually does not do what you think it is doing. *** This demonstrates that bytes::length() is always accurate regardless of is the string is a byte string or unicode string (if it looks garbled then the first command should be the ellipses character itself (command-semicolon on OSX) after "string"): [dmuey@multivac ~]$ perl -Mstrict -wle 'require bytes;my $s="bytes string…";print bytes::length($s);print CORE::length($s);' 15 15 [dmuey@multivac ~]$ perl -Mstrict -wle 'require bytes;my $s="bytes string\xE2\x80\xA6";print bytes::length($s);print CORE::length($s);' 15 15 [dmuey@multivac ~]$ perl -Mstrict -wle 'require bytes;my $s="unicode string\x{2026}";print bytes::length($s);print CORE::length($s);' 17 15 [dmuey@multivac ~]$ perl -MEncode -Mstrict -wle 'require bytes;my $s=Encode::decode_utf8("bytes string…");print bytes::length($s);print CORE::length($s);' 15 13 [dmuey@multivac ~]$ perl -MEncode -Mstrict -wle 'require bytes;my $s=Encode::decode_utf8("bytes string\xE2\x80\xA6");print bytes::length($s);print CORE::length($s);' 15 13 [dmuey@multivac ~]$ perl -MEncode -Mstrict -wle 'require bytes;my $s=Encode::decode_utf8("unicode string\x{2026}");print bytes::length($s);print CORE::length($s);' 17 15 [dmuey@multivac ~]$

Wed Jun 29 11:31:52 2011 dmuey [...] cpan.org - Correspondence added

Show quoted text

> As far as the test goes: I think the test could be simplified and more > robust if it didn't use > utf8 functions since it varies greatly between perl versions and > usually does not do what you > think it is doing.

`git show HEAD` attached before 'use bytes' fix: [dmuey@multivac IPC-Run (RT14078)]$ prove -wlv t/utf8.t t/utf8....1..4 ok 1 - Encode::decode_utf8() does not lvalue our bytes string var ok 2 - byte string and unicode string same string as far as humans are concerned ok 3 - run() w/ byte string # Failed test 'run() w/ unicode string' not ok 4 - run() w/ unicode string # at t/utf8.t line 26. Wide character in print at /System/Library/Perl/5.10.0/Test/Builder.pm line 1266. # got: 'string�' # expected: 'string…' # Looks like you failed 1 test of 4. dubious Test returned status 1 (wstat 256, 0x100) DIED. FAILED test 4 Failed 1/4 tests, 75.00% okay Failed Test Stat Wstat Total Fail List of Failed ------------------------------------------------------------------------------- t/utf8.t 1 256 4 1 4 Failed 1/1 test scripts. 1/4 subtests failed. Files=1, Tests=4, 0 wallclock secs ( 0.05 cusr + 0.01 csys = 0.06 CPU) Failed 1/1 test programs. 1/4 subtests failed. after 'use bytes' fix: [dmuey@multivac IPC-Run (RT14078)]$ prove -wlv t/utf8.t t/utf8....1..4 ok 1 - Encode::decode_utf8() does not lvalue our bytes string var ok 2 - byte string and unicode string same string as far as humans are concerned ok 3 - run() w/ byte string ok 4 - run() w/ unicode string ok All tests successful. Files=1, Tests=4, 0 wallclock secs ( 0.05 cusr + 0.01 csys = 0.06 CPU)

Subject:

rt14078.tests.diff

commit 4b2111afd1beb4121d3c2a4e7b85586373cffc65 Author: Daniel Muey <dmuey@multivac.local> Date: Wed Jun 29 10:26:48 2011 -0500 simplify/robustify tests regarding utf-8 (via byte strings and unicode strings) diff --git a/t/utf8.t b/t/utf8.t index 75b5a59..9b0d668 100644 --- a/t/utf8.t +++ b/t/utf8.t @@ -1,30 +1,26 @@ -use Test; -BEGIN { plan tests => 3 }; +use Test::More tests => 4; -use IPC::Run qw(run); use strict; -require utf8; -use Encode; +use warnings; +use IPC::Run (); +use Encode (); -my $utf_in = do { use utf8; "râÂ©sumâÂ©..." }; -my $bytes_in = pack("C*", unpack("C*", $utf_in)); +##### data setup and sanity check +my $unicode_string = "string\x{2026}"; +my $byte_string = "string\xE2\x80\xA6"; -# $utf_in and $bytes_in now contain the same bytes in Perl's memory -# (as per the pack/unpack above), and differ only by the UTF-8 flag. -# Let's make sure about the latter ($utf_in should have the flag set, -# and $bytes_in should have it cleared): -skip ( (! utf8->can("is_utf8") ? - "Skipped - Perl 5.8 required for this sub-test" : undef), - sub { utf8::is_utf8($utf_in) && ! utf8::is_utf8($bytes_in) }); +## make sure what we're doing doesn't incidentally change the data and that the data is what we expect +my $x = Encode::decode_utf8($byte_string); +isnt( $x, $byte_string, "Encode::decode_utf8() does not lvalue our bytes string var" ); +is( $unicode_string, Encode::decode_utf8($byte_string), "byte string and unicode string same string as far as humans are concerned" ); +##### actual IPC::Run::run() tests my $bytes_out; -# Test using the byte string: "cat" should be transparent. -run ["cat"], \$bytes_in, \$bytes_out; -ok($bytes_out, $bytes_in, - "non-transparent IPC::Run (or runaway cat ?)"); +## Test using the byte string: "cat" should be transparent. +IPC::Run::run( ["cat"], \$byte_string, \$bytes_out ); +is( $bytes_out, $byte_string, "run() w/ byte string" ); -# Same test using the UTF-8 string -run ["cat"], \$utf_in, \$bytes_out; -ok(decode_utf8($bytes_out), $utf_in, - "non-transparent IPC::Run when using UTF-8"); +## Same test using the Unicode string +IPC::Run::run( ["cat"], \$unicode_string, \$bytes_out ); +is( Encode::decode_utf8($bytes_out), $unicode_string, "run() w/ unicode string" );

Thu Jun 30 10:42:55 2011 dmuey [...] cpan.org - Correspondence added

Attached is the test file (instead of the diff) as requested. Show quoted text

> `git show HEAD` attached > > before 'use bytes' fix: > > [dmuey@multivac IPC-Run (RT14078)]$ prove -wlv t/utf8.t > t/utf8....1..4 > ok 1 - Encode::decode_utf8() does not lvalue our bytes string var > ok 2 - byte string and unicode string same string as far as humans are > concerned > ok 3 - run() w/ byte string > > # Failed test 'run() w/ unicode string' > not ok 4 - run() w/ unicode string > # at t/utf8.t line 26. > Wide character in print at /System/Library/Perl/5.10.0/Test/Builder.pm > line 1266. > # got: 'string�' > # expected: 'string…' > # Looks like you failed 1 test of 4. > dubious > Test returned status 1 (wstat 256, 0x100) > DIED. FAILED test 4 > Failed 1/4 tests, 75.00% okay > Failed Test Stat Wstat Total Fail List of Failed > -----------------------------------------------------------------------------

-- Show quoted text

> t/utf8.t 1 256 4 1 4 > Failed 1/1 test scripts. 1/4 subtests failed. > Files=1, Tests=4, 0 wallclock secs ( 0.05 cusr + 0.01 csys = 0.06 > CPU) > Failed 1/1 test programs. 1/4 subtests failed. > > after 'use bytes' fix: > > [dmuey@multivac IPC-Run (RT14078)]$ prove -wlv t/utf8.t > t/utf8....1..4 > ok 1 - Encode::decode_utf8() does not lvalue our bytes string var > ok 2 - byte string and unicode string same string as far as humans are > concerned > ok 3 - run() w/ byte string > ok 4 - run() w/ unicode string > ok > All tests successful. > Files=1, Tests=4, 0 wallclock secs ( 0.05 cusr + 0.01 csys = 0.06 > CPU)

Subject:

utf8.t

use Test::More tests => 4; use strict; use warnings; use IPC::Run (); use Encode (); ##### data setup and sanity check my $unicode_string = "string\x{2026}"; my $byte_string = "string\xE2\x80\xA6"; ## make sure what we're doing doesn't incidentally change the data and that the data is what we expect my $x = Encode::decode_utf8($byte_string); isnt( $x, $byte_string, "Encode::decode_utf8() does not lvalue our bytes string var" ); is( $unicode_string, Encode::decode_utf8($byte_string), "byte string and unicode string same string as far as humans are concerned" ); ##### actual IPC::Run::run() tests my $bytes_out; ## Test using the byte string: "cat" should be transparent. IPC::Run::run( ["cat"], \$byte_string, \$bytes_out ); is( $bytes_out, $byte_string, "run() w/ byte string" ); ## Same test using the Unicode string IPC::Run::run( ["cat"], \$unicode_string, \$bytes_out ); is( Encode::decode_utf8($bytes_out), $unicode_string, "run() w/ unicode string" );

Fri Jul 01 00:34:28 2011 TODDR [...] cpan.org - Correspondence added

I've released The combined patches in 0.90_03. The patch can be found here: https://github.com/toddr/IPC-Run/commit/ed1dd80f31d42e50bf78c27f9b795f6f0741edb2

Fri Jul 01 00:34:46 2011 TODDR [...] cpan.org - Status changed from 'open' to 'patched'

Wed Jul 06 17:45:20 2011 TODDR [...] cpan.org - Status changed from 'patched' to 'resolved'