Skip Menu |

This queue is for tickets about the IPC-Run CPAN distribution.

Report information
The Basics
Id: 14078
Status: resolved
Priority: 0/
Queue: IPC-Run

People
Owner: TODDR [...] cpan.org
Requestors: dom [...] idealx.com
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 0.79
Fixed in: (no value)



Subject: run() miscalculates length of UTF-8 strings
IPC::Run::finish() appears to rely on length() to compute the number of bytes to send to the outbound pipe(s). Unfortunately this won't add up if the string being sent has the UTF-8 flag set (see L<perlunicode>) because length() returns the number of characters, not bytes, when applied to an UTF-8 string. Attached is a test case that demonstrates this by invoking "cat". Due to the miscalculation, the string sent to "cat" is truncated by two bytes. Using IPC::Run 0.79 from CPAN, Linux kernel version 2.6.11.7, Debian Woody. The test fails in the same way (subtests 1 to 3 pass, 4 fails) on both the distribution's stock Perl 5.6.1 and a home-built Perl 5.8.5.
use Test; BEGIN { plan tests => 4 }; use IPC::Run qw(run); use strict; require utf8; my $utf_in = do { use utf8; "résumé..." }; my $bytes_in = pack("C*", unpack("C*", $utf_in)); # $utf_in and $bytes_in now contain the same bytes in Perl's memory # (as per the pack/unpack above), and differ only by the UTF-8 flag. # Let's make sure about the latter ($utf_in should have the flag set, # and $bytes_in should have it cleared): skip ( (! utf8->can("is_utf8") ? "Skipped - Perl 5.8 required for this sub-test" : undef), sub { utf8::is_utf8($utf_in) && ! utf8::is_utf8($bytes_in) }); # Same purpose but less precise and also works under Perl 5.6: ok(length($bytes_in), 2 + length($utf_in), "UTF-8 flags and lengths don't add up"); my $bytes_out; # Test using the byte string: "cat" should be transparent. run ["cat"], \$bytes_in, \$bytes_out; ok($bytes_out, $bytes_in, "non-transparent IPC::Run (or runaway cat ?)"); # Same test using the UTF-8 string run ["cat"], \$utf_in, \$bytes_out; ok($bytes_out, $bytes_in, "non-transparent IPC::Run when using UTF-8");
This bug keeps GraphViz from working with UTF-8 strings, as the extra bytes not found by length() are usually the end of an edge definition, usually causing it to point to an invalid nodename, and the terminating }.

 
Patches welcome. I'm not a UTF 8 expert. I think the big problem here is that any solution would have to be backwards compatible to 5.006. Alternativley we could simply document that you need to have perl 5.X in order for UTF-8 to function.
On Mon Mar 22 20:12:26 2010, TODDR wrote: Show quoted text
> Patches welcome. I'm not a UTF 8 expert. I think the big problem here is > that any solution would have to be backwards compatible to 5.006. > > Alternativley we could simply document that you need to have perl 5.X in > order for UTF-8 to function.
Assuming that the entire package is written expecting byte rather than character semantics, this simple patch should do the trick. The bytes pragma is available in Perl 5.6, so I don't see any reason why this shouldn't work there.
Subject: ipc-run-utf8.diff
Index: t/utf8.t =================================================================== --- t/utf8.t (revision 0) +++ t/utf8.t (revision 0) @@ -0,0 +1,30 @@ +use Test; +BEGIN { plan tests => 3 }; + +use IPC::Run qw(run); +use strict; +require utf8; +use Encode; + +my $utf_in = do { use utf8; "résumé..." }; +my $bytes_in = pack("C*", unpack("C*", $utf_in)); + +# $utf_in and $bytes_in now contain the same bytes in Perl's memory +# (as per the pack/unpack above), and differ only by the UTF-8 flag. +# Let's make sure about the latter ($utf_in should have the flag set, +# and $bytes_in should have it cleared): +skip ( (! utf8->can("is_utf8") ? + "Skipped - Perl 5.8 required for this sub-test" : undef), + sub { utf8::is_utf8($utf_in) && ! utf8::is_utf8($bytes_in) }); + +my $bytes_out; + +# Test using the byte string: "cat" should be transparent. +run ["cat"], \$bytes_in, \$bytes_out; +ok($bytes_out, $bytes_in, + "non-transparent IPC::Run (or runaway cat ?)"); + +# Same test using the UTF-8 string +run ["cat"], \$utf_in, \$bytes_out; +ok(decode_utf8($bytes_out), $utf_in, + "non-transparent IPC::Run when using UTF-8"); Property changes on: t/utf8.t ___________________________________________________________________ Added: svn:eol-style + native Index: lib/IPC/Run.pm =================================================================== --- lib/IPC/Run.pm (revision 13566) +++ lib/IPC/Run.pm (working copy) @@ -1,4 +1,5 @@ package IPC::Run; +use bytes; =pod
I'm getting bitten by this bug too, any timeline for a release fixing it? Until it's fixed, the input to run(@cmd, '<', \$input) is chopped off somewhere before the end. -Ken
On Thu Apr 28 14:24:10 2011, KWILLIAMS wrote: Show quoted text
> I'm getting bitten by this bug too, any timeline for a release fixing > it? Until it's fixed, the input > to run(@cmd, '<', \$input) is chopped off somewhere before the end. > > -Ken
Sorry all for the delay. The patch looks pretty good. I'll try to get a dev release out today.
Hey Todd, You'd asked me about this and basically the current solution of using bytes::length() is probably the way to go here***. So the solution 'use bytes;' should cover it. FWIW, my typical stance w/ bytes is to require it in to make its functions available and then explicitly use bytes:: or CORE:: so there is less confusion as to what was intended. As far as the test goes: I think the test could be simplified and more robust if it didn't use utf8 functions since it varies greatly between perl versions and usually does not do what you think it is doing. *** This demonstrates that bytes::length() is always accurate regardless of is the string is a byte string or unicode string (if it looks garbled then the first command should be the ellipses character itself (command-semicolon on OSX) after "string"): [dmuey@multivac ~]$ perl -Mstrict -wle 'require bytes;my $s="bytes string…";print bytes::length($s);print CORE::length($s);' 15 15 [dmuey@multivac ~]$ perl -Mstrict -wle 'require bytes;my $s="bytes string\xE2\x80\xA6";print bytes::length($s);print CORE::length($s);' 15 15 [dmuey@multivac ~]$ perl -Mstrict -wle 'require bytes;my $s="unicode string\x{2026}";print bytes::length($s);print CORE::length($s);' 17 15 [dmuey@multivac ~]$ perl -MEncode -Mstrict -wle 'require bytes;my $s=Encode::decode_utf8("bytes string…");print bytes::length($s);print CORE::length($s);' 15 13 [dmuey@multivac ~]$ perl -MEncode -Mstrict -wle 'require bytes;my $s=Encode::decode_utf8("bytes string\xE2\x80\xA6");print bytes::length($s);print CORE::length($s);' 15 13 [dmuey@multivac ~]$ perl -MEncode -Mstrict -wle 'require bytes;my $s=Encode::decode_utf8("unicode string\x{2026}");print bytes::length($s);print CORE::length($s);' 17 15 [dmuey@multivac ~]$
Show quoted text
> As far as the test goes: I think the test could be simplified and more > robust if it didn't use > utf8 functions since it varies greatly between perl versions and > usually does not do what you > think it is doing.
`git show HEAD` attached before 'use bytes' fix: [dmuey@multivac IPC-Run (RT14078)]$ prove -wlv t/utf8.t t/utf8....1..4 ok 1 - Encode::decode_utf8() does not lvalue our bytes string var ok 2 - byte string and unicode string same string as far as humans are concerned ok 3 - run() w/ byte string # Failed test 'run() w/ unicode string' not ok 4 - run() w/ unicode string # at t/utf8.t line 26. Wide character in print at /System/Library/Perl/5.10.0/Test/Builder.pm line 1266. # got: 'string�' # expected: 'string…' # Looks like you failed 1 test of 4. dubious Test returned status 1 (wstat 256, 0x100) DIED. FAILED test 4 Failed 1/4 tests, 75.00% okay Failed Test Stat Wstat Total Fail List of Failed ------------------------------------------------------------------------------- t/utf8.t 1 256 4 1 4 Failed 1/1 test scripts. 1/4 subtests failed. Files=1, Tests=4, 0 wallclock secs ( 0.05 cusr + 0.01 csys = 0.06 CPU) Failed 1/1 test programs. 1/4 subtests failed. after 'use bytes' fix: [dmuey@multivac IPC-Run (RT14078)]$ prove -wlv t/utf8.t t/utf8....1..4 ok 1 - Encode::decode_utf8() does not lvalue our bytes string var ok 2 - byte string and unicode string same string as far as humans are concerned ok 3 - run() w/ byte string ok 4 - run() w/ unicode string ok All tests successful. Files=1, Tests=4, 0 wallclock secs ( 0.05 cusr + 0.01 csys = 0.06 CPU)
Subject: rt14078.tests.diff
commit 4b2111afd1beb4121d3c2a4e7b85586373cffc65 Author: Daniel Muey <dmuey@multivac.local> Date: Wed Jun 29 10:26:48 2011 -0500 simplify/robustify tests regarding utf-8 (via byte strings and unicode strings) diff --git a/t/utf8.t b/t/utf8.t index 75b5a59..9b0d668 100644 --- a/t/utf8.t +++ b/t/utf8.t @@ -1,30 +1,26 @@ -use Test; -BEGIN { plan tests => 3 }; +use Test::More tests => 4; -use IPC::Run qw(run); use strict; -require utf8; -use Encode; +use warnings; +use IPC::Run (); +use Encode (); -my $utf_in = do { use utf8; "résumé..." }; -my $bytes_in = pack("C*", unpack("C*", $utf_in)); +##### data setup and sanity check +my $unicode_string = "string\x{2026}"; +my $byte_string = "string\xE2\x80\xA6"; -# $utf_in and $bytes_in now contain the same bytes in Perl's memory -# (as per the pack/unpack above), and differ only by the UTF-8 flag. -# Let's make sure about the latter ($utf_in should have the flag set, -# and $bytes_in should have it cleared): -skip ( (! utf8->can("is_utf8") ? - "Skipped - Perl 5.8 required for this sub-test" : undef), - sub { utf8::is_utf8($utf_in) && ! utf8::is_utf8($bytes_in) }); +## make sure what we're doing doesn't incidentally change the data and that the data is what we expect +my $x = Encode::decode_utf8($byte_string); +isnt( $x, $byte_string, "Encode::decode_utf8() does not lvalue our bytes string var" ); +is( $unicode_string, Encode::decode_utf8($byte_string), "byte string and unicode string same string as far as humans are concerned" ); +##### actual IPC::Run::run() tests my $bytes_out; -# Test using the byte string: "cat" should be transparent. -run ["cat"], \$bytes_in, \$bytes_out; -ok($bytes_out, $bytes_in, - "non-transparent IPC::Run (or runaway cat ?)"); +## Test using the byte string: "cat" should be transparent. +IPC::Run::run( ["cat"], \$byte_string, \$bytes_out ); +is( $bytes_out, $byte_string, "run() w/ byte string" ); -# Same test using the UTF-8 string -run ["cat"], \$utf_in, \$bytes_out; -ok(decode_utf8($bytes_out), $utf_in, - "non-transparent IPC::Run when using UTF-8"); +## Same test using the Unicode string +IPC::Run::run( ["cat"], \$unicode_string, \$bytes_out ); +is( Encode::decode_utf8($bytes_out), $unicode_string, "run() w/ unicode string" );
Attached is the test file (instead of the diff) as requested. Show quoted text
> `git show HEAD` attached > > before 'use bytes' fix: > > [dmuey@multivac IPC-Run (RT14078)]$ prove -wlv t/utf8.t > t/utf8....1..4 > ok 1 - Encode::decode_utf8() does not lvalue our bytes string var > ok 2 - byte string and unicode string same string as far as humans are > concerned > ok 3 - run() w/ byte string > > # Failed test 'run() w/ unicode string' > not ok 4 - run() w/ unicode string > # at t/utf8.t line 26. > Wide character in print at /System/Library/Perl/5.10.0/Test/Builder.pm > line 1266. > # got: 'string�' > # expected: 'string…' > # Looks like you failed 1 test of 4. > dubious > Test returned status 1 (wstat 256, 0x100) > DIED. FAILED test 4 > Failed 1/4 tests, 75.00% okay > Failed Test Stat Wstat Total Fail List of Failed > -----------------------------------------------------------------------------
-- Show quoted text
> t/utf8.t 1 256 4 1 4 > Failed 1/1 test scripts. 1/4 subtests failed. > Files=1, Tests=4, 0 wallclock secs ( 0.05 cusr + 0.01 csys = 0.06 > CPU) > Failed 1/1 test programs. 1/4 subtests failed. > > after 'use bytes' fix: > > [dmuey@multivac IPC-Run (RT14078)]$ prove -wlv t/utf8.t > t/utf8....1..4 > ok 1 - Encode::decode_utf8() does not lvalue our bytes string var > ok 2 - byte string and unicode string same string as far as humans are > concerned > ok 3 - run() w/ byte string > ok 4 - run() w/ unicode string > ok > All tests successful. > Files=1, Tests=4, 0 wallclock secs ( 0.05 cusr + 0.01 csys = 0.06 > CPU)
Subject: utf8.t
use Test::More tests => 4; use strict; use warnings; use IPC::Run (); use Encode (); ##### data setup and sanity check my $unicode_string = "string\x{2026}"; my $byte_string = "string\xE2\x80\xA6"; ## make sure what we're doing doesn't incidentally change the data and that the data is what we expect my $x = Encode::decode_utf8($byte_string); isnt( $x, $byte_string, "Encode::decode_utf8() does not lvalue our bytes string var" ); is( $unicode_string, Encode::decode_utf8($byte_string), "byte string and unicode string same string as far as humans are concerned" ); ##### actual IPC::Run::run() tests my $bytes_out; ## Test using the byte string: "cat" should be transparent. IPC::Run::run( ["cat"], \$byte_string, \$bytes_out ); is( $bytes_out, $byte_string, "run() w/ byte string" ); ## Same test using the Unicode string IPC::Run::run( ["cat"], \$unicode_string, \$bytes_out ); is( Encode::decode_utf8($bytes_out), $unicode_string, "run() w/ unicode string" );
I've released The combined patches in 0.90_03. The patch can be found here: https://github.com/toddr/IPC-Run/commit/ed1dd80f31d42e50bf78c27f9b795f6f0741edb2