
This queue is for tickets about the Carp CPAN distribution.

Report information
The Basics
Id: 87434
Status: rejected
Priority: 0/
Queue: Carp

People
Owner: Nobody in particular
Requestors: dmaki [...] cpan.org
Cc: ether [...] cpan.org
AdminCc:

Bug Information
Severity: Normal
Broken in: 1.26
Fixed in: (no value)



Subject: feeding decoded utf8, and then passing encoded utf8 to Carp::carp() produces garbage.
I don't know if my suggested fix is feasible, but at least it fixed my issue. (Below code copied from https://gist.github.com/lestrrat/6112039)

    use strict;
    use Encode;
    use Carp;

    sub feedme_decoded_utf8 {
        my $encoded_utf8 = encode_utf8($_[0]);
        # This trace contains the string:
        #   "feedme_decoded_utf8($decoded_utf8) called at..."
        # which gets concatenated with $encoded_utf8,
        # which causes garbage to be printed
        carp $encoded_utf8;
    }

    sub feedme_encoded_utf8 {
        my $decoded_utf8 = decode_utf8($_[0]);
        # This trace contains the string:
        #   "feedme_encoded_utf8($encoded_utf8) called at..."
        # which gets concatenated with $decoded_utf8,
        # but $encoded_utf8 is properly printed because
        # Carp::format_arg properly escapes, so no garbage
        # is printed.
        carp $decoded_utf8;
    }

    # prints out garbage
    feedme_decoded_utf8( decode_utf8("テスト") );

    # prints out ok
    feedme_encoded_utf8( "テスト" );

My suggestion to fix this: steal Data::Dumper::qquote(), and put it right here:
https://metacpan.org/source/ZEFRAM/Carp-1.26/lib/Carp.pm#L201
Subject: Re: [rt.cpan.org #87434] feeding decoded utf8, and then passing encoded utf8 to Carp::carp() produces garbage.
Date: Sat, 24 Aug 2013 00:01:08 +0100
To: Daisuke Maki via RT <bug-Carp [...] rt.cpan.org>
From: Zefram <zefram [...] fysh.org>
You have in part brought your problem on yourself. The line:
> carp $encoded_utf8;
is requesting output of the characters corresponding to the octets of the UTF-8 encoding. These are characters from the high half of Latin-1, not the characters that the UTF-8 octets represent. You have asked for mojibake, and shouldn't be surprised that you got it.

However, another part of your situation comes from awkward behaviour of Perl. By default, Perl treats a textual I/O handle as using the Latin-1 encoding. This means that a string containing non-Latin-1 characters cannot be correctly emitted. But if an attempt is made to output such a string, Perl does not treat it as a fatal error; instead, it treats the output handle as using UTF-8 encoding *just for that string*. It also emits the warning "Wide character" to tell you it's doing that.

I think, from your description of the output, that your terminal is actually expecting UTF-8. This means that Perl's default treatment of output handles (Latin-1, remember) is not correct for you. Wherever you use non-ASCII characters, you *should* tell Perl what encoding to use on I/O handles. That you do not is a bug: you are not using Perl correctly.

Now, I'll examine the output you get. In the feedme_encoded_utf8 case, the sub's parameter is a string of 9 octets with their high bit set, and $decoded_utf8 ends up containing three katakana characters. When you try to output that string as a warning, these characters can't go into the output stream properly, because they have no encoding in Latin-1. So Perl issues a warning and then outputs their UTF-8 encoding instead. The UTF-8 happens to be what your terminal is expecting, so your terminal shows you the katakana. The two bugs (yours and Perl's) cancel out, and so by accident you see the string Perl was actually trying to show you, which is the katakana.
In the feedme_decoded_utf8 case, the sub's parameter is a string of three katakana characters, and $encoded_utf8 ends up containing 9 octets with their high bit set, or equivalently 9 characters from the upper half of Latin-1 (mostly accented Latin letters). When you try to output that as a warning, this 9-character string that you supplied gets concatenated with the stack trace. For murky historical reasons, the sub's parameter doesn't get quoted in this case, so the stack trace contains the actual katakana characters. So the complete warning string includes those accented Latin letters, and (further on) three katakana characters. As before, the katakana won't fit into Latin-1 encoding for output, so, as before, Perl outputs the warning message encoded in UTF-8 instead. This means that, as before, by accident you correctly see the string Perl was actually trying to show you. You see the mojibake that you asked for.

If Carp is amended to quote all subroutine arguments, which is essentially what you suggest, this changes the feedme_decoded_utf8 case. With the sub's parameter represented in ASCII, the complete warning string now does not contain any katakana. Of course, it still contains the 9 high-half Latin-1 characters, the mojibake, that you supplied as the primary message. Now the entire warning string can be encoded in Latin-1, so Perl does encode it that way for output, and does not emit a "Wide character" warning. But because your terminal's character encoding is mismatched with Perl's treatment of the output handle, you don't correctly see the mojibake that you asked for. Instead, because your terminal is expecting UTF-8, it accidentally undoes the mojibake, and shows you the three katakana characters. Although you preferred this output, it is not correct.

I think it *is* sensible for Carp to escape all subroutine arguments, representing them entirely in ASCII. But this is not for your reasons.

-zefram
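[Editor's note: the byte-level argument above can be checked outside Perl. The following is a sketch in Python, chosen because the octet arithmetic is language-independent; it is not code from the thread.]

```python
katakana = "テスト"                    # three katakana characters
octets = katakana.encode("utf-8")      # their 9-octet UTF-8 encoding

# All 9 octets have the high bit set, so read as Latin-1 "characters"
# they come from the upper half of Latin-1, exactly as described:
assert len(octets) == 9
assert all(b >= 0x80 for b in octets)

# Perl's fallback: the katakana can't be encoded in Latin-1, so the
# warning string is emitted as UTF-8 instead.  A UTF-8 terminal then
# shows the katakana - the two bugs cancel out:
assert octets.decode("utf-8") == katakana

# The mojibake case: treating the octets as Latin-1 characters and
# re-encoding as UTF-8 double-encodes the string, so a UTF-8 terminal
# shows accented Latin letters instead of katakana:
mojibake = octets.decode("latin-1")
assert mojibake != katakana
assert mojibake.encode("utf-8") != octets
```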
What what what?! No, that's not what's happening at all. It has nothing to do with terminal settings. This is caused by Carp casually assuming that it's safe to append stack traces without checking whether the strings are decoded or encoded.

Let's take this example back:

    sub feedme_decoded_utf8 {
        my $encoded_utf8 = encode_utf8($_[0]);
        # This trace contains the string:
        #   "feedme_decoded_utf8($decoded_utf8) called at..."
        # which gets concatenated with $encoded_utf8,
        # which causes garbage to be printed
        carp $encoded_utf8;
    }

When you pass decoded UTF-8 into this sub and carp() gets called, it tries to generate a stack trace string. carp() got encoded UTF-8, and this line:

https://metacpan.org/source/ZEFRAM/Carp-1.26/lib/Carp.pm#L279

casually joins the encoded string, and then attempts to generate the rest of the stack trace. At L279, $err contains an encoded UTF-8 byte sequence. However, @_ for feedme_decoded_utf8() contains decoded UTF-8, which doesn't get escaped because its utf8 flag is on:

https://metacpan.org/source/ZEFRAM/Carp-1.26/lib/Carp.pm#L204

So that means we now get $encoded_utf8 + $decoded_utf8 = garbled.

In the second case, $err is made by concatenating $decoded_utf8. @_ for feedme_encoded_utf8() contains $encoded_utf8, so is_utf8() is false, and hence the escape mechanism kicks in, making everything Latin-1 compatible. Now you get $decoded_utf8 + $latin1_whatever = printable.

It doesn't matter what terminal settings I have: the first example will always be a concatenation of encoded octets and decoded characters, so it will never print properly.
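[Editor's note: the concatenation hazard lestrrat describes can be sketched in Python; this is an assumed translation of Perl's behavior, where joining a byte string with a utf8-flagged string implicitly upgrades the bytes as if they were Latin-1. Not code from the thread.]

```python
decoded = "テスト"                 # character string (utf8 flag on)
encoded = decoded.encode("utf-8")  # byte string (utf8 flag off)

# Perl's implicit upgrade of the byte half is equivalent to decoding
# it as Latin-1 before concatenating with the character half:
combined = encoded.decode("latin-1") + decoded

# Writing the result as UTF-8 double-encodes the first half: it grows
# from 9 octets to 18, so a UTF-8 terminal shows garbage there...
assert len(combined.encode("utf-8")) == 18 + len(encoded)

# ...while the second half round-trips cleanly and stays readable:
assert combined.encode("utf-8").endswith(encoded)
```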
Subject: Re: [rt.cpan.org #87434] feeding decoded utf8, and then passing encoded utf8 to Carp::carp() produces garbage.
Date: Sat, 24 Aug 2013 00:50:29 +0100
To: Daisuke Maki via RT <bug-Carp [...] rt.cpan.org>
From: Zefram <zefram [...] fysh.org>
Daisuke Maki via RT wrote:
> No, that's not what's happening at all. It has nothing to do with
> terminal settings.
Terminal settings are a factor in what you actually see. Since you didn't explicitly say what output you're getting, but just say whether it's "garbage", I presume that you're examining the output by sending it to a terminal. If you send it through od or something like that, we can talk about what octets come out. The rest of the analysis will be the same.
> This is caused by Carp casually assuming that it's safe to append
> stacktraces w/o checking if it's decoded / encoded.
There is no way to check whether some string that Carp finds is intended to be treated as an octet string ("encoded") or as a character string ("decoded"). The only exception is the main argument to carp() (or cluck(), etc.), which, as it's a message, is implicitly a character string. If you pass in an octet string there, you're asking for mojibake. -zefram
Are you suggesting that this is supposed to show up correctly based on my terminal settings?

    perl -Mutf8 -MEncode -E 'say "日本語" . Encode::encode_utf8("日本語")'

Because that's basically what Carp is currently doing to its stack trace.
RT-Send-CC: zefram [...] fysh.org
The issue is with UTF-8 automatic upgrading during string concatenation. More concrete code is here: http://nopaste.64p.org/entry/0F981E04-0C63-11E3-9DDC-3AC23A9B6EE1 and the generated result is here: http://gyazo.64p.org/image/a971f9035cd5593ba7840d9dd9e90a41.png

The problem is that "て(flagged string)" is a flagged string, but it's not downgraded correctly in the format_arg function:

    % perl -e 'use utf8; $x="て(flagged string)"; warn utf8::downgrade($x, 1);'
    Warning: something's wrong at -e line 1.
    % perl -e 'use utf8; $x="て(flagged string)"; warn utf8::downgrade($x);'
    Wide character in subroutine entry at -e line 1.

utf8::downgrade can't downgrade Japanese characters. So the flagged UTF-8 string returned from format_arg gets concatenated with the other, encoded strings. That makes mojibake.
On Fri Aug 23 19:50:45 2013, zefram@fysh.org wrote:
> > This is caused by Carp casually assuming that it's safe to append
> > stacktraces w/o checking if it's decoded / encoded.
> There is no way to check whether some string that Carp finds is intended
> to be treated as an octet string ("encoded") or as a character string
> ("decoded").
I agree, and that's the source of all these problems.
> The only exception is the main argument to carp() (or cluck(), etc.),
> which, as it's a message, is implicitly a character string.
That's new to me. I read perldoc Carp and didn't find any mention of it. Since you said "implicitly", I guess it's ... implicit :)
> If you pass in an octet string there, you're asking for mojibake.
I agree with you that if you pass in octets, with or without binmoding STDERR, the semantics of carp() are kind of undefined. However, what is annoying here is that when you pass octets (say UTF-8) to Carp::carp(), what gets printed to STDERR *depends on how the function in question is called*. Consider:

    use Encode;
    use Carp;

    sub f {
        my $b = encode_utf8($_[0]);
        carp $b;
    }

    sub g {
        carp $_[0];
    }

    f decode_utf8('テスト');
    g 'テスト';
    warn 'テスト';

We call carp() twice and warn() once, all with the UTF-8 octets of テスト as an argument. What will be printed (in a UTF-8 terminal) is:

    Wide character in warn at /Users/miyagawa/.plenv/versions/5.18.1/lib/perl5/5.18.1/Carp.pm line 102.
    ãã¹ã at - line 6.
            main::f('テスト') called at - line 13
    テスト at - line 10.
            main::g('\x{e3}\x{83}\x{86}\x{e3}\x{82}\x{b9}\x{e3}\x{83}\x{88}') called at - line 14
    テスト at - line 15.

Again, I agree that the semantics are specific to the terminal environment, but I think at least the output should be consistent. Only the first one (sub f) gets mojibake, precisely for the reason lestrrat and tokuhirom explained (the strings were concatenated internally inside Carp.pm and get auto-promoted from Latin-1 to UTF-8). I wish it would rather print:

    テスト at - line 6
            main::f('\x{30c6}\x{30b9}\x{30c8}') called at - line 13

instead - and I believe that's what the original post requested.
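[Editor's note: the output miyagawa wishes for amounts to escaping every non-ASCII character of a subroutine argument, flagged or not. A hypothetical sketch in Python; the helper name escape_arg is invented here and is not Carp's API.]

```python
def escape_arg(s):
    """Render every non-ASCII character as \\x{...}, mimicking the
    format Carp uses for unflagged high-bit strings."""
    return "".join(c if ord(c) < 0x80 else "\\x{%x}" % ord(c) for c in s)

# The flagged character string: escaped as code points, as wished for:
assert escape_arg("テスト") == "\\x{30c6}\\x{30b9}\\x{30c8}"

# The unflagged octet string: escaped octet by octet, matching what
# Carp already prints for sub g in the example above:
octets_as_latin1 = "テスト".encode("utf-8").decode("latin-1")
assert escape_arg(octets_as_latin1) == (
    "\\x{e3}\\x{83}\\x{86}\\x{e3}\\x{82}\\x{b9}\\x{e3}\\x{83}\\x{88}")
```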
On Sat Aug 24 01:07:05 2013, MIYAGAWA wrote:
> Consider:
>
>     use Encode;
>     use Carp;
>
>     sub f {
>         my $b = encode_utf8($_[0]);
>         carp $b;
>     }
>
>     sub g {
>         carp $_[0];
>     }
>
>     f decode_utf8('テスト');
>     g 'テスト';
>     warn 'テスト';
Compare, by changing 'テスト' to "Léon", saved in UTF-8:

    use strict;
    use Encode;
    use Carp;

    sub f {
        my $b = encode_utf8($_[0]);
        carp $b;
    }

    sub g {
        carp $_[0];
    }

    f decode_utf8("Léon");
    g "Léon";
    warn "Léon";

prints:

    Léon at tmp/f.pl line 7.
            main::f('L\x{e9}on') called at tmp/f.pl line 14
    Léon at tmp/f.pl line 11.
            main::g('L\x{c3}\x{a9}on') called at tmp/f.pl line 15
    Léon at tmp/f.pl line 16.

You said "When you feed octets, you're asking for mojibake" - but we don't get mojibake here in any of the three cases, with the same code, and it's not fair :)

I guess you don't need a technical explanation, but for the record, the actual reason that this works is that downgrade() in line 201 effectively strips the utf8 flag, and then is_utf8() in line 211 returns false and replaces all high-bit characters with \x{XX} - it doesn't work that way with wide (non-Latin-1) characters.

    197 # Quote it?
    198 # Downgrade, and use [0-9] rather than \d, to avoid loading
    199 # Unicode tables, which would be liable to fail if we're
    200 # processing a syntax error.
    201 downgrade($arg, 1);
    202 $arg = "'$arg'" unless $arg =~ /^-?[0-9.]+\z/;

This comment somehow lets me think that the call to downgrade() (and is_utf8() later returning false) is accidental.
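[Editor's note: the Léon/katakana asymmetry comes down to utf8::downgrade succeeding only when every character fits in one octet. A rough Python analogue; the function downgrade below is a sketch of utf8::downgrade($s, 1), not real Carp code.]

```python
def downgrade(s):
    """Sketch of Perl's utf8::downgrade($s, 1): turn a character
    string into octets if (and only if) every character is <= 0xFF."""
    try:
        return s.encode("latin-1"), True   # all chars fit: flag dropped
    except UnicodeEncodeError:
        return s, False                    # wide chars: string unchanged

# "é" is U+00E9, so the Léon string downgrades, is_utf8() then returns
# false, and the high-bit character gets escaped as \x{e9}:
_, ok = downgrade("L\u00e9on")
assert ok

# "て" is U+3066, so the katakana string cannot be downgraded and goes
# through format_arg with its utf8 flag still on, unescaped:
_, ok = downgrade("て(flagged string)")
assert not ok
```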