Bug #33989 for encoding-warnings: encoding::warnings implicitly converts 8-bit literals

Tue Mar 11 09:37:02 2008 allter [...] gmail.com - Ticket created

Subject:

encoding::warnings implicitly converts 8-bit literals

When encoding::warnings is "use"d, literals (quoted strings) containing 8-bit data are implicitly converted to Unicode using the latin-1 converter, without the promised warning. After digging into it, I found the cause: import sets the ${^ENCODING} global variable to an object of its own class. The Perl interpreter then calls ${^ENCODING}->cat_decode( .. ), which converts all 8-bit literals in the source to Unicode strings, and this can lead to miscellaneous side effects.

The solution is to make "sub cat_decode" more complex (like "sub decode"). The attached file is a simple test script that demonstrates the side effect. If the maintainer is interested, I can prepare a patch for this issue since I have some ideas on it.

INFO:

$encoding::warnings::VERSION = '0.11';

bash-3.2$ perl -v
This is perl, v5.10.0 built for cygwin-thread-multi-64int
(with 3 registered patches, see perl -V for more detail)

bash-3.2$ uname -a
CYGWIN_NT-5.0 perldev1 1.5.25(0.156/4/2) 2007-12-14 19:21 i686 Cygwin

Subject:

scope.t

#!/usr/bin/perl
use Test::Simple tests => 3;

my $byte_string_8bit = 'ÀÁÂÃÄÅ'; # Bytes corresponding to first 6 letters of cyrillic ABC @cp1251

my $again_byte_string_8bit = do {
    use encoding::warnings; # Comment this line to see how things must work
    'ÀÁÂÃÄÅ'; # Again this string must be a byte string since we requested only warnings
};

my $another_byte_string_8bit = 'ÀÁÂÃÄÅ'; # Must be a byte string because it's out of scope of "use encoding::warnings" (it could even be in an unrelated place in other file)

ok utf8::is_utf8( $byte_string_8bit ) == 0, 'Byte strings are byte strings';
ok utf8::is_utf8( $again_byte_string_8bit ) == 0, 'Again byte strings are STILL byte strings';
ok utf8::is_utf8( $another_byte_string_8bit ) == 0, 'Another byte strings are STILL byte strings';
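To illustrate why the latin-1 conversion matters for the cp1251 bytes used in scope.t, here is a minimal sketch that does not load encoding::warnings at all. It only emulates the effect with Encode, on the assumption that the implicit conversion behaves like a plain iso-8859-1 decode. The variable names are hypothetical, and the bytes are written as escapes so the example does not depend on the encoding of the file.

```perl
use strict;
use warnings;
use Encode ();

# Three raw bytes: the first three Cyrillic letters in cp1251.
my $bytes = "\xC0\xC1\xC2";

# What the author of the script means by these bytes.
my $as_cp1251 = Encode::decode('cp1251', $bytes);

# What an implicit latin-1 conversion yields instead (assumed equivalent
# to decoding as iso-8859-1).
my $as_latin1 = Encode::decode('iso-8859-1', $bytes);

printf "cp1251 reading:  U+%04X\n", ord $as_cp1251;
printf "latin-1 reading: U+%04X\n", ord $as_latin1;

# Writing the latin-1 reading out as UTF-8 no longer reproduces the input.
my $utf8_out = Encode::encode('UTF-8', $as_latin1);
printf "bytes in: %d, bytes out: %d\n", length $bytes, length $utf8_out;
```

The first character comes out as a Cyrillic code point under cp1251 but as a Latin-1 accented letter under the implicit conversion, and the re-encoded output is twice as long as the original byte string.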

Tue Mar 11 10:06:44 2008 allter [...] gmail.com - Correspondence added

From:

allter [...] gmail.com

As an addition, this issue is tightly related to the same issue in encoding::source that I found: http://rt.cpan.org/Public/Bug/Display.html?id=33990

Tue Mar 11 10:18:59 2008 audreyt [...] audreyt.org - Correspondence added

Subject:	Re: [rt.cpan.org #33989] encoding::warnings implicitly converts 8-bit literals
Date:	Tue, 11 Mar 2008 22:18:31 +0800
To:	bug-encoding-warnings [...] rt.cpan.org
From:	Audrey Tang <audreyt [...] audreyt.org>

Andrey M. Smirnov via RT wrote:

> If the maintainer is interested, I can prepare a patch for this issue
> since I have some ideas on it.

Sure! Please do. Audrey

Tue Mar 11 10:19:06 2008 The RT System itself - Status changed from 'new' to 'open'

Fri Jun 13 21:26:03 2008 allter [...] gmail.com - Correspondence added

From:

allter [...] gmail.com

On Tue Mar 11 09:37:02 2008, allter wrote:

> The solution is to make "sub cat_decode" more complex (like "sub decode").

[...]

> If the maintainer is interested, I can prepare a patch for this issue
> since I have some ideas on it.

Unfortunately, perl sets the utf8 flag on strings received from cat_decode whenever ${^ENCODING} is set (i.e. at compile time):

toke.c, line 11858:

    if (has_utf8 || PL_encoding)
        SvUTF8_on(sv);

So a complete solution is [probably?] impossible. I will think about how best to implement half-measures. Meanwhile, any ideas are welcome...
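A rough sketch of why the unconditional flag is a problem, based on my reading of the toke.c line above and not verified against the interpreter: if cat_decode handed the bytes back untouched, the flag would still be switched on afterwards. Here Encode::_utf8_on stands in for SvUTF8_on, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

my $literal = "\xC0\xC1\xC2";   # raw cp1251 bytes, utf8 flag off

# Pretend cat_decode returned the bytes unchanged, then flag them the way
# the tokenizer would (assumption: _utf8_on mimics SvUTF8_on here).
my $flagged = $literal;
Encode::_utf8_on($flagged);

print "flag on:     ", (utf8::is_utf8($flagged) ? "yes" : "no"), "\n";
print "well-formed: ", (utf8::valid($flagged)   ? "yes" : "no"), "\n";
```

Under these assumptions the string ends up flagged as UTF-8 while its bytes are not valid UTF-8, so a cat_decode that simply passes bytes through would not be enough on its own.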