Bug #28607 for Data-Dumper: Data::Dumper::Dumper is not dumping utf8 strings as latin1/8bit instead of utf8

Wed Aug 01 05:55:10 2007 me+cpan [...] bogen.net - Ticket created

Subject:

Data::Dumper::Dumper is not dumping utf8 strings as latin1/8bit instead of utf8

Data::Dumper::Dumper is dumping utf8 strings as latin1/8bit instead of utf8.

How to reproduce:

<script1>
#!/usr/bin/perl -w
use strict;
use utf8;
use Encode;
use Data::Dumper;

my $I = {A => 'ü'};
if (Encode::is_utf8($I->{A})) {
    print "It is utf8.\n";
}
print Data::Dumper::Dumper($I);
</script1>

The output is:

me@madrid:~> ./ut.pl
It is utf8.
$VAR1 = {
          'A' => "\x{fc}"
        };
me@madrid:~>

\x{fc} is the 8bit/latin1 sign. :-( This should be \x{00C3}\x{00BC}.

Feedback would be fine!

Many thanks,

-Martini

Mon Dec 10 18:48:10 2007 me+bitcard [...] bogen.net - Correspondence added

From:

martini [...] cpan.org

Any News? Any feedback?

-Martin

Mon Dec 10 18:48:12 2007 The RT System itself - Status changed from 'new' to 'open'

Tue Dec 25 08:13:13 2007 EDAVIS [...] cpan.org - Correspondence added

From:

EDAVIS [...] cpan.org

I'm not the DD maintainer but I'd guess the following:

The output of Data::Dumper must depend on your locale settings. If it needs to output, say, LATIN SMALL LETTER U WITH DIAERESIS (U+00FC), then if the terminal expects Latin-1 output it needs to output the Latin-1 byte sequence. Obviously if the terminal is only capable of understanding Latin-1 then it would be malformed output to produce UTF-8.

Similarly, if the terminal is configured for UTF-8 (e.g. LC_ALL=en.UTF-8) then it would be malformed output to produce Latin-1 character sequences.

So, what are your locale environment variables set for?

% set | grep LC
% set | grep LANG
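To make the two byte sequences mentioned above concrete, here is a minimal sketch using the core Encode module. The variable names are illustrative only. It shows what each encoding produces for U+00FC and nothing more; it makes no claim that Data::Dumper itself consults the locale.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The single character LATIN SMALL LETTER U WITH DIAERESIS (U+00FC).
my $u_umlaut = "\x{fc}";

# Encode the same character into the two byte sequences discussed above.
my $latin1 = encode('ISO-8859-1', $u_umlaut);
my $utf8   = encode('UTF-8',      $u_umlaut);

# Print each byte sequence as hex.
printf "Latin-1: %s\n", unpack('H*', $latin1);   # fc
printf "UTF-8:   %s\n", unpack('H*', $utf8);     # c3bc
```

The character is the same in both cases; only the bytes written to a terminal or file differ.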

Tue Dec 25 08:42:26 2007 me+bitcard [...] bogen.net - Correspondence added

From:

martini [...] cpan.org

Hi EDAVIS,

On Tue Dec 25 08:13:13 2007, EDAVIS wrote:

> So, what are your locale environment variables set for?
>
> % set | grep LC
> % set | grep LANG

This is the output:

me@madrid:~> set | grep LC
MAILCHECK=60
me@lancelot:~> set | grep LANG
LANG=de_DE.UTF-8
me@madrid:~>

It's perl 5.8.8:

me@madrid:~> perl --version
This is perl, v5.8.8 built for i586-linux-thread-multi
Copyright 1987-2006, Larry Wall
[...]

So the configuration looks OK, doesn't it?

Many thanks,

-Martin

Fri Feb 19 10:33:59 2010 Bernhard.Schmalhofer [...] gmx.de - Correspondence added

From:

Bernhard.Schmalhofer [...] gmx.de

On Wed Aug 01 2007, 05:55:10, MARTINI wrote:

> \x{fc} is the 8bit/latin1 sign. :-( This should be \x{00C3}\x{00BC}.

Hi Martin,

I think that Data::Dumper does the right thing. Data::Dumper creates Perl code, and quotes wide characters in strings with the \x notation. The code point for 'ü' is U+00FC; encoded in UTF-8 it is the two bytes 0xC3 and 0xBC. The \x notation is encoding-agnostic and uses the code point. Therefore "\x{fc}" is the same thing as an 'ü'.

Here is some sample code:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use Data::Dumper;
use DBI;

# print Unicode
binmode STDOUT, ':utf8';

my $I = {A => 'ü'};
if (Encode::is_utf8($I->{A})) {
    print "It is utf8.\n";
}
print Data::Dumper::Dumper($I);

if ( "\x{00FC}" eq "\x{FC}" ) {
    print "an ü is an ü\n";
}
if ( 'ü' eq "\x{FC}" ) {
    print "an ü is still an ü\n";
}
if ( 'ü' eq "\x{00FC}" ) {
    print "an ü is still an ü\n";
}

# latin1 string with two characters
my $c3_bc = pack 'W2', 0xc3,0xbc;
print "'$c3_bc':" . DBI::data_string_desc( $c3_bc ), "\n";

# Unicode string with two characters
my $decoded_c3_bc = Encode::decode_utf8( $c3_bc );
print "'$decoded_c3_bc':", DBI::data_string_desc( $decoded_c3_bc ), "\n";

print Dumper( $c3_bc, $decoded_c3_bc ), "\n";

# a two character string is never equal to a single character string
if ( $c3_bc eq $decoded_c3_bc ) {
    print "now I'm confused\n";
}
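The point that the dump is Perl code rather than encoded output can also be checked with a round trip. The following is a minimal sketch, not part of the original sample. The variable names are illustrative, and utf8::upgrade is used here only as an assumed stand-in for a 'ü' literal under "use utf8", so that the snippet stays pure ASCII.

```perl
use strict;
use warnings;
use Data::Dumper;

# One character, code point U+00FC. utf8::upgrade switches the string to
# Perl's internal UTF-8 representation without changing its value.
my $char = "\x{fc}";
utf8::upgrade($char);

# Terse mode drops the "$VAR1 = " prefix so the dump is a bare expression.
$Data::Dumper::Terse = 1;
my $dumped = Dumper($char);

# Evaluate the dumped Perl code to get the value back.
my $restored = eval $dumped;
die $@ if $@;

print "dumped: $dumped";
print "round trip ok\n" if $restored eq $char && length($restored) == 1;
```

If the restored string compares equal to the original and is still one character long, the escape in the dump stands for the character itself and not for any particular byte encoding.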

Tue Jan 04 10:25:58 2011 smueller [...] cpan.org - Correspondence added

RT-Send-CC:

EDAVIS [...] cpan.org, Bernhard.Schmalhofer [...] gmx.de

Hi,

I am marking this bug as rejected since Bernhard's explanation seems reasonable to me. (I am not *really* the DD maintainer either, but I seem to be the only one who tends to the bug list on rt.cpan.org. It is maintained by p5p.)

Martin, this is not intended as any kind of disrespect. We are thankful for bug reports. If you disagree with my closing this ticket, then, by all means, feel free to reopen the ticket with a simple reply.

Best regards,
Steffen

Tue Jan 04 10:25:59 2011 smueller [...] cpan.org - Status changed from 'open' to 'rejected'
