This queue is for tickets about the Data-Dumper CPAN distribution.

Report information
The Basics
Id: 28607
Status: rejected
Priority: 0/
Queue: Data-Dumper

People
Owner: Nobody in particular
Requestors: martini [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 2.12_02
Fixed in: (no value)



Subject: Data::Dumper::Dumper is dumping utf8 strings as latin1/8bit instead of utf8
Data::Dumper::Dumper is dumping utf8 strings as latin1/8bit instead of utf8.

How to reproduce:

<script1>
#!/usr/bin/perl -w
use strict;
use utf8;
use Encode;
use Data::Dumper;

my $I = {A => 'ü'};
if (Encode::is_utf8($I->{A})) {
    print "It is utf8.\n";
}
print Data::Dumper::Dumper($I);
</script1>

The output is:

me@madrid:~> ./ut.pl
It is utf8.
$VAR1 = {
          'A' => "\x{fc}"
        };
me@madrid:~>

\x{fc} is the 8bit/latin1 sign. :-( This should be \x{00C3}\x{00BC}.

Feedback would be fine!

Many thanks,

-Martini
From: martini [...] cpan.org
Any news? Any feedback?

-Martin
From: EDAVIS [...] cpan.org
I'm not the DD maintainer but I'd guess the following:

The output of Data::Dumper must depend on your locale settings. If it needs to output, say, LATIN SMALL LETTER U WITH DIAERESIS (U+00FC), then if the terminal expects Latin-1 output it needs to output the Latin-1 byte sequence. Obviously if the terminal is only capable of understanding Latin-1 then it would be malformed output to produce UTF-8.

Similarly, if the terminal is configured for UTF-8 (e.g. LC_ALL=en.UTF-8) then it would be malformed output to produce Latin-1 character sequences.

So, what are your locale environment variables set for?

% set | grep LC
% set | grep LANG
From: martini [...] cpan.org
Hi EDAVIS,

On Tue Dec 25 08:13:13 2007, EDAVIS wrote:
> I'm not the DD maintainer but I'd guess the following:
>
> The output of Data::Dumper must depend on your locale settings. If it
> needs to output, say, LATIN SMALL LETTER U WITH DIAERESIS (U+00FC), then
> if the terminal expects Latin-1 output it needs to output the Latin-1
> byte sequence. Obviously if the terminal is only capable of
> understanding Latin-1 then it would be malformed output to produce UTF-8.
>
> Similarly, if the terminal is configured for UTF-8 (e.g.
> LC_ALL=en.UTF-8) then it would be malformed output to produce Latin-1
> character sequences.
>
> So, what are your locale environment variables set for?
>
> % set | grep LC
> % set | grep LANG
This is the output:

me@madrid:~> set | grep LC
MAILCHECK=60
me@lancelot:~> set | grep LANG
LANG=de_DE.UTF-8
me@madrid:~>

It's perl 5.8.8:

me@madrid:~> perl --version

This is perl, v5.8.8 built for i586-linux-thread-multi

Copyright 1987-2006, Larry Wall
[...]

So the configuration looks OK, doesn't it?

Many thanks,

-Martin
From: Bernhard.Schmalhofer [...] gmx.de
On Wed Aug 01 2007, 05:55:10, MARTINI wrote:
> Data::Dumper::Dumper is dumping utf8 strings as latin1/8bit instead
> of utf8.
>
> How to reproduce:
> <script1>
> #!/usr/bin/perl -w
> use strict;
> use utf8;
> use Encode;
> use Data::Dumper;
>
> my $I = {A => 'ü'};
> if (Encode::is_utf8($I->{A})) {
>     print "It is utf8.\n";
> }
> print Data::Dumper::Dumper($I);
> </script1>
>
> The output is:
>
> me@madrid:~> ./ut.pl
> It is utf8.
> $VAR1 = {
>           'A' => "\x{fc}"
>         };
> me@madrid:~>
>
> \x{fc} is the 8bit/latin1 sign. :-( This should be \x{00C3}\x{00BC}.
>
Hi Martin,

I think that Data::Dumper does the right thing. Data::Dumper creates Perl code and quotes wide characters in strings with the \x notation. The code point for 'ü' is U+00FC; encoded in UTF-8 it is the two bytes 0xC3 and 0xBC. The \x notation is encoding-agnostic and uses the code point. Therefore "\x{fc}" is the same thing as a 'ü'.

Here is some sample code:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use Encode;
use Data::Dumper;
use DBI;

# print Unicode
binmode STDOUT, ':utf8';

my $I = {A => 'ü'};
if (Encode::is_utf8($I->{A})) {
    print "It is utf8.\n";
}
print Data::Dumper::Dumper($I);

# the same code point, written three ways
if ( "\x{00FC}" eq "\x{FC}" ) {
    print "an ü is an ü\n";
}
if ( 'ü' eq "\x{FC}" ) {
    print "an ü is still an ü\n";
}
if ( 'ü' eq "\x{00FC}" ) {
    print "an ü is still an ü\n";
}

# latin1 string with two characters
my $c3_bc = pack 'W2', 0xc3,0xbc;
print "'$c3_bc':" . DBI::data_string_desc( $c3_bc ), "\n";

# Unicode string with one character (the two bytes decoded as UTF-8)
my $decoded_c3_bc = Encode::decode_utf8( $c3_bc );
print "'$decoded_c3_bc':", DBI::data_string_desc( $decoded_c3_bc ), "\n";

print Dumper( $c3_bc, $decoded_c3_bc ), "\n";

# a two character string is never equal to a single character string
if ( $c3_bc eq $decoded_c3_bc ) {
    print "now I'm confused\n";
}
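
The distinction between the code point and its UTF-8 bytes can also be shown without DBI. The following is only a minimal sketch, not part of the original reply: it assumes the core Encode and Data::Dumper modules, and the variable names are illustrative. It suggests that the "\x{00C3}\x{00BC}" form expected in the report corresponds to the string after it has been explicitly encoded to UTF-8 bytes, not to the character string itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);
use Data::Dumper;

# One character, code point U+00FC, in the encoding-agnostic \x{} notation.
my $char = "\x{fc}";

# Explicit encoding turns the character into its two UTF-8 bytes.
my $bytes = encode('UTF-8', $char);

printf "characters: %d\n", length $char;    # 1
printf "bytes:      %d\n", length $bytes;   # 2
printf "byte values: %s\n",
    join ' ', map { sprintf '0x%02X', $_ } unpack 'C*', $bytes;

# Decoding the bytes gives back the original single character.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $char;

# Dump the byte string; $Data::Dumper::Useqq asks for double-quoted,
# escaped output. The exact escape style may vary between versions.
{
    local $Data::Dumper::Useqq = 1;
    print Dumper($bytes);
}
```

If the two-byte form is really what is wanted in a dump, encoding before calling Dumper, as sketched here, is one way to get it; the character string itself stays a single U+00FC.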
RT-Send-CC: EDAVIS [...] cpan.org, Bernhard.Schmalhofer [...] gmx.de
Hi,

I am marking this bug as rejected since Bernhard's explanation seems reasonable to me. (I am not *really* the DD maintainer either, but I seem to be the only one who tends to the bug list on rt.cpan.org. It is maintained by p5p.)

Martin, this is not intended as any kind of disrespect. We are thankful for bug reports. If you disagree with my closing this ticket, then, by all means, feel free to reopen the ticket with a simple reply.

Best regards,
Steffen