
Preferred bug tracker

Please visit the preferred bug tracker to report your issue.

This queue is for tickets about the MongoDB CPAN distribution.

Maintainer(s)' notes

Please don't report bugs here. Please use the MongoDB Perl driver issue tracker instead.

Report information
The Basics
Id: 56846
Status: resolved
Priority: 0/
Queue: MongoDB

People
Owner: Nobody in particular
Requestors: whatson [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Strings are not flagged as containing UTF-8 data
Date: Fri, 23 Apr 2010 17:08:55 +1000
To: bug-MongoDB [...] rt.cpan.org
From: Andrew Whatson <whatson [...] gmail.com>
Hi,

It seems that storing UTF-8 data in MongoDB works perfectly, however when it is retrieved the data is not flagged as UTF-8 data in Perl. This means that you must manually call utf8::encode() on your strings after retrieving them from MongoDB, which is a pain.

In contrast, DBD::Pg handles this fine and my strings are automatically flagged as containing UTF-8 data.

I've attached a simple Perl script to demonstrate.

Regards,
Andrew


The driver is, in fact, flagging strings as UTF8, that's the "problem." If you do:

    my @data = ( 'Åland Islands', );
    print "utf8? ".utf8::is_utf8($data[0])."\n";

you can see that Perl already thinks this is a UTF-8 string. If you dump it in this state, it gives you the ugly string:

    "\x{c5}land Islands",

If you call utf8::encode($str), this actually unsets the utf8 flag so utf8::is_utf8 returns 0. If you run utf8::is_utf8 on the strings returned from the database, you'll see that they are UTF8, they just print ugly.

I think it is more important to have the utf8 flag set than to have it print pretty. If $str doesn't have the utf8 flag set, has multibyte characters, and you try to find the length, you'll get the wrong value.

I confess that I find this confusing and somewhat backwards, so if I'm misunderstanding something, please reopen and let me know. I've reached this conclusion based on previous bugs (http://jira.mongodb.org/browse/PERL-59) and http://www.mail-archive.com/perl-xs@perl.org/msg01784.html.
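The flag and length behaviour described above can be reproduced without a database. This is only a minimal sketch: it assumes the stored value is the UTF-8 encoding of 'Åland Islands' and uses the core Encode module to stand in for whatever decoding the driver does internally, which this ticket does not show. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these are the raw UTF-8 bytes of "Åland Islands" as stored.
# Byte escapes are used so the example does not depend on file encoding.
my $bytes = "\xc3\x85land Islands";

# Decoding yields a character string with Perl's internal utf8 flag set.
my $chars = decode('UTF-8', $bytes);

my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;   # 0: plain byte string
my $chars_flag = utf8::is_utf8($chars) ? 1 : 0;   # 1: character string

# utf8::encode works in place: characters become bytes, the flag is cleared.
my $round = $chars;
utf8::encode($round);
my $round_flag = utf8::is_utf8($round) ? 1 : 0;   # 0

printf "bytes:      length=%d flag=%d\n", length($bytes), $bytes_flag;   # 14, 0
printf "characters: length=%d flag=%d\n", length($chars), $chars_flag;   # 13, 1
printf "re-encoded: length=%d flag=%d\n", length($round), $round_flag;   # 14, 0
```

The byte string reports 14 because the Å occupies two bytes, while the decoded string reports the 13 characters a reader would count, which is the length discrepancy mentioned above.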
Subject: Re: [rt.cpan.org #56846] Strings are not flagged as containing UTF-8 data
Date: Tue, 27 Apr 2010 10:04:54 +1000
To: bug-MongoDB [...] rt.cpan.org
From: Andrew Whatson <whatson [...] gmail.com>
Hi Kristina,

Thanks for your help. Following the references you provided has led me to understand that you are indeed correct.

To make the test script above work correctly, you simply need to set the following:

    binmode STDOUT, ':utf8';

This prevents Perl from downgrading the string on output, and everything works as expected. Note that Dumper will still output it as a mangled string (this is how Perl sees it for the purpose of string length and character matching), but printing the string directly gives the desired output.

I understand that this is an artifact of Perl's Unicode handling, but perhaps a note about this behaviour somewhere in the MongoDB driver documentation would be useful in helping other programmers with similar issues.

Regards,
Andrew
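The effect of the output layer can be checked by looking at the bytes that actually get written. The sketch below is an illustration rather than the attached test script: it prints to in-memory filehandles instead of STDOUT (an assumption made so the result can be inspected), and the variable names are hypothetical.

```perl
use strict;
use warnings;

# A character string with the utf8 flag set, like one returned by the driver.
my $chars = "\x{c5}land Islands";
utf8::upgrade($chars);

# No layer: Perl downgrades on output and writes U+00C5 as the single byte 0xC5.
open(my $raw, '>', \my $raw_out) or die "open failed: $!";
print {$raw} $chars;
close $raw;

# With the :utf8 layer, as in the binmode call above: U+00C5 becomes 0xC3 0x85.
open(my $enc, '>', \my $enc_out) or die "open failed: $!";
binmode $enc, ':utf8';
print {$enc} $chars;
close $enc;

printf "no layer: %d bytes, first byte 0x%02x\n",
    length($raw_out), ord(substr($raw_out, 0, 1));    # 13 bytes, 0xc5
printf ":utf8:    %d bytes, first bytes 0x%02x 0x%02x\n",
    length($enc_out), ord(substr($enc_out, 0, 1)),
    ord(substr($enc_out, 1, 1));                      # 14 bytes, 0xc3 0x85
```

A terminal expecting UTF-8 cannot display the lone 0xC5 byte from the first case, which is why the string looks mangled until the layer is set.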
Good idea. I've added a section on strings to http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod; it will be visible there once the 0.33 release appears.