Bug #56846 for MongoDB: Strings are not flagged as containing UTF-8 data

Fri Apr 23 03:09:19 2010 whatson [...] gmail.com - Ticket created

Subject:	Strings are not flagged as containing UTF-8 data
Date:	Fri, 23 Apr 2010 17:08:55 +1000
To:	bug-MongoDB [...] rt.cpan.org
From:	Andrew Whatson <whatson [...] gmail.com>

Hi,

It seems that storing UTF-8 data in MongoDB works perfectly; however, when it is retrieved, the data is not flagged as UTF-8 data in Perl. This means that you must manually call utf8::encode() on your strings after retrieving them from MongoDB, which is a pain. In contrast, DBD::Pg handles this fine and my strings are automatically flagged as containing UTF-8 data.

I've attached a simple Perl script to demonstrate.

Regards,
Andrew

Message body is not shown because sender requested not to inline it.

Fri Apr 23 11:14:31 2010 KRISTINA [...] cpan.org - Correspondence added

The driver is, in fact, flagging strings as UTF8, that's the "problem." If you do:

    my @data = ( 'Åland Islands', );
    print "utf8? ".utf8::is_utf8($data[0])."\n";

you can see that Perl already thinks this is a UTF-8 string. If you dump it in this state, it gives you the ugly string:

    "\x{c5}land Islands",

If you call utf8::encode($str), this actually unsets the utf8 flag so utf8::is_utf8 returns 0. If you run utf8::is_utf8 on the strings returned from the database, you'll see that they are UTF8, they just print ugly.

I think it is more important to have the utf8 flag set than to have it print pretty. If $str doesn't have the utf8 flag set, has multibyte characters, and you try to find the length, you'll get the wrong value.

I confess that I find this confusing and somewhat backwards, so if I'm misunderstanding something, please reopen and let me know. I've reached this conclusion based on previous bugs (http://jira.mongodb.org/browse/PERL-59) and http://www.mail-archive.com/perl-xs@perl.org/msg01784.html.
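The flag and length behaviour described above can be reproduced without a database. This is a minimal sketch, not the driver's code: it assumes a string built from UTF-8 octets with utf8::decode() is a fair stand-in for what the driver returns, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Assumption: these UTF-8 octets stand in for a string read from the database.
my $str = "\xc3\x85land Islands";
utf8::decode($str);                      # octets -> characters, sets the utf8 flag

my $flag_chars = utf8::is_utf8($str) ? 1 : 0;
my $len_chars  = length $str;            # counts characters

my $octets = $str;                       # work on a copy
utf8::encode($octets);                   # characters -> UTF-8 octets, clears the flag

my $flag_octets = utf8::is_utf8($octets) ? 1 : 0;
my $len_octets  = length $octets;        # counts bytes

print "characters: flag=$flag_chars length=$len_chars\n";
print "octets:     flag=$flag_octets length=$len_octets\n";
```

The flagged string reports 13 characters, while the encoded copy reports 14 bytes because Å occupies two octets in UTF-8.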

Fri Apr 23 11:14:34 2010 The RT System itself - Status changed from 'new' to 'open'

Fri Apr 23 11:14:35 2010 KRISTINA [...] cpan.org - Status changed from 'open' to 'rejected'

Mon Apr 26 20:05:10 2010 whatson [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #56846] Strings are not flagged as containing UTF-8 data
Date:	Tue, 27 Apr 2010 10:04:54 +1000
To:	bug-MongoDB [...] rt.cpan.org
From:	Andrew Whatson <whatson [...] gmail.com>

Hi Kristina,

Thanks for your help. Following the references you provided has led me to understand that you are indeed correct. To make the test script above work correctly, you simply need to set the following:

    binmode STDOUT, ':utf8';

This prevents perl from downgrading the string on output, and everything works as expected. Note that Dumper will still output it as a mangled string (this is how perl sees it for the purpose of string length and character matching), but printing the string directly gives the desired output.

I understand that this is an artifact of perl's unicode handling, but perhaps a note about this behaviour somewhere in the MongoDB driver documentation would be useful in helping other programmers with similar issues.

Regards,
Andrew
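The effect of the :utf8 layer can be checked by printing to in-memory handles instead of STDOUT. This is a sketch under assumptions: the string is built by hand rather than fetched from MongoDB, and bytes_written() is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;

# Assumption: a flagged character string, as the driver is described as returning.
my $str = "\xc3\x85land Islands";
utf8::decode($str);

# Hypothetical helper: print $text to an in-memory handle opened with $mode
# and return the raw bytes that were written.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buf = '';
    open my $fh, $mode, \$buf or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return $buf;
}

my $plain = bytes_written('>', $str);        # no layer: Å is written as the single byte 0xC5
my $utf8  = bytes_written('>:utf8', $str);   # :utf8 layer: Å is written as 0xC3 0x85

printf "without layer: %d bytes\n", length $plain;
printf "with :utf8:    %d bytes\n", length $utf8;
```

A UTF-8 terminal shows the lone 0xC5 byte as a broken character, which is the "ugly" output seen before binmode was set on STDOUT.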

Mon Apr 26 20:05:15 2010 The RT System itself - Status changed from 'rejected' to 'open'

Tue Apr 27 14:10:52 2010 KRISTINA [...] cpan.org - Correspondence added

Good idea, I've added a section on strings to http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod (once the 0.33 release appears).

Tue Apr 27 14:10:57 2010 KRISTINA [...] cpan.org - Status changed from 'open' to 'resolved'
