Bug #90744 for DBIx-Class: Be more clear about the UTF8 flag

Sun Nov 24 04:09:11 2013 andreas [...] andreasvoegele.com - Ticket created

Subject:

Be more clear about the UTF8 flag

It took me a while to figure that the UTF8 flag always has to be set when writing to a database that supports unicode characters. The text in DBIx::Class::Manual::Cookbook says that "it is better to avoid encoding/decoding data and to use your database's own unicode capabilities" but I was mislead by that statement. My program gets text from another source. The text is properly decoded to characters but internally not represented in UTF-8. If this text is written to a file or the console with proper encoding settings everything is fine. When writing to a properly configured MySQL database and mysql_enable_utf8 set, I get an error from find_or_create(). Only after reading DBIx::Class::UTF8Columns I realized that there's a long standing bug in DBIx::Class and that I have to force Perl to represent the text internally in UTF-8 as create() will treat the text as a byte sequence. I now call decode_utf8(encode_utf8($text)) in order to ensure that the UTF-8 flag is set. The cookbook should make it more clear that the UTF8 flag needs to be set. At first I didn't read DBIx::Class::UTF8Columns as the cookbook says that it should not be "used with a database that does correctly handle unicode" but in order to understand the problem you have to know about this bug.

Fri Dec 27 01:05:04 2013 ribasushi [...] leporine.io - Correspondence added

Hi, sorry for the delayed reply On Sun Nov 24 04:09:11 2013, andreas@andreasvoegele.com wrote: Show quoted text

> It took me a while to figure that the UTF8 flag always has to be set > when writing to a database that supports unicode characters. The text > in DBIx::Class::Manual::Cookbook says that "it is better to avoid > encoding/decoding data and to use your database's own unicode > capabilities" but I was mislead by that statement. > > My program gets text from another source. The text is properly decoded > to characters but internally not represented in UTF-8. If this text is > written to a file or the console with proper encoding settings > everything is fine. When writing to a properly configured MySQL > database and mysql_enable_utf8 set, I get an error from > find_or_create().

Can you elaborate what that error was? Show quoted text

> Only after reading DBIx::Class::UTF8Columns

This module is to never be used in new code, as per the stern warning in its documentation. Show quoted text

> I realized that there's a > long standing bug in DBIx::Class and that I have to force Perl to > represent the text internally in UTF-8 as create() will treat the text > as a byte sequence. I now call decode_utf8(encode_utf8($text)) in > order to ensure that the UTF-8 flag is set.

This is most definitely very very wrong. Please do get back with more info about your original problem - you are papering over the issue with this erroneous decode_utf8() call. Show quoted text

> > The cookbook should make it more clear that the UTF8 flag needs to be > set. At first I didn't read DBIx::Class::UTF8Columns as the cookbook > says that it should not be "used with a database that does correctly > handle unicode" but in order to understand the problem you have to > know about this bug.

Again - I think you misdiagnosed the issue. More info is needed to act on this. I will keep this ticket open for some time, pending a reply from you so we can diagnose the *real* issue you encountered. Cheers

Fri Dec 27 01:05:06 2013 The RT System itself - Status changed from 'new' to 'open'

Fri Dec 27 03:35:32 2013 andreas [...] andreasvoegele.com - Correspondence added

From:

andreas [...] andreasvoegele.com

On Fri Dec 27 01:05:04 2013, RIBASUSHI wrote: Show quoted text

> Hi, sorry for the delayed reply > > On Sun Nov 24 04:09:11 2013, andreas@andreasvoegele.com wrote:

> > [...] When writing to a properly configured MySQL database and > > mysql_enable_utf8 set, I get an error from find_or_create().

> > Can you elaborate what that error was?

My program uses a web scraping module that retrieves text from the web. The web scraper decodes the text properly but Perl stores the data internally in ISO-8859-1. When writing text that contains umlaut characters to the database I get the following error message: DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::mysql::st execute failed: Duplicate entry 'Der Hobbit 2 - Smaugs Ein' for key 'PRIMARY' [for Statement "INSERT INTO film ( title) VALUES ( ? )" with ParamValues: 0='Der Hobbit 2 - Smaugs Ein de (3D)'] at lib/MyApp/CinemaListings/Cinema.pm line 40 perlunifaq says that "the internal format is either ISO-8859-1 (latin-1), or utf8, depending on the history of the string". Accordings to DBIx::Class::UTF8Column "a bug was found deep in the core of DBIx::Class which affects any component attempting to perform encoding/decoding by overloading store_column and get_columns. As a result of this problem create sends the original column values to the database, while update sends the encoded values." It seems that DBIx::Class uses the utf8 encoded text to check whether the record exists but subsequently creates the record with Perl's internal string representation, which might be ISO-8859-1. BTW, I double checked that the web scraper decodes the text properly and doesn't return binary strings. Actually, realizing that Perl stores the text internally in ISO-8859-1 took most of the time when debugging this problem. My test suite reads its test data from files instead of the web. When reading from files Perl stores the text internally in utf8, i.e. everything is fine when the test suite is run. The error occurs only in production when the text is retrieved from the web and internally stored in ISO-8859-1 by Perl. Show quoted text

> > Only after reading DBIx::Class::UTF8Columns

> > This module is to never be used in new code, as per the stern > warning in its documentation.

I do not use DBIx::Class::UTF8Columns. That's why I didn't read DBIx::Class::UTF8Columns at first. Show quoted text

> > I realized that there's a long standing bug in DBIx::Class and > > that I have to force Perl to represent the text internally in > > UTF-8 as create() will treat the text as a byte sequence. I now > > call decode_utf8(encode_utf8($text)) in order to ensure that the > > UTF-8 flag is set.

> > This is most definitely very very wrong. Please do get back with > more info about your original problem - you are papering over the > issue with this erroneous decode_utf8() call.

This call is fine; it forces Perl to represent the given text in utf8 internally. encode_utf8() converts the text, which might internally be stored in ISO-8859-1, into a binary string. decode_utf8() converts the binary string back into text, which is now internally stored in utf8. Show quoted text

> [...] I will keep this ticket open for some time, pending a reply > from you so we can diagnose the *real* issue you encountered.

The real issue is that the bug described in DBIx::Class::UTF8Columns not only affects programs that use DBIx::Class::UTF8Columns or DBIx::Class::ForceUTF8 but any program unless all text that is written to Unicode-aware databases is internally stored in utf8.

Fri Dec 27 04:32:12 2013 ribasushi [...] leporine.io - Correspondence added

On Fri Dec 27 03:35:32 2013, andreas@andreasvoegele.com wrote: Show quoted text

> On Fri Dec 27 01:05:04 2013, RIBASUSHI wrote:

> > Hi, sorry for the delayed reply > > > > On Sun Nov 24 04:09:11 2013, andreas@andreasvoegele.com wrote:

> > > [...] When writing to a properly configured MySQL database and > > > mysql_enable_utf8 set, I get an error from find_or_create().

> > > > Can you elaborate what that error was?

> > My program uses a web scraping module that retrieves text from the > web. The web scraper decodes the text properly but Perl stores the > data internally in ISO-8859-1. When writing text that contains > umlaut characters to the database I get the following error message: > > DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: > DBD::mysql::st execute failed: Duplicate entry 'Der Hobbit 2 - > Smaugs Ein' for key 'PRIMARY' [for Statement "INSERT INTO film ( > title) VALUES ( ? )" with ParamValues: 0='Der Hobbit 2 - Smaugs Ein > de (3D)'] at lib/MyApp/CinemaListings/Cinema.pm line 40 > > perlunifaq says that "the internal format is either ISO-8859-1 > (latin-1), or utf8, depending on the history of the string".

Unfortunately this doc is misleading (while technically correct). Read on for more info. Show quoted text

> > Accordings to DBIx::Class::UTF8Column "a bug was found deep in the > core of DBIx::Class which affects any component attempting to > perform encoding/decoding by overloading store_column and > get_columns. As a result of this problem create sends the original > column values to the database, while update sends the encoded > values." >

This is also technically correct, however note that overloading store_column is something no other module in the wild does (for the very same reason the doc was written). Show quoted text

> It seems that DBIx::Class uses the utf8 encoded text to check > whether the record exists but subsequently creates the record with > Perl's internal string representation, which might be ISO-8859-1.

In the case of vanilla DBIC this is not possible. You can crank up DBI_TRACE=2 to see the exact values the DBIC->DBI->DBD::mysql chain sends to your RDBMS. In fact I would like to see the output of that myself so we can figure out what is happening. Show quoted text

> BTW, I double checked that the web scraper decodes the text properly > and doesn't return binary strings. Actually, realizing that Perl > stores the text internally in ISO-8859-1 took most of the time when > debugging this problem. My test suite reads its test data from > files instead of the web. When reading from files Perl stores the > text internally in utf8, i.e. everything is fine when the test suite > is run. The error occurs only in production when the text is > retrieved from the web and internally stored in ISO-8859-1 by Perl.

Again - how the text is stored is not very relevant. How DBD::mysql interacts with strings based on its "describecolumn" intenral interface is what the culprit is. Again - will need to see that DBI_TRACE. Show quoted text

> > > Only after reading DBIx::Class::UTF8Columns

> > > > This module is to never be used in new code, as per the stern > > warning in its documentation.

> > I do not use DBIx::Class::UTF8Columns. That's why I didn't read > DBIx::Class::UTF8Columns at first. >

> > > I realized that there's a long standing bug in DBIx::Class and > > > that I have to force Perl to represent the text internally in > > > UTF-8 as create() will treat the text as a byte sequence. I now > > > call decode_utf8(encode_utf8($text)) in order to ensure that the > > > UTF-8 flag is set.

> > > > This is most definitely very very wrong. Please do get back with > > more info about your original problem - you are papering over the > > issue with this erroneous decode_utf8() call.

> > This call is fine; it forces Perl to represent the given text in > utf8 internally. encode_utf8() converts the text, which might > internally be stored in ISO-8859-1, into a binary string. > decode_utf8() converts the binary string back into text, which is > now internally stored in utf8.

Once again the "internal storage" part is not something that can affect you. The way DBD::mysql behaves does. Consider the output of this oneliner: perl -MDevel::Peek -MEncode=encode_utf8,decode_utf8 -e 'my $str = "foo"; warn "\nPlain ascii string\n"; Dump $str; $str = encode_utf8($str); warn "\nString encoded\n"; Dump $str; $str = decode_utf8($str); warn "\nString decoded\n"; Dump $str' Show quoted text

> > [...] I will keep this ticket open for some time, pending a reply > > from you so we can diagnose the *real* issue you encountered.

> > The real issue is that the bug described in DBIx::Class::UTF8Columns > not only affects programs that use DBIx::Class::UTF8Columns > or DBIx::Class::ForceUTF8 but any program unless all text that is > written to Unicode-aware databases is internally stored in utf8.

The bug described in UTF8Columns has *nothing* to do with unicode or anything of that sort. It has to do just and only with the case of a DBIC component overriding store_column, and expecting it to be called on both update and insert at equal points. *This* is what the text is about, nothing else (I happen to be the person who wrote it ;)

Fri Dec 27 07:11:56 2013 andreas [...] andreasvoegele.com - Correspondence added

From:

andreas [...] andreasvoegele.com

On Fri Dec 27 04:32:12 2013, RIBASUSHI wrote: Show quoted text

> On Fri Dec 27 03:35:32 2013, andreas@andreasvoegele.com wrote:

> > It seems that DBIx::Class uses the utf8 encoded text to check > > whether the record exists but subsequently creates the record > > with Perl's internal string representation, which might be > > ISO-8859-1.

> > In the case of vanilla DBIC this is not possible. You can crank up > DBI_TRACE=2 to see the exact values the DBIC->DBI->DBD::mysql > chain sends to your RDBMS. In fact I would like to see the output > of that myself so we can figure out what is happening.

I've attached two example Perl scripts and their output. One script is encoded in UTF-8 and the other one in ISO-8859-1. Both scripts insert the text "Grüß Gott" into a table. MySQL and MariaDB cut strings that contain invalid characters. Thus, the ISO-8859-1 version of the script only inserts the text "Gr". I see differences in the scripts' output when I search for "bind_param for DBD::mysql::st".

Subject:

test-dbix-class-utf8.output

Download test-dbix-class-utf8.output
application/octet-stream 9.2k

Message body not shown because it is not plain text.

Subject:

test-dbix-class-utf8.pl

#!/usr/bin/perl # mysql -h localhost -u root -p # CREATE DATABASE greetings CHARACTER SET utf8; # CREATE USER 'greetings'@'localhost' IDENTIFIED BY 'Cahphah3'; # GRANT INSERT,SELECT,UPDATE,DELETE ON greetings.* to 'greetings'@'localhost'; # USE greetings; # CREATE TABLE greetings (text varchar(255) PRIMARY KEY); # # Settings in /etc/my.cnf: # # [client] # default-character-set = utf8 # # [server] # collation-server = utf8_unicode_ci # init-connect='SET NAMES utf8' # character-set-server = utf8 use strict; use warnings; use utf8; my $greeting = 'GrÃ¼Ã Gott'; my $dsn = 'DBI:mysql:database=greetings;host=localhost'; my $user = 'greetings'; my $pw = 'Cahphah3'; my $args = { mysql_enable_utf8 => 1 }; package Greetings::Schema::Result::Greetings; use parent qw(DBIx::Class::Core); __PACKAGE__->table('greetings'); __PACKAGE__->add_columns( 'text' => { data_type => 'varchar', size => 255 }, ); __PACKAGE__->set_primary_key('text'); package Greetings::Schema; use parent qw(DBIx::Class::Schema); __PACKAGE__->register_class( 'Greetings', 'Greetings::Schema::Result::Greetings' ); package main; my $schema = Greetings::Schema->connect( $dsn, $user, $pw, $args ); for ( 1 .. 2 ) { $schema->resultset('Greetings')->find_or_create( { text => $greeting } ); }

Subject:

test-dbix-class-latin1.pl

#!/usr/bin/perl # mysql -h localhost -u root -p # CREATE DATABASE greetings CHARACTER SET utf8; # CREATE USER 'greetings'@'localhost' IDENTIFIED BY 'Cahphah3'; # GRANT INSERT,SELECT,UPDATE,DELETE ON greetings.* to 'greetings'@'localhost'; # USE greetings; # CREATE TABLE greetings (text varchar(255) PRIMARY KEY); # # Settings in /etc/my.cnf: # # [client] # default-character-set = utf8 # # [server] # collation-server = utf8_unicode_ci # init-connect='SET NAMES utf8' # character-set-server = utf8 use strict; use warnings; my $greeting = 'Grüß Gott'; my $dsn = 'DBI:mysql:database=greetings;host=localhost'; my $user = 'greetings'; my $pw = 'Cahphah3'; my $args = { mysql_enable_utf8 => 1 }; package Greetings::Schema::Result::Greetings; use parent qw(DBIx::Class::Core); __PACKAGE__->table('greetings'); __PACKAGE__->add_columns( 'text' => { data_type => 'varchar', size => 255 }, ); __PACKAGE__->set_primary_key('text'); package Greetings::Schema; use parent qw(DBIx::Class::Schema); __PACKAGE__->register_class( 'Greetings', 'Greetings::Schema::Result::Greetings' ); package main; my $schema = Greetings::Schema->connect( $dsn, $user, $pw, $args ); for ( 1 .. 2 ) { $schema->resultset('Greetings')->find_or_create( { text => $greeting } ); }

Subject:

test-dbix-class-latin1.output

Download test-dbix-class-latin1.output
application/octet-stream 14.7k

Message body not shown because it is not plain text.

Fri Dec 27 12:33:23 2013 ribasushi [...] leporine.io - Correspondence added

Subject:	Re: [rt.cpan.org #90744] Be more clear about the UTF8 flag
Date:	Fri, 27 Dec 2013 17:33:07 +0000
To:	Andreas Voegele via RT <bug-DBIx-Class [...] rt.cpan.org>
From:	Peter Rabbitson <ribasushi [...] cpan.org>

On Fri, Dec 27, 2013 at 07:11:59AM -0500, Andreas Voegele via RT wrote: Show quoted text

> Queue: DBIx-Class > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=90744 > > > On Fri Dec 27 04:32:12 2013, RIBASUSHI wrote:

> > On Fri Dec 27 03:35:32 2013, andreas@andreasvoegele.com wrote:

> > > It seems that DBIx::Class uses the utf8 encoded text to check > > > whether the record exists but subsequently creates the record > > > with Perl's internal string representation, which might be > > > ISO-8859-1.

> > > > In the case of vanilla DBIC this is not possible. You can crank up > > DBI_TRACE=2 to see the exact values the DBIC->DBI->DBD::mysql > > chain sends to your RDBMS. In fact I would like to see the output > > of that myself so we can figure out what is happening.

> > I've attached two example Perl scripts and their output. One script > is encoded in UTF-8 and the other one in ISO-8859-1. Both scripts > insert the text "Grüß Gott" into a table. > > MySQL and MariaDB cut strings that contain invalid characters. > Thus, the ISO-8859-1 version of the script only inserts the text > "Gr".

Aha, so I hope you see now that this is not a DBIC issue, and in fact this isn't a DBD issue either (DBD::mysql behaves quite properly). Attached is a script which demonstrates your problem much more concisely, and shows how to prevent the problem from occuring silently. Also note that DBIC could help you with that via the shortcut: https://metacpan.org/pod/DBIx::Class::Storage::DBI::mysql#set_strict_mode I am closing this ticket since the entire chain behaves just as expected, you simply didn't do the equivalent of perl's "use strict". Feel free to continue the discussion if you have remaining questions/concerns. Cheers

Message body is not shown because sender requested not to inline it.

Fri Dec 27 12:34:54 2013 ribasushi [...] leporine.io - Status changed from 'open' to 'resolved'