On Fri Dec 27 03:35:32 2013, andreas@andreasvoegele.com wrote:
> On Fri Dec 27 01:05:04 2013, RIBASUSHI wrote:
> > Hi, sorry for the delayed reply
> >
> > On Sun Nov 24 04:09:11 2013, andreas@andreasvoegele.com wrote:
> > > [...] When writing to a properly configured MySQL database and
> > > mysql_enable_utf8 set, I get an error from find_or_create().
> >
> > Can you elaborate what that error was?
>
> My program uses a web scraping module that retrieves text from the
> web. The web scraper decodes the text properly but Perl stores the
> data internally in ISO-8859-1. When writing text that contains
> umlaut characters to the database I get the following error message:
>
> DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception:
> DBD::mysql::st execute failed: Duplicate entry 'Der Hobbit 2 -
> Smaugs Ein' for key 'PRIMARY' [for Statement "INSERT INTO film (
> title) VALUES ( ? )" with ParamValues: 0='Der Hobbit 2 - Smaugs Ein
> de (3D)'] at lib/MyApp/CinemaListings/Cinema.pm line 40
>
> perlunifaq says that "the internal format is either ISO-8859-1
> (latin-1), or utf8, depending on the history of the string".
Unfortunately this doc is misleading (while technically correct). Read on for more info.
>
> According to DBIx::Class::UTF8Columns "a bug was found deep in the
> core of DBIx::Class which affects any component attempting to
> perform encoding/decoding by overloading store_column and
> get_columns. As a result of this problem create sends the original
> column values to the database, while update sends the encoded
> values."
>
This is also technically correct. Note, however, that overloading store_column is something no other module in the wild does (for the very same reason the doc was written).
> It seems that DBIx::Class uses the utf8 encoded text to check
> whether the record exists but subsequently creates the record with
> Perl's internal string representation, which might be ISO-8859-1.
In the case of vanilla DBIC this is not possible. You can crank up DBI_TRACE=2 to see the exact values the DBIC->DBI->DBD::mysql chain sends to your RDBMS. In fact I would like to see the output of that myself so we can figure out what is happening.
> BTW, I double checked that the web scraper decodes the text properly
> and doesn't return binary strings. Actually, realizing that Perl
> stores the text internally in ISO-8859-1 took most of the time when
> debugging this problem. My test suite reads its test data from
> files instead of the web. When reading from files Perl stores the
> text internally in utf8, i.e. everything is fine when the test suite
> is run. The error occurs only in production when the text is
> retrieved from the web and internally stored in ISO-8859-1 by Perl.
Again - how the text is stored is not very relevant. How DBD::mysql interacts with strings based on its "describecolumn" internal interface is the culprit. Again - I will need to see that DBI_TRACE.
>
> > > Only after reading DBIx::Class::UTF8Columns
> >
> > This module is to never be used in new code, as per the stern
> > warning in its documentation.
>
> I do not use DBIx::Class::UTF8Columns. That's why I didn't read
> DBIx::Class::UTF8Columns at first.
>
> > > I realized that there's a long standing bug in DBIx::Class and
> > > that I have to force Perl to represent the text internally in
> > > UTF-8 as create() will treat the text as a byte sequence. I now
> > > call decode_utf8(encode_utf8($text)) in order to ensure that the
> > > UTF-8 flag is set.
> >
> > This is most definitely very very wrong. Please do get back with
> > more info about your original problem - you are papering over the
> > issue with this erroneous decode_utf8() call.
>
> This call is fine; it forces Perl to represent the given text in
> utf8 internally. encode_utf8() converts the text, which might
> internally be stored in ISO-8859-1, into a binary string.
> decode_utf8() converts the binary string back into text, which is
> now internally stored in utf8.
Once again the "internal storage" part is not something that can affect you. The way DBD::mysql behaves does. Consider the output of this oneliner:
perl -MDevel::Peek -MEncode=encode_utf8,decode_utf8 -e 'my $str = "foo"; warn "\nPlain ascii string\n"; Dump $str; $str = encode_utf8($str); warn "\nString encoded\n"; Dump $str; $str = decode_utf8($str); warn "\nString decoded\n"; Dump $str'
> > [...] I will keep this ticket open for some time, pending a reply
> > from you so we can diagnose the *real* issue you encountered.
>
> The real issue is that the bug described in DBIx::Class::UTF8Columns
> not only affects programs that use DBIx::Class::UTF8Columns
> or DBIx::Class::ForceUTF8 but any program unless all text that is
> written to Unicode-aware databases is internally stored in utf8.
The bug described in UTF8Columns has *nothing* to do with Unicode or anything of that sort. It concerns only the case of a DBIC component overriding store_column and expecting it to be called at equivalent points on both update and insert. *This* is what the text is about, nothing else (I happen to be the person who wrote it ;)