Bug #87428 for DBD-mysql: data corruption: DBD::mysql ignores the utf8-flag

Mon Jul 29 22:45:01 2013 MLEHMANN [...] cpan.org - Ticket created

Subject:

data corruption: DBD::mysql ignores the utf8-flag

perl knows two internal encodings for strings - plain octets and utf-8. which encoding is used is indicated by the so-called utf8 flag. some strings can be encoded in both formats, and some strings can be encoded only in utf-8 (when they contain character codes >255).

mysql (at least in the protocol) cannot handle any characters >255. it can handle utf-8, but utf-8 contains only byte values, i.e. <= 255.

unfortunately, DBD::mysql doesn't understand the internal perl string encoding, and sometimes corrupts data.

here is an example string:

    my $str = "\xaf";

internally, this string can be encoded either as plain octets with the utf8 flag clear, or as a utf-8 string with the utf8 flag set.

when passed to mysql, e.g. to execute, DBD::mysql _ignores_ the utf8 flag, which corrupts the value, as the utf8 flag indicates how the in-memory bytes need to be interpreted, and mysql doesn't have this information anymore.

for example, when $str is internally utf8-encoded, mysql instead receives the string "\xc2\xaf", which is rather different.

since the string acts identically on the Perl level regardless of the utf8 flag (and indeed compares identically to itself regardless of the flag value), this is hard-to-debug action at a distance, as two strings that are identical to perl (compare the same, print the same etc.) are passed as two different strings by DBD::mysql.

the obvious fix is to downgrade scalars before passing them to mysql. this has two effects: 1. it ensures the correct data is always passed, regardless of the internal encoding and 2. it can warn the user when character codes >255 are used, which mysql cannot handle (the user would have to encode them to utf-8 first for example).

the reason why this is rarely a big issue is that perl currently avoids upgrading the scalar in many cases, and downgrades them when it thinks performance can be helped (for example, different versions of perl encode constant strings differently depending on whether "use utf8" is in use).
still, it cost me a few hours of debugging today, because I hit exactly that case, and couldn't believe that DBD::mysql hasn't been updated since the string model changed in 5.005 :/
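
The situation described in the report can be reproduced in a few lines. The following is a minimal sketch, not part of the original report: it takes the "\xaf" example, forces both internal representations with utf8::downgrade and utf8::upgrade, and uses bytes::length only to peek at the size of the internal buffer. The variable names are illustrative.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without enabling byte semantics

my $down = "\xaf";          # one character, code point 0xAF
my $up   = $down;
utf8::downgrade($down);     # internal form: the single octet AF
utf8::upgrade($up);         # internal form: the two octets C2 AF

# At the Perl level the two strings are indistinguishable.
my $same     = ($down eq $up) ? 1 : 0;
my $len_down = length $down;
my $len_up   = length $up;

# The internal buffers differ, and that is what a driver reading the
# raw buffer without checking the flag would send.
my $raw_down = bytes::length($down);
my $raw_up   = bytes::length($up);

printf "eq=%d length=%d/%d internal octets=%d/%d\n",
    $same, $len_down, $len_up, $raw_down, $raw_up;
```

Both strings compare equal and have length 1, yet their internal buffers are 1 and 2 octets long respectively.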

Mon Jul 29 22:47:46 2013 MLEHMANN [...] cpan.org - Correspondence added

a slight addendum: this is a bug in DBD::mysql and not in DBI, as some databases can handle data with character codes >255 (usually unicode), so it is up to the database driver to correctly encode the data for the database.

Wed Apr 02 01:32:18 2014 victor [...] vsespb.ru - Correspondence added

On Tue Jul 30 06:45:01 2013, MLEHMANN wrote:

> the obvious fix is to downgrade scalars before passing them to mysql.
> this has two effects: 1. it ensures the correct data is always passed,
> regardless of the internal encoding and 2. it can warn the user when
> character codes >255 are used, which mysql cannot handle (the user
> would have to encode them to utf-8 first for example).

This would break code which works with perl character strings and stores them in mysql (with the SET NAMES UTF8 option). You can argue that such code should be written with the DBI option mysql_enable_utf8=1 (and DBI/DBD should skip downgrading strings), but there would be the same problem with binary data - binary data should be downgraded, and DBI cannot distinguish binary data (for BLOB columns etc.) from character data (VARCHAR).

Wed Apr 02 01:32:19 2014 The RT System itself - Status changed from 'new' to 'open'

Wed Apr 02 14:05:21 2014 schmorp [...] schmorp.de - Correspondence added

CC:	MLEHMANN [...] cpan.org
Subject:	Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Date:	Wed, 2 Apr 2014 20:05:06 +0200
To:	Victor Efimov via RT <bug-DBD-mysql [...] rt.cpan.org>
From:	Marc Lehmann <schmorp [...] schmorp.de>

On Wed, Apr 02, 2014 at 01:32:19AM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:

> On Tue Jul 30 06:45:01 2013, MLEHMANN wrote:

> > the obvious fix is to downgrade scalars before passing them to mysql.
> > this has two effects: 1. it ensures the correct data is always passed,
> > regardless of the internal encoding and 2. it can warn the user when
> > character codes >255 are used, which mysql cannot handle (the user
> > would have to encode them to utf-8 first for example).

> > This would break code which works with perl character strings and stores it in mysql (with SET NAMES UTF8 option).

That is incorrect: such code would work fine as well with downgraded strings (utf-8 is a byte-encoding).

If you mean code that doesn't use utf-8, but unicode strings, then it's still incorrect: such code currently suffers from the reverse problem, i.e. sometimes data would be passed as binary or latin1, sometimes as utf-8. The solution for that would be always upgrading.

No matter how you turn it, DBD::mysql is simply broken w.r.t. perl strings, because it doesn't let the user choose the format.

> You can argue that such code should be written with DBI option
> mysql_enable_utf8=1 (and DBI/DBD should skip downgrading strings), but

No, this option has nothing to do with it - set names utf8 works fine with binary data (unless DBD::mysql is even more buggy), as utf8 is binary data.

The problem is indeed as I reported - DBD::mysql wasn't updated to the new string model in perl 5.6, and currently randomly corrupts data.

Since this is apparently a hard to understand problem, and I don't quite know which part is unclear, let me assure you I will be happy to explain how the perl string model works, how utf-8 works and so on, but I need some clues on where the misunderstanding sits.

As a primer, try to distinguish between Perl and C - in perl, strings are simply lists of characters, and since perl 5.6, these characters can have codes > 255.

Internally, as an optimisation, perl has two different and incompatible representations, utf-8 encoded and byte-encoded. Both forms can hold unicode and binary data(!), the utf8 flag _only_ changes how the character codes are represented, it doesn't change their interpretation.

On the Perl level, the flag value is essentially random, as semantics are not supposed to change depending on the utf-8 flag, and it's not specified when and how this flag changes value, so on the Perl level, you cannot reliably affect this flag except by version-specific and undocumented hackery.

A similar problem exists for numbers: perl doesn't distinguish between numbers and strings, so mysql has to guess (or the user has to specify a type, which is possible with bind_param).

What DBD::mysql currently does is to take perl strings and randomly either encode them in utf-8 or byte encoding, regardless of what the encoding of the string really is.

Fixing this might break some code that currently depends on undocumented and version-specific perl behaviour, but it enables writing code that no longer depends on such hacks. Right now, it's impossible to reliably pass binary (or utf-8) data to mysql(!) - the rules can (and do) change in every perl version.
-- The choice of a Deliantra, the free code+content MORPG -----==- _GNU_ http://www.deliantra.net ----==-- _ generation ---==---(_)__ __ ____ __ Marc Lehmann --==---/ / _ \/ // /\ \/ / schmorp@schmorp.de -=====/_/_//_/\_,_/ /_/\_\

Fri Apr 04 07:18:14 2014 victor [...] vsespb.ru - Correspondence added

RT-Send-CC:	schmorp [...] schmorp.de

On Wed Apr 02 22:05:21 2014, schmorp@schmorp.de wrote:

> On Wed, Apr 02, 2014 at 01:32:19AM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:

> > On Tue Jul 30 06:45:01 2013, MLEHMANN wrote:

> > > the obvious fix is to downgrade scalars before passing them to mysql.
> > > this has two effects: 1. it ensures the correct data is always passed,
> > > regardless of the internal encoding and 2. it can warn the user when
> > > character codes >255 are used, which mysql cannot handle (the user
> > > would have to encode them to utf-8 first for example).

> > This would break code which works with perl character strings and
> > stores it in mysql (with SET NAMES UTF8 option).

[cut]

> If you mean code that doesn't use utf-8, but unicode strings, then it's
> still incorrect: such code currently suffers from the reverse problem,
> i.e. sometimes data would be passed as binary or latin1, sometimes as
> utf-8. The solution for that would be always upgrading.

Yes, I meant character strings (unicode strings). I said that it would break existing code, and this is correct. We have such code now, it works fine because downgraded unicode strings are rare and because we use it for Russian text (which cannot be downgraded). So I would consider it broken in rare cases. But your proposal will break it in _all_ cases.

> No matter how you turn it, DBD::mysql is simply broken w.r.t. perl
> strings, because it doesn't let the user choose the format.

I agree - it's broken on the API level. It should have a different API where users can specify which strings are binary and which are character strings.

> > You can argue that such code should be written with DBI option
> > mysql_enable_utf8=1 (and DBI/DBD should skip downgrading strings), but

> No, this option has nothing to do with it - set names utf8 works fine with
> binary data (unless DBD::mysql is even more buggy), as utf8 is binary data.

Yes, right. I think you misunderstand me - actually I meant that you _could_ suggest a solution where people should not use unicode character strings without mysql_enable_utf8=1 (and this would make your proposal for downgrading strings valid when mysql_enable_utf8=0), and I explained why this would not help either - that's because even in mysql_enable_utf8=1 mode there will be binary data for binary columns that should not be upgraded.

> know which part is unclear, let me assure you I will be happy to explain how
> the perl string model works, how utf-8 works and so on, but I need some clues

No, thank you, I think I already know how it works. Also FYI I am not the maintainer of this module.

Fri Apr 04 07:31:55 2014 victor [...] vsespb.ru - Correspondence added

On Tue Jul 30 06:45:01 2013, MLEHMANN wrote:

Probably I missed that part - "(the user would have to encode them to utf-8 first for example)" - that would work, but that would be too much code to encode each character string to utf8 before passing it to DBI, plus additional performance costs. Also a function to encode a string could ensure the encoded string is returned in downgraded form, so there is nothing to fix in DBI - the user can implement and use such a function himself (and another one to ensure binary strings are downgraded).

Sat Apr 05 15:53:21 2014 schmorp [...] schmorp.de - Correspondence added

Subject:	Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Date:	Sat, 5 Apr 2014 21:53:08 +0200
To:	Victor Efimov via RT <bug-DBD-mysql [...] rt.cpan.org>
From:	Marc Lehmann <schmorp [...] schmorp.de>

On Fri, Apr 04, 2014 at 07:18:15AM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:

> Yes, I meant character strings (unicode strings). I said that it would break existing code, and this is correct.

It's not, and nothing you say indicates otherwise.

> We have such code now, it works fine

It works fine by accident only. It might not work with older or newer perl versions, because it relies on undocumented behaviour inside the perl interpreter which can and does change in different versions.

> because downgraded unicode strings are rare

They are rare because you are lucky - but what happens when you hit that rare case? Does your code that works fine still work fine in these rare cases?

> and because we use it for Russian text (which cannot be downgraded).

Russian text can easily be downgraded, for example when it's encoded in utf-8, as required by mysql.

> So I would consider it broken in rare cases.

The key is that the code in question already is broken, even if you are lucky and it works except in rare cases.

> But your proposal will break it in _all_ cases.

Not sure, but possible. The key, again, is that the change would allow one to fix broken code such as yours. Right now, the best you can achieve is code that happens to work "most of the time".

So your proposal is to keep a bug that makes it impossible to write correct and working code, because it makes already broken code fail deterministically. I would say that's a ridiculous proposal. Why would anybody want guaranteed brokenness?

Even you admit that your code already *is* broken. And so is my own code. And there is no way to fix either until DBD::mysql is fixed.

I can try various workarounds such as utf8::downgrade or upgrade, but that doesn't fix the code, it only makes it work with my current perl binary.

> > No matter how you turn it, DBD::mysql is simply broken w.r.t. perl
> > strings, because it doesn't let the user choose the format.

> > I agree - it's broken on API level. It should have different API where users can specify where is binary string and where is character string.

Either that, or it should simply offer the same API as mysql, namely use the same encoding as the underlying c lib, just as basically any other library does on the planet (compare Compress::Zlib for example, which doesn't have this bug, and also doesn't require extra specification of whether something is a text string or not).

I think whoever implemented this utf-8 stuff in DBD::mysql was simply confused - utf-8 strings aren't unicode strings.

Fortunately, this is not a situation that created a backwards compatibility problem, because the behaviour isn't deterministic, but effectively random.

> > binary data (unless DBD::mysql is even more buggy), as utf8 is binary

> Yes, right. I think you misunderstand me - actually I meant that you
> _could_ suggest a solution that people should not use unicode character
> strings without mysql_enable_utf8=1 (and this will make your proposal for
> downgrading strings valid when mysql_enable_utf8=0), and I explained why
> this would not help either - that's because even in mysql_enable_utf8=1
> mode there will be binary data for binary columns that should not be
> upgraded.

The documentation of mysql_enable_utf8 says "turning on this flag tells MySQL that incoming data should be treated as UTF-8".

I don't know what the option does (apparently, it doesn't treat anything as utf-8 with this flag, right?), but as documented, yes, it's quite obvious that you can't pass in generic binary data anymore.

(In fact, I suspect when you pass in utf-8 data as expected, it will be double-encoded, which would introduce pretty obvious data corruption).

Of course, this option is marked as experimental (in my copy at least), so one shouldn't be surprised if a bug is found and fixed.

In any case, I don't see what mysql_enable_utf8 has to do with anything, it's clearly a useless option unless all your data is unicode (or utf-8?), and even has the potential to corrupt data even more (what happens when I pass data to a binary column and retrieve it, will it double- or even triple-encode the data in some cases? As the documentation stands, it seems that is the case).

> > know which part is unclear, let me assure you I will be happy to explain how
> > the perl string model works, how utf-8 works and so on, but I need some clues

> > No, thank you, I think I already know how it works.

It looks to me as if you keep confusing unicode and utf-8 strings. They are different in Perl.

> Also FYI I am not maintainer of this module.

I know, but the maintainer of this module could be confused by your wrong comments, so it's good to clear up the situation. Summary: your code is broken, and so is mine. You might not understand it yet, but you are suffering from this very bug, just in reverse. If this bug were fixed, we both could fix our code.

Sat Apr 05 16:01:33 2014 victor [...] vsespb.ru - Correspondence added

RT-Send-CC:	schmorp [...] schmorp.de

> > and because we use it for Russian text (which cannot be downgraded).

> Russian text can easily be downgraded, for example when it's encoded in
> utf-8, as required by mysql.

As I said, I meant unicode character strings. By "downgraded" I mean "utf8::downgrade". So Russian text cannot be utf8::downgrade'd, because all characters are above 255. So perl character strings with Russian letters always have the UTF-8 flag on.
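
The two senses of "downgraded" in this exchange can be told apart with a short experiment. This is an illustrative sketch, not code from either participant: it tries utf8::downgrade (with the fail-ok argument) first on a Cyrillic character string and then on the same text after it has been encoded to UTF-8 octets with utf8::encode. The variable names are made up for the example.

```perl
use strict;
use warnings;

# Three Cyrillic letters as a Perl character string (code points > 255).
my $chars = "\x{43f}\x{440}\x{438}";

# Case 1: the character string itself cannot be downgraded.
my $copy     = $chars;
my $ok_chars = utf8::downgrade($copy, 1) ? 1 : 0;    # second argument: fail quietly

# Case 2: the UTF-8 encoded form is an ordinary octet string and can.
my $octets = $chars;
utf8::encode($octets);                                # in place: characters -> UTF-8 octets
my $ok_octets  = utf8::downgrade($octets, 1) ? 1 : 0;
my $octet_count = length $octets;

print "character string downgradable: $ok_chars\n";
print "utf-8 octets downgradable:     $ok_octets ($octet_count octets)\n";
```

The character string fails to downgrade, while its UTF-8 encoding (6 octets here) downgrades without trouble, which is the distinction both posts are circling.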

Sat Apr 05 16:04:37 2014 schmorp [...] schmorp.de - Correspondence added

CC:	MLEHMANN [...] cpan.org
Subject:	Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Date:	Sat, 5 Apr 2014 22:04:25 +0200
To:	Victor Efimov via RT <bug-DBD-mysql [...] rt.cpan.org>
From:	Marc Lehmann <schmorp [...] schmorp.de>

On Fri, Apr 04, 2014 at 07:31:56AM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:

> Probably I missed that part - "(the user would have to encode them to
> utf-8 first for example)" - that would work, but that would too much
> code to encode each character strings

Any evidence for that claim? I don't think there is.

> additionals performance costs.

The data needs to be transformed either inside or outside DBD::mysql, and somehow the encoding must be specified anyways, so this would not incur any additional performance costs (but see below). The only additional costs are the code that makes the program correct, which is a required component, not something that could be optimised away.

> Also a function to encode string could ensure encoded string returned in
> downgraded form, so there is nothing to fix in DBI

I am not sure I understand that, but a DBD::mysql that force-accepts only utf-8 with one option, and otherwise just passes through strings unchanged would work fine for me (I would simply disable the option and use utf-8 for text, and would never run into a problem). An option to allow and return unicode strings for everything "non-numerical" would probably be of more use overall, as many databases are non-binary and then it would make it convenient to use unicode strings in perl where mysql expects utf-8, and vice versa. (The numericalness can already be specified, and has to, as DBD::mysql also cannot guess, so one cannot write correct code without specifying it).

> user can implement and use such function by himself (and another one to
> ensure binary strings are downgraded).

AFAIK, there is no way to do that in Perl. The only way to do that reliably would be in XS code inside the module that uses it, which means it *has* to be in DBD::mysql (the user cannot implement this on her own). You are probably thinking of utf8::upgrade/downgrade or the like, but these obviously cannot be used to implement this. Their only use is to work around broken libraries such as DBD::mysql while keeping your fingers crossed that the next version of perl might not break your fix.

Sat Apr 05 16:08:48 2014 schmorp [...] schmorp.de - Correspondence added

Subject:	Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Date:	Sat, 5 Apr 2014 22:08:36 +0200
To:	Victor Efimov via RT <bug-DBD-mysql [...] rt.cpan.org>
From:	Marc Lehmann <schmorp [...] schmorp.de>

On Sat, Apr 05, 2014 at 04:01:33PM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:

> > utf-8, as required by mysql.

> > As I told, I meant unicode character strings. By "downgraded" I mean "utf8::downgrade". So Russian text cannot be utf8::downgrade'd, because all characters are above 255. So perl character strings with Russian letters are always with UTF-8 flag on.

Thanks for the clarification, I understand now what you meant to convey. In your original mail, you didn't say what you refer to.

Sat Apr 05 16:23:28 2014 victor [...] vsespb.ru - Correspondence added

RT-Send-CC:	schmorp [...] schmorp.de

On Sat Apr 05 23:53:21 2014, schmorp@schmorp.de wrote:

> The documentation of mysql_enable_utf8 says "turning on this flag tells MySQL
> that incoming data should be treated as UTF-8".
>
> I don't know what the option does (apparently, it doesn't treat anything as
> utf-8 with this flag, right?), but as documented, yes, it's quite obvious
> that you can't pass in generic binary data anymore.
>
> (In fact, I suspect when you pass in utf-8 data as expected, it will be
> double-encoded, which would introduce pretty obvious data corruption).

Let's see again what the docs say:

===
When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string
===

that's correct. you get perl character strings when reading data from mysql (except for binary columns).

===
Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect.
===

That means:

1) it just issues the "SET NAMES utf8" command. That's all. Nothing more.

2) It tells MySQL (the mysql server daemon process, not the DBD::mysql library) that data is in UTF-8. If we talk about things on the MySQL daemon side, there are no "character strings", "binary strings" etc., no confusion between perl character strings with the utf8 flag and data encoded in utf-8 (usually without the flag). so "UTF-8" here means just what it means in the MySQL documentation. It's implemented via the "SET NAMES utf8" command (see (1)).

Sat Apr 05 16:50:13 2014 schmorp [...] schmorp.de - Correspondence added

Subject:	Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Date:	Sat, 5 Apr 2014 22:49:56 +0200
To:	Victor Efimov via RT <bug-DBD-mysql [...] rt.cpan.org>
From:	Marc Lehmann <schmorp [...] schmorp.de>

On Sat, Apr 05, 2014 at 04:23:29PM -0400, Victor Efimov via RT <bug-DBD-mysql@rt.cpan.org> wrote:

> When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string
> ===
>
> that's correct. you get perl character strings, when reading data from mysql. (except binary columns)

What does "necessary" mean? If it means that the utf-8 flag is turned on if the mysql string contains characters > 255, it would be correct. This could be done if mysql ensures that everything is utf-8 encoded, in which case blindly setting the utf-8 flag would work. I don't know enough about libmysqlclient and mysqld to know what really happens, but I wouldn't rely on this meaning something correct, given that DBD::mysql is *known* to have a broken implementation.

> Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect.
>
> 1) it just issues "SET NAMES utf8" command. That's all. Nothing more.

That's not what it says. It says if you turn on this flag after connect, then you need to issue the set names utf8 command.

> 2) It tells MySQL (mysql server daemon process, not DBD::mysql library), that data is in UTF-8.

Do you have evidence for this? The official mysql docs say this only indicates the encoding used for the sql statement, not the embedded data (which is usually interpolated, but does not have to be so). This also makes sense - numbers are typically passed as strings in protocol, but still stay numbers (not utf-8 encoded data) when the statement is interpreted.

> If we talk about things on MySQL daemon side, there are no "character
> strings" "binary strings" etc, no confusion between perl character
> strings with utf8 flag and data encoded in utf-8 (usually without flag).

The MySQL daemon certainly distinguishes between character strings and binary! "char" and "binary" are data types and treated differently in mysql. binary strings compare differently than character strings for example.

What it doesn't do is to distinguish between unicode and non-unicode in the protocol, and that is exactly the problem - DBD::mysql either should not attempt to distinguish, or should have a _deterministic_ algorithm.

Right now, DBD::mysql sometimes utf-8 encodes data, sometimes not, for the *same* strings on the Perl level. This is simply a bug - no matter what *we* think DBD::mysql _should_ do, it doesn't do it _right now_, because there is no deterministic way to influence it from the Perl level.

As I have pointed out before, and as you chose to ignore: if you disagree, tell me a deterministic way to get binary data in mysql, which works in previous, current, and future versions (as long as perl works as documented).

That your program (and now also my program) happens to work with the version of perl we employ is meaningless. I want a way that works correctly, even in future versions of Perl.

Also, having to downgrade or upgrade every string before passing it to mysql is clearly something you don't want to do, but is currently necessary as a bug workaround.

Again, you are suffering from the same bug right now, you just don't realise it yet. All the drawbacks of the workarounds you think have to be employed for a fix already have to be employed. If DBD::mysql were fixed instead, most of these hacks wouldn't be required.

> so "UTF-8" here means just what it means in MySQL documentation. It's
> implemented via "SET NAMES utf8" command (see (1))

"just" is a weasel word. As we have just seen, mysql documentation disagrees with you, so it apparently isn't that simple :)

Sat Apr 05 16:53:47 2014 schmorp [...] schmorp.de - Correspondence added

Subject:	Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Date:	Sat, 5 Apr 2014 22:53:36 +0200
To:	Victor Efimov via RT <bug-DBD-mysql [...] rt.cpan.org>
From:	Marc Lehmann <schmorp [...] schmorp.de>

> This also makes sense - numbers are typically passed as strings in
> protocol, but still stay numbers (not utf-8 encoded data) when the
> statement is interpreted.

What I forgot to mention, btw., is that, while the protocol distinguishes between text (MYSQL_TYPE_STRING) and binary (MYSQL_TYPE_BLOB), this doesn't apply if values are interpolated, which is still, afaik, the default way of how DBD::mysql operates.

Sat Apr 05 18:37:02 2014 victor [...] vsespb.ru - Correspondence added

On Sun Apr 06 00:50:13 2014, schmorp@schmorp.de wrote:

> As I have pointed out before, and as you chose to ignore: if you disagree,
> tell me a deterministic way to get binary data in mysql, which works
> in previous, current, and future versions (as long as perl works as
> documented).
>
> That your program (and now also my program) happens to work with the
> version of perl we employ is meaningless. I want a way that works
> correctly, even in future versions of Perl.

1) new flag (let's say "mysql_enable_unicode") which turns on the new API. without that flag everything works the old way (let's call it the "old DBI API").

2) when sending data to DBI:
- scalars are treated as character strings, thus utf8::upgrade'd before processing by the old DBI API.
- new exported function "binary()". binary($scalar) will return a blessed object which contains a reference to the scalar. when this object is sent to DBI, DBI will detect the object and the scalar will be utf8::downgrade'd before processing by the old DBI API.

3) when reading data from DBI, like now with the mysql_enable_utf8 flag:
- "SET NAMES utf8" is issued.
- data retrieved from a textual column type (char, varchar, etc.) is returned as a character string.
- data from a binary column is returned as a binary string.
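
The sending half of this proposal can be sketched in plain Perl. Everything below is hypothetical and only illustrates the idea: the My::Binary class, the binary() marker and the normalise_param() step are invented names and do not exist in DBI or DBD::mysql, and a real implementation would sit inside the driver rather than in user code.

```perl
use strict;
use warnings;

# Hypothetical wrapper class that marks a scalar as binary data.
package My::Binary;

sub new {
    my ($class, $scalar_ref) = @_;
    return bless { ref => $scalar_ref }, $class;
}

package main;

# Hypothetical marker function, named after the proposal's binary().
sub binary { return My::Binary->new(\$_[0]) }

# Hypothetical normalisation a driver could apply to each bind value.
sub normalise_param {
    my ($param) = @_;
    if (ref($param) eq 'My::Binary') {
        my $octets = ${ $param->{ref} };    # work on a copy
        utf8::downgrade($octets);           # croaks if a code point > 255 is present
        return $octets;
    }
    my $chars = $param;
    utf8::upgrade($chars);                  # plain scalars count as character strings
    return $chars;
}

my $blob = "\xff\x00";
utf8::upgrade($blob);                       # simulate binary data that got upgraded along the way

my $sent_blob = normalise_param(binary($blob));
my $sent_text = normalise_param("caf\xe9");
```

With this in place the wrapped value always leaves in byte form and the unwrapped one always in upgraded form, independent of how each scalar happened to be stored beforehand.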

Thu Oct 01 19:28:13 2015 DBOOK [...] cpan.org - Correspondence added

This is still a problem. For example, Spreadsheet::ParseExcel tends to return strings which are not utf8 upgraded, so passing them directly to DBD::mysql with mysql_enable_utf8 enabled results in collation conflicts (Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) ...). utf8::upgrade on every string being passed "solves" the issue, but this shouldn't be needed.
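
The workaround mentioned here can be wrapped in a small helper. This is only a sketch of the stopgap, not a recommended fix: upgraded_copies() is a made-up name, and the execute() call is shown as a comment because it assumes a statement handle that this example does not create.

```perl
use strict;
use warnings;

# Hypothetical helper: return upgraded copies of bind values,
# leaving undef (SQL NULL) and the caller's own scalars untouched.
sub upgraded_copies {
    return map {
        my $copy = $_;
        utf8::upgrade($copy) if defined $copy;
        $copy;
    } @_;
}

my @row  = ("caf\xe9", undef, "plain");
my @bind = upgraded_copies(@row);

# $sth->execute(@bind);    # $sth is an assumed, already prepared statement handle
```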

Mon Oct 05 22:40:53 2015 patg [...] patg.net - Correspondence added

Subject:	Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Date:	Mon, 5 Oct 2015 22:40:29 -0400
To:	bug-DBD-mysql [...] rt.cpan.org
From:	Patrick Galbraith <patg [...] patg.net>

thank you for the report! I will look at the driver and see what is needed to make this not require having to upgrade every string explicitly.

> On Oct 1, 2015, at 7:28 PM, Dan Book via RT <bug-DBD-mysql@rt.cpan.org> wrote:
>
> Queue: DBD-mysql
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=87428 >
>
> This is still a problem. For example, Spreadsheet::ParseExcel tends to return strings which are not utf8 upgraded, so passing them directly to DBD::mysql with mysql_enable_utf8 enabled results in collation conflicts (Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) ...). utf8::upgrade on every string being passed "solves" the issue, but this shouldn't be needed.

Mon Oct 05 22:47:11 2015 CAPTTOFU [...] cpan.org - Taken

Mon Oct 05 22:49:07 2015 CAPTTOFU [...] cpan.org - Correspondence added

This is an old bug and I'd like to fix it. I'm not a collation expert, so I will need to look at the other drivers to see what they do about this. Sorry for the ticket rot.

On Mon Oct 05 22:40:53 2015, patg@patg.net wrote:

> thank you for the report! I will look at the driver and see what is
> needed to make this not require having to upgrade every string
> explicitly.

> > On Oct 1, 2015, at 7:28 PM, Dan Book via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> >
> > Queue: DBD-mysql
> > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=87428 >
> >
> > This is still a problem. For example, Spreadsheet::ParseExcel tends
> > to return strings which are not utf8 upgraded, so passing them
> > directly to DBD::mysql with mysql_enable_utf8 enabled results in
> > collation conflicts (Illegal mix of collations
> > (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) ...).
> > utf8::upgrade on every string being passed "solves" the issue, but
> > this shouldn't be needed.

Thu Oct 08 13:58:12 2015 DBOOK [...] cpan.org - Cc DBOOK added

Sat Oct 22 11:15:09 2016 pali [...] cpan.org - Requestor PALI added

Sat Oct 22 11:15:38 2016 pali [...] cpan.org - Correspondence added

A fix for UTF-8 support in DBD::mysql is in my pull request: https://github.com/perl5-dbi/DBD-mysql/pull/67

I would like it if more people affected by UTF-8 bugs in DBD::mysql could test my changes...

Sun Oct 30 16:49:16 2016 schmorp [...] schmorp.de - Correspondence added

CC:	pali [...] cpan.org, DBOOK [...] cpan.org
Subject:	Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Date:	Sun, 30 Oct 2016 21:49:03 +0100
To:	Pali via RT <bug-DBD-mysql [...] rt.cpan.org>
From:	Marc Lehmann <schmorp [...] schmorp.de>

On Sat, Oct 22, 2016 at 11:15:40AM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=87428 >
>
> Fix for UTF-8 support in DBD::mysql is in my pull request: https://github.com/perl5-dbi/DBD-mysql/pull/67
> I would like if more people affected by UTF-8 bugs in DBD::mysql could test my changes...

Thanks for looking into this - I only had a cursory look into the patch, and it seems it is wrong in the "other" direction now:

    + else if (is_binary && SvUTF8(ph->value))
    +   warn("UTF-8 encoded binary field %d", i);

The UTF8 flag on a scalar does NOT mean the scalar is UTF-8 encoded - the scalars "\xfc" (no utf8 flag) and "\xc3\xbc" (with utf8 flag) are the same string, and in binary both encode the octet 0xfc. Emitting a warning is wrong here, and the message is wrong as well (scalars have no encoding information on the Perl level).

The patch thus requires the same workarounds for binary data now that were needed for utf-8 before - that's the "wrong in the other direction".

Basically, when utf-8 encoded data is wanted, then SvPVutf8 is the correct function, while SvPVbyte is the right function for binary data - the patch only gets the utf-8 case right (with some optimisations).

I can't see whether this is intended or not - calling str_is_nonascii on an utf-8 encoded scalar doesn't seem to make much sense to me (binary data is 8 bit wide, not 7 bit). On the other hand, this seems to be in the patch multiple times.

Don't have time to try it out, and maybe I am overlooking something - again, this is just a quick scan of the patch really. However, the only way to succeed, IMHO, is to get rid of the idea of detecting or guessing encoding from perl scalars - the UTF8 flag _never_ indicates that the string data is utf-8 encoded.

--
The choice of a GNU generation
Marc Lehmann <schmorp@schmorp.de>
Deliantra, the free code+content MORPG - http://www.deliantra.net
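The point about "\xfc" with and without the flag can be checked from pure Perl. The sketch below is only an illustration: `as_bytes` and `as_utf8` are hypothetical helpers that approximate, at the Perl level, what SvPVbyte and SvPVutf8 hand to C code. They are not the driver's implementation.

```perl
use strict;
use warnings;

# Two copies of the same one-character string (character code 0xFC).
my $down = "\xfc";
my $up   = "\xfc";
utf8::downgrade($down);   # stored as a single octet, flag clear
utf8::upgrade($up);       # stored in perl's multibyte form, flag set

# Hypothetical helper, roughly SvPVbyte: the octets of a downgraded copy.
sub as_bytes {
    my ($value) = @_;
    utf8::downgrade($value);   # dies if $value holds a character > 255
    return $value;
}

# Hypothetical helper, roughly SvPVutf8: the UTF-8 encoding of the characters.
sub as_utf8 {
    my ($value) = @_;
    utf8::encode($value);      # in place, on the copy; clears the flag
    return $value;
}

printf "equal on the Perl level: %s\n", $down eq $up ? 'yes' : 'no';
printf "flags: down=%d up=%d\n",
    utf8::is_utf8($down) ? 1 : 0, utf8::is_utf8($up) ? 1 : 0;
printf "as_bytes: %s / %s\n",
    unpack('H*', as_bytes($down)), unpack('H*', as_bytes($up));
printf "as_utf8:  %s / %s\n",
    unpack('H*', as_utf8($down)), unpack('H*', as_utf8($up));
```

Both helpers give the same octets for both scalars, regardless of the flag. That is the property the message asks the driver to have.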

Sun Oct 30 18:26:37 2016 pali [...] cpan.org - Correspondence added

On Sun Oct 30 16:49:16 2016, schmorp@schmorp.de wrote:

> On Sat, Oct 22, 2016 at 11:15:40AM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > <URL: https://rt.cpan.org/Ticket/Display.html?id=87428 >
> >
> > Fix for UTF-8 support in DBD::mysql is in my pull request: https://github.com/perl5-dbi/DBD-mysql/pull/67
> > I would like if more people affected by UTF-8 bugs in DBD::mysql could test my changes...
>
> Thanks for looking into this - I only had a cursory look into the patch, and it seems it is wrong in the "other" direction now:
>
>     + else if (is_binary && SvUTF8(ph->value))
>     +   warn("UTF-8 encoded binary field %d", i);
>
> The UTF8 flag on a scalar does NOT mean the scalar is UTF-8 encoded - the scalars "\xfc" (no utf8 flag) and "\xc3\xbc" (with utf8 flag) are the same string, and in binary both encode the octet 0xfc. Emitting a warning is wrong here, and the message is wrong as well (scalars have no encoding information on the Perl level).

The UTF8 flag tells whether the internal representation of the PV in a scalar is stored as utf8 or not. I was thinking that "\xc3\xbc" with the utf8 flag is not meant to be binary anymore, as it is internally stored as utf8. If you produce binary data whose internal representation is utf8, then I think there is some problem...

> The patch thus requires the same workarounds needed for utf-8 for binary data now - that's the "wrong in the other direction".

I will think about it... But the pack/unpack/vec/... functions also work on the string "\xc3\xbc" with the utf8 flag the same way as on "\xfc" without the utf8 flag...

> Basically, when utf-8 encoded data is wanted, then SvPVutf8 is the correct function, while SvPVbyte is the right function for binary data - the patch only gets the utf-8 case right (with some optimisations).
>
> I can't see whether this is intended or not - calling str_is_nonascii on an utf-8 encoded scalar doesn't seem to make much sense to me (binary data is 8 bit wide, not 7 bit). On the other hand, this seems to be in the patch multiple times.

This is just an optimization. SvPV returns the data buffer either utf8 encoded or byte (latin1) encoded, depending on the SvUTF8 flag. But plain ASCII data is the same in both encodings, so SvPVbyte and SvPVutf8 return exactly the same data in that case. Checking str_is_nonascii is just an optimization to decide whether SvPVutf8 really needs to be called...

> Don't have time to try it out, and maybe I am overlooking something - again, this is just a quick scan of the patch really. However, the only way to succeed, IMHO, is to get rid of the idea of detecting or guessing encoding from perl scalars - the UTF8 flag _never_ indicates that the string data is utf-8 encoded.

For the char* value retrieved by an SvPV() call, the UTF8 flag really indicates whether that char* value is utf8 encoded or not. But you are right that it does not tell whether a perl scalar accessed by pure perl functions is utf8 encoded or is natively in perl. All such guessing is the wrong way. The driver should get either a binary scalar or a string scalar.

Sun Oct 30 19:20:48 2016 schmorp [...] schmorp.de - Correspondence added

CC:	MLEHMANN [...] cpan.org, pali [...] cpan.org, DBOOK [...] cpan.org
Subject:	Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Date:	Mon, 31 Oct 2016 00:20:32 +0100
To:	Pali via RT <bug-DBD-mysql [...] rt.cpan.org>
From:	Marc Lehmann <schmorp [...] schmorp.de>

On Sun, Oct 30, 2016 at 06:26:38PM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:

> > wrong here, and the message is wrong as well (scalars have no encoding information on the Perl level).
>
> UTF8 flag tells if internal representation of PV in scalar is stored in utf8 or not. I was thinking that "\xc3\xbc" with utf8 flag is not mean to be binary anymore as it is internally stored as utf8. If you produce binary data which have internal representation in utf8 then I think there is some problem...

The problem is mysql not correctly interpreting that flag, and your patch doesn't make it better because it fails (similarly to the original DBD::mysql) to implement the flag as defined by perl itself.

It might be a problem (I don't think it is), but that's how perl currently works, and as long as DBD::mysql doesn't handle it as intended, it will be buggy.

> > The patch thus requires the same workarounds needed for utf-8 for binary data now - that's the "wrong in the other direction".
>
> I will think about it... But pack/unpack/vec/... functions works also on string "\xc3\xbc" with utf8 flag same as on "\xfc" without utf8 flag...

The "but" is weird, because your patch doesn't do that, unlike pack/unpack (at least they got it right after I fixed them). The key here is to understand that your patch doesn't work on these two strings the same way, even though it should.

> > I can't see whether this is intended or not - calling str_is_nonascii on an utf-8 encoded scalar doesn't seem to make much sense to me (binary data is 8 bit wide, not 7 bit). On the other hand, this seems to be in the patch multiple times.
>
> This is just optimization. SvPV returns data buffer in utf8 encoded or byte (latin1) encoded based on SvUTF8 flag. But plain ASCII data are same in both those encodings, so both functions SvPVbyte and SvPVutf8 returns exactly same data in that case. Checking str_is_nonascii is just optimization if SvPVutf8 is really needed to call...

Are you really telling me that issuing a warning is some kind of optimisation? Because that's what the patch does after testing str_is_nonascii. That doesn't look like an optimisation to me, in fact, it is a bug :)

> > Don't have time to try it out, and maybe I am overlooking something - again, this is just a quick scan of the patch really. However, the only way to succeed, IMHO, is to get rid of the idea of detecting or guessing encoding from perl scalars - the UTF8 flag _never_ indicates that the string data is utf-8 encoded.
>
> For char* value retrieved by SvPV() call, UTF8 flag really indicates if that char* value is utf8 encoded or not.

Unfortunately no - the UTF8 flag merely indicates how the perl codepoints are stored, it doesn't say anything about whether the char * is utf8 encoded or not. In general, utf-8 encoded SVs do _not_ have the UTF8 flag set (but they might).

> But you are right that it does not tell if perl scalar accessed by pure perl functions are utf8 encoded or are nativelly in perl.

Maybe you mean the right thing, but the patch is wrong and your explanations are as well.

The UTF8 flag business in perl is really messy, and I wish it wasn't called "UTF8", but it really doesn't tell you anything about character encoding or whether the scalar is text or binary, it only tells you how the codepoints are stored (namely either as plain octets or in a format similar to utf-8 encoding, without being utf-8).

> All such guessing is wrong way. Driver should get either binary scalar or string scalar.

Exactly, the driver should handle binary and text correctly - your patch seems to go a long way towards handling text correctly. It would just be nice if it wouldn't break the binary case even more :)

Greetings,

Sun Oct 30 19:45:38 2016 pali [...] cpan.org - Correspondence added

On Sun Oct 30 19:20:48 2016, schmorp@schmorp.de wrote:

> On Sun, Oct 30, 2016 at 06:26:38PM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > > wrong here, and the message is wrong as well (scalars have no encoding information on the Perl level).
> >
> > UTF8 flag tells if internal representation of PV in scalar is stored in utf8 or not. I was thinking that "\xc3\xbc" with utf8 flag is not mean to be binary anymore as it is internally stored as utf8. If you produce binary data which have internal representation in utf8 then I think there is some problem...
>
> The problem is mysql not correctly interpreting that flag, and your patch doesn't make it better because it fails (similarly to the original DBD::mysql) to implement the flag as defined by perl itself. It might be a problem (I don't think it is), but that's how perl currently works, and as long as DBD::mysql doesn't handle it as intended, it will be buggy.
>
> > > The patch thus requires the same workarounds needed for utf-8 for binary data now - that's the "wrong in the other direction".
> >
> > I will think about it... But pack/unpack/vec/... functions works also on string "\xc3\xbc" with utf8 flag same as on "\xfc" without utf8 flag...
>
> The "but" is weird, because your patch doesn't do that, unlike pack/unpack (at least they got it right after I fixed them). The key here is to understand that your patch doesn't work on these two strings the same way, even though it should.

Yes, the driver should work in the same way as those functions. You are right, and all those warnings are really wrong... I will try to fix the code. Thank you for the first review!

> > > I can't see whether this is intended or not - calling str_is_nonascii on an utf-8 encoded scalar doesn't seem to make much sense to me (binary data is 8 bit wide, not 7 bit). On the other hand, this seems to be in the patch multiple times.
> >
> > This is just optimization. SvPV returns data buffer in utf8 encoded or byte (latin1) encoded based on SvUTF8 flag. But plain ASCII data are same in both those encodings, so both functions SvPVbyte and SvPVutf8 returns exactly same data in that case. Checking str_is_nonascii is just optimization if SvPVutf8 is really needed to call...
>
> Are you really telling me that issuing a warning is some kind of optimisation? Because that's what the patch does after testing str_is_nonascii. That doesn't look like an optimisation to me, in fact, it is a bug :)

With that description I meant this code pattern:

    valbuf = SvPV(ph->value, vallen);
    if (enable_utf8 && !is_binary && !SvUTF8(ph->value) && str_is_nonascii(valbuf, vallen)) {
        SV *tmp = sv_2mortal(newSVpvn(valbuf, vallen));
        valbuf = SvPVutf8(tmp, vallen);
    }

About the warning, yes... that code is wrong.
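For readers who do not write XS, the intent of that pattern for text parameters can be sketched in pure Perl. This is an illustration under assumptions, not the patch itself: the body of str_is_nonascii is not shown in this thread, so `is_nonascii` below assumes it reports whether any character is above 0x7F, and `text_octets` is a hypothetical name.

```perl
use strict;
use warnings;

# Assumed meaning of the patch's str_is_nonascii(): any character above 0x7F.
sub is_nonascii {
    my ($value) = @_;
    return $value =~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical helper: the octets to send for a text parameter when
# UTF-8 is enabled on the connection.
sub text_octets {
    my ($value) = @_;
    if (is_nonascii($value)) {
        utf8::encode($value);      # characters -> UTF-8 octets, flag cleared
    }
    else {
        utf8::downgrade($value);   # ASCII only: cannot fail, just clears the flag
    }
    return $value;
}

my $up = "\xfc";
utf8::upgrade($up);
for my $input ('abc', "\xfc", $up) {
    print unpack('H*', text_octets($input)), "\n";
}
```

Unlike the C excerpt, this sketch never inspects the flag. It works on characters only, so the upgraded and downgraded forms of the same string produce the same octets.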

> > > Don't have time to try it out, and maybe I am overlooking something - again, this is just a quick scan of the patch really. However, the only way to succeed, IMHO, is to get rid of the idea of detecting or guessing encoding from perl scalars - the UTF8 flag _never_ indicates that the string data is utf-8 encoded.
> >
> > For char* value retrieved by SvPV() call, UTF8 flag really indicates if that char* value is utf8 encoded or not.
>
> Unfortunately no - the UTF8 flag merely indicates how the perl codepoints are stored, it doesn't say anything about whether the char * is utf8 encoded or not. In general, utf-8 encoded SVs do _not_ have the UTF8 flag set (but they might).

When does a utf8 encoded SV not have the UTF8 flag set? Do you have an example? I really thought that the UTF8 status flag indicates that the char* returned by SvPV() is utf8 encoded.

Also, perlapi says:

    SvUTF8  Returns a U32 value indicating whether the SV contains UTF-8 encoded data. Call this after SvPV() in case any call to string overloading updates the internal flag.

From this I understood that the UTF8 status flag indicates whether the SvPV() buffer is utf8 encoded or not.

> > But you are right that it does not tell if perl scalar accessed by pure perl functions are utf8 encoded or are nativelly in perl.
>
> Maybe you mean the right thing, but the patch is wrong and your explanations are as well.
>
> The UTF8 flag business in perl is really messy, and I wish it wasn't called "UTF8", but it really doesn't tell you anything about character encoding or whether the scalar is text or binary, it only tells you how the codepoints are stored (namely either as plain octets or in a format similar to utf-8 encoding, without being utf-8).
>
> > All such guessing is wrong way. Driver should get either binary scalar or string scalar.
>
> Exactly, the driver should handle binary and text correctly - your patch seems to go a long way towards handling text correctly. It would just be nice if it wouldn't break the binary case even more :)
>
> Greetings,

Fri Nov 04 05:11:35 2016 schmorp [...] schmorp.de - Correspondence added

CC:	MLEHMANN [...] cpan.org, pali [...] cpan.org, DBOOK [...] cpan.org
Subject:	Re: [rt.cpan.org #87428] data corruption: DBD::mysql ignores the utf8-flag
Date:	Fri, 4 Nov 2016 10:11:21 +0100
To:	Pali via RT <bug-DBD-mysql [...] rt.cpan.org>
From:	Marc Lehmann <schmorp [...] schmorp.de>

Sorry for the delay, I am quite busy.

On Sun, Oct 30, 2016 at 07:45:44PM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:

> > Are you really telling me that issuing a warning is some kind of optimisation? Because that's what the patch does after testing str_is_nonascii. That doesn't look like an optimisation to me, in fact, it is a bug :)
>
> With that description I mean code pattern:

Somewhat off-topic: most modules simply use SvPVutf8/SvPVbyte, without making a copy, so the optimisation should not normally be necessary. This normally also works, as perl itself makes a temporary copy in those cases where the scalar is not mutable, and presumably knows better, so the optimisation is probably a deoptimisation in practise, as perl does not have to scan the string in general.

It is, however, correct, so you might stay with this approach if you have a reason to do it differently than other parts of perl.

> > not. In general, utf-8 encoded SVs do _not_ have the UTF8 flag set (but they might).
>
> When utf8 encoded SV do not have the UTF8 flag set? Do you have example? I really thought that UTF8 status flag indicate that char* returned by SvPV() is utf8 encoded.

You see, that's the problem with the flag - it simply doesn't mean anything like "the scalar is utf-8 encoded". First of all, perl's "UTF8" encoding isn't the same as unicode's utf-8 encoding, and second, it really only is a way of representing code points > 255 in a multibyte way.

This scalar is utf-8 encoded, as a matter of fact. It is also binary data, as utf-8 data is always binary:

    my $sv = "\xc3\xbc";

But it might or might not have the utf8 flag set (this depends on the perl version and other factors). Likewise, this scalar is utf-8 encoded:

    utf8::encode $sv;

But it does not have the utf8 flag set.

That's why it is so dangerous to use "utf8 encoded" to talk about these things, as it's never clear whether the actual data is meant or perl's utf-8 like internal encoding.

In my experience, it is much safer to just say upgraded or downgraded, as then it's much harder to subconsciously fall into this trap.

> Also in perlapi is written:
>
> SvUTF8 Returns a U32 value indicating whether the SV contains UTF-8 encoded data. Call this after SvPV() in case any call to string overloading updates the internal flag.
>
> Which I understood that UTF8 status flag indicates if SvPV() buffer is utf8 encoded or not.

Yeah, it's not. It's really a horrible, horrible mess. It means that the character codes inside the scalar use perl's extended multibyte encoding, confusingly called utf8, but it doesn't mean the SV contains utf-8 encoded data AT ALL. And the best thing is, you know this, but let yourself get confused by the bad documentation.

In case I am not clear enough (it's hard to be clear with all this confusing documentation), a string with character code 200 ("chr 200") can have this flag set or not, but in no case is *the scalar* utf-8 encoded. Just that if the utf-8 flag is set, it means the character codes use an encoding very similar to utf-8 (for example, chr 0x200000 results in invalid utf-8 in memory, but is representable in perl's encoding).

So basically, what I am saying is that it isn't useful to talk about these utf8 flags in perl as if they indicated utf-8 encoding of the actual data in some way. Even people who know this regularly confuse themselves, and in my experience, you get bugs this way.

Otherwise, it's great to hear that you clearly know your business around utf-8 and the patch is going to be fixed.

Now the big question is how to proceed in general, as by all appearances, DBD::mysql is unmaintained and the maintainers no longer respond to mail.
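The difference between "the scalar holds UTF-8 encoded data" and "the utf8 flag is set" can be seen with a few core functions. The sketch below is only an illustration of the two examples from the message; the variable names are made up for this example.

```perl
use strict;
use warnings;

# Two characters, 0xC3 and 0xBC. Read as octets, they are also the
# UTF-8 encoding of the single character 0xFC.
my $sv       = "\xc3\xbc";
my $upgraded = $sv;
utf8::upgrade($upgraded);     # flag set, still the same two characters

# One character with code 200, and its UTF-8 encoding produced in place.
my $char    = chr 200;
my $encoded = $char;
utf8::encode($encoded);       # now two octets, flag cleared

for my $case (['$sv', $sv], ['$upgraded', $upgraded],
              ['$char', $char], ['$encoded', $encoded]) {
    my ($name, $value) = @$case;
    printf "%-10s flag=%d length=%d\n",
        $name, utf8::is_utf8($value) ? 1 : 0, length($value);
}
```

`$upgraded` carries the flag although its content is the same as `$sv`. `$encoded` holds genuine UTF-8 data but has the flag cleared. So the flag says nothing about what the data means.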

Fri Nov 04 05:51:49 2016 pali [...] cpan.org - Correspondence added

On Fri Nov 04 05:11:35 2016, schmorp@schmorp.de wrote:

> Sorry for the delay, I am quite busy.
>
> On Sun, Oct 30, 2016 at 07:45:44PM -0400, Pali via RT <bug-DBD-mysql@rt.cpan.org> wrote:
> > > Are you really telling me that issuing a warning is some kind of optimisation? Because that's what the patch does after testing str_is_nonascii. That doesn't look like an optimisation to me, in fact, it is a bug :)
> >
> > With that description I mean code pattern:
>
> Somewhat off-topic: most modules simply use SvPVutf8/SvPVbyte, without making a copy, so the optimisation should not normally be necessary.

It is not only an optimisation; it is also because SvPVbyte() croaks on "wide" characters. I do not want to introduce croaks, so DBD::mysql shows a warning instead. If a "wide" character cannot be downgraded to Latin1, then its UTF-8 representation is used. print behaves exactly the same way when it is passed a wide character without a :utf8 layer.
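The fallback described here (downgrade if possible, otherwise warn and send UTF-8, as print does) can be sketched in pure Perl. `binary_octets` is a hypothetical helper written for this illustration, and the warning text is made up; the driver's actual code and message may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: the octets to send for a binary parameter.
sub binary_octets {
    my ($value) = @_;

    # With a true second argument, utf8::downgrade returns false instead
    # of dying when the string holds a character above 255.
    return $value if utf8::downgrade($value, 1);

    # Fallback similar to print without a :utf8 layer: warn, then use
    # the UTF-8 representation of the characters.
    warn "Wide character in binary parameter\n";
    utf8::encode($value);
    return $value;
}

my $upgraded = "\xfc";
utf8::upgrade($upgraded);
print unpack('H*', binary_octets($upgraded)), "\n";
```

An upgraded string with no wide characters comes back as the same single octet as its downgraded form, with no warning. Only a character above 255 takes the fallback path.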

> This normally also works, as perl itself makes a temporary copy in those cases where the scalar is not mutable, and presumably knows better, so the optimisation is probably a deoptimisation in practise, as perl does not have to scan the string in general.
>
> It is, however, correct, so you might stay with this approach if you have a reason to do it differently than other parts of perl.

I think not crashing existing code is a really good reason.

> > > not. In general, utf-8 encoded SVs do _not_ have the UTF8 flag set (but they might).
> >
> > When utf8 encoded SV do not have the UTF8 flag set? Do you have example? I really thought that UTF8 status flag indicate that char* returned by SvPV() is utf8 encoded.
>
> You see, that's the problem with the flag - it simply doesn't mean anything like "the scalar is utf-8 encoded". First of all, perl's "UTF8" encoding isn't the same as unicode's utf-8 encoding, and second, it really only is a way of representing code points > 255 in a multibyte way.
>
> This scalar is utf-8 encoded, as a matter of fact. It is also binary data, as utf-8 data is always binary:
>
>     my $sv = "\xc3\xbc";
>
> But it might or might not have the utf8 flag set (this depends on the perl version and other factors). Likewise, this scalar is utf-8 encoded:
>
>     utf8::encode $sv;
>
> But it does not have the utf8 flag set.
>
> That's why it is so dangerous to use "utf8 encoded" to talk about these things, as it's never clear whether the actual data is meant or perl's utf-8 like internal encoding.
>
> In my experience, it is much safer to just say upgraded or downgraded, as then it's much harder to subconsciously fall into this trap.

Now I understand what you mean by "utf8 encoded". Basically, a string scalar in perl contains a sequence of numbers, where each number represents exactly one character. We have two different internal representations of strings in perl (latin1 and extended utf8, or ebcdic and the special utfebcdic respectively), and pure perl code does not see any difference between them. By "utf8 encoded" you mean that the "numbers" represent a utf8 sequence of octets, right?

I used the term "utf8 encoded" for the case where the macro SvPV() returns a C char* which is "utf8 encoded" (not UTF-8, but perl's extended utf8). This is different! If SvUTF8() returns true, then the previous SvPV() call returned a C char* which is "utf8 encoded", that is, the char* contains a string in perl's extended utf8. SvUTF8 is a sufficient condition but not a necessary one. As you pointed out, utf8::encode($sv) unsets the SvUTF8 flag, but SvPV() still returns a char* in perl's extended utf8 encoding.

> > Also in perlapi is written:
> >
> > SvUTF8 Returns a U32 value indicating whether the SV contains UTF-8 encoded data. Call this after SvPV() in case any call to string overloading updates the internal flag.
> >
> > Which I understood that UTF8 status flag indicates if SvPV() buffer is utf8 encoded or not.
>
> Yeah, it's not. It's really a horrible, horrible mess. It means that the character codes inside the scalar use perl's extended multibyte encoding, confusingly called utf8, but it doesn't mean the SV contains utf-8 encoded data AT ALL. And the best thing is, you know this, but let yourself get confused by the bad documentation.
>
> In case I am not clear enough (it's hard to be clear with all this confusing documentation), a string with character code 200 ("chr 200") can have this flag set or not, but in no case is *the scalar* utf-8 encoded. Just that if the utf-8 flag is set, it means the character codes use an encoding very similar to utf-8 (for example, chr 0x200000 results in invalid utf-8 in memory, but is representable in perl's encoding).
>
> So basically, what I am saying is that it isn't useful to talk about these utf8 flags in perl as if they indicated utf-8 encoding of the actual data in some way. Even people who know this regularly confuse themselves, and in my experience, you get bugs this way.
>
> Otherwise, it's great to hear that you clearly know your business around utf-8 and the patch is going to be fixed.

As perl scalars contain sequences of numbers, we can talk about "wide characters" and wide strings (a wide character is > 0xFF). I hope this is not confusing. And for a C char* we can talk about (perl's extended) utf8 encoding. SvUTF8 can be used only in the context of that char* data (not in a pure perl context).

Now I'm waiting for the new DBD::mysql release, because it changes some code around parameter parsing (which conflicts with my patch). After that I will rebase and publish a new version of the utf8 patches...

> Now the big question is how to proceed in general, as by all appearances, DBD::mysql is unmaintained and the maintainers do no longer respond to mail.

DBD::mysql is still maintained. New versions are released periodically, see: https://metacpan.org/pod/DBD::mysql

The last version is from Oct 20, 2016. Security fixes (like the one for CVE-2016-1246) are also delivered... I do not see any problem; the maintainers respond to email and also to pull requests on github.

PS: you do not need to CC me in this RT. I'm automatically CCed by RT, so your explicit CC just means that I get your emails twice :-)

Thu Dec 08 18:45:25 2016 pali [...] cpan.org - Correspondence added

The pull request is updated: https://github.com/perl5-dbi/DBD-mysql/pull/67

Now it should handle wide characters correctly. Marc Lehmann, can you look at it?

Fri Jan 06 03:56:03 2017 pali [...] cpan.org - Correspondence added

UTF-8 and Unicode fixes are now in DBD::mysql devel version 4.041_01. Please test.

Sun Jan 22 11:49:03 2017 MICHIELB [...] cpan.org - Status changed from 'open' to 'resolved'

Sun Jan 22 11:49:04 2017 MICHIELB [...] cpan.org - Fixed in 4.041_01 added

Sat Jul 01 05:17:21 2017 pali [...] cpan.org - Correspondence added

Reopening, fix was reverted in 4.043.

Sat Jul 01 05:17:28 2017 pali [...] cpan.org - Status changed from 'resolved' to 'open'

Wed Nov 15 02:52:18 2017 MICHIELB [...] cpan.org - Correspondence added

Ticket migrated to github as https://github.com/perl5-dbi/DBD-mysql/issues/197

Wed Nov 15 02:52:19 2017 MICHIELB [...] cpan.org - Status changed from 'open' to 'resolved'