
This queue is for tickets about the DBD-Pg CPAN distribution.

Report information
The Basics
Id: 32479
Status: resolved
Priority: 0/
Queue: DBD-Pg

People
Owner: Nobody in particular
Requestors: aw [...] ice-sa.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: 2.0.0



Subject: DBD-Pg 2.0.0_4, utf8 array data
Date: Mon, 21 Jan 2008 16:57:06 +0100
To: bug-DBD-Pg [...] rt.cpan.org
From: André Warnier <aw [...] ice-sa.com>
Hi.

There seems to be a problem in the handling of INSERT for utf-8 text[] array data.

The platform:
  # DBI Version 1.601
  # DBD::Pg Version 2.0.0_4
  # Perl Version 5.8.8
  # OS linux
  platform: Linux arthur 2.6.18-4-686 #1 SMP Wed May 9 23:03:12 UTC 2007 i686 GNU/Linux (Debian Etch)
  Postgres version(): PostgreSQL 8.1.11 on i486-pc-linux-gnu, compiled by GCC cc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

The error message:
  [2008/01/21-13:18:40] [warning] DBD::Pg::st execute failed: ERROR: invalid byte sequence for encoding "UTF8": 0xfe
  HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

This occurs when executing a prepared statement that loads text data into a text[] column, when one or more elements of the array contain non-US-ASCII characters. The target Pg database was created with an encoding of UTF8. The process loading the data via DBD-Pg has LANG and LC_CTYPE set to "en_GB.utf8" (using another locale does not change the problem).

I have also tried:
1) setting the attribute { pg_enable_utf8 => 1 } in the DBI->connect() call
2) after the connect call, doing: binmode($dbh->{pg_socket},':utf8');
but neither seems to make a difference.

Description of the program: it reads an iso-8859-1 text data file, loads the data first into a Perl hash, and from there into a Pg table via a prepared and executed INSERT statement. The original data, and the columns of the Pg table, are of type 'text' or text[] arrays. The data is read into Perl via a filehandle opened with the proper I/O layer, like:

  open(IN,'<:encoding(iso-8859-1)',$file)

The data loading works fine as long as the elements of the text arrays contain only single-byte characters (chr codes below 128 decimal). As soon as one element contains a multi-byte utf8 character, however, the message above appears and the execute() fails.
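[A minimal, self-contained sketch of the decoding step described above. The file contents here are made up for illustration; an in-memory filehandle stands in for the real data file. Reading iso-8859-1 bytes through an :encoding layer yields a Perl character string, which Encode::is_utf8 reports as internally utf8-flagged once it contains non-ASCII characters.]

```perl
use strict;
use warnings;
use Encode qw(is_utf8);

# Simulate an iso-8859-1 data file with an in-memory filehandle;
# 0xFC is LATIN SMALL LETTER U WITH DIAERESIS in iso-8859-1.
my $latin1_bytes = "Schutzausr\xFCstung\n";
open(my $in, '<:encoding(iso-8859-1)', \$latin1_bytes)
    or die "open: $!";
my $line = <$in>;
chomp $line;
close $in;

# After decoding, the string holds characters, not bytes, and Perl
# marks it with the internal utf8 flag.
print is_utf8($line) ? "utf8-flagged\n" : "bytes\n";   # utf8-flagged
print length($line), "\n";                             # 16 (characters)
```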
The offending data originally looks like this:

  [sbed element] = Arbeitsschutz
  [sbed element] = Atemschutz
  [sbed element] = Gruppeneinteilung
  [sbed element] = Kenntlichmachung
  [sbed element] = Markierung
  [sbed element] = Schutzausrüstung
  [sbed element] = Schutzgerät
  [sbee element] = Classification
  [sbee element] = Marking
  [sbee element] = Markings
  [sbee element] = Occupational safety
  [sbee element] = Protective equipment
  [sbee element] = Respiratory protections
  [sbef element] = Sécurité industrielle
  [sbef element] = Hygiène du travail
  [sbef element] = Sécurité du travail
  [sbef element] = Protection respiratoire
  [sbef element] = Marquage
  [sbef element] = Equipement de protection

If, prior to loading the data into Pg, I test it in my program with the Encode::is_utf8() function, I can confirm that as far as Perl is concerned, this is properly utf8-marked data. If, prior to loading, I modify the strings above with a regexp to replace the various "diacritic" characters with, for example, a "-" (minus) sign, then the data loads fine.

The execute statement is as follows:

  $sth->execute(@values_list)

where @values_list is an array containing elements that are either:
- single scalar text variables, or
- references to arrays whose elements contain individual text strings.

The problem does not seem to appear for non-array 'text' columns, even when the data contains utf8 multi-byte characters; it appears only for array-type data and columns. I am quite experienced at using utf8 data in Perl and converting to/from various encodings.

Best regards,
André
Thanks for the detailed report. This should be fixed as of r10631. There are also some new tests in t/09array.t: it would be much appreciated if someone good with utf-8 (e.g. not me) would look them over.
Subject: Re: [rt.cpan.org #32479] DBD-Pg 2.0.0_4, utf8 array data
Date: Tue, 22 Jan 2008 17:26:41 +0100
To: bug-DBD-Pg [...] rt.cpan.org
From: André Warnier <aw [...] ice-sa.com>
Hi. Greg Sabino Mullane via RT wrote:
> <URL: http://rt.cpan.org/Ticket/Display.html?id=32479 >
>
> Thanks for the detailed report. This should be fixed as of r10631. There
> are also some new tests in t/09array.t: it would be much appreciated if
> someone good with utf-8 (e.g. not me) would look them over.
r10631 looks good. I have loaded several thousand rows with utf-8 text scalar and array data, and so far it seems ok. Many thanks for the quick fix.

About the tests: I do have some good experience handling utf-8 data and conversions to/from it in Perl, but I am no wizard programmer, and a real beginner regarding DBI, DBD, and relational DBMS systems in general. Thus the test code in t/09array.t is frankly beyond my capacities. All I can say is that the creation of a Perl utf-8 string via

  my $utf8_str = chr(0x100).'dam';

will work no matter which locale the script is running under (which is not as obvious as it seems). As long as this string is not modified in any way, Perl will not "downgrade" it back to non-utf8. But if you modify the string (such as by adding quotes), then you have to be careful: I believe that Perl might, in some cases, discard the original string and create a new one which may not have the utf8 flag set. Because of the character chosen, this may never happen under an iso-latin-1 locale, but it might happen if the script is run under a locale where this character is a "byte". The way to check whether the resulting string is still utf8-encoded (or, rather, marked as such) is to test it with Encode::is_utf8($string).

As a more general note: I find that there is some ambiguity regarding the 'pg_enable_utf8' attribute, and I even wonder if it should be there at all. I know this is a messy subject, and I hope I will not mess up the following explanation of what I mean.

Perl has two internal representations of strings: a string is either a collection of "bytes" or a collection of "characters". In the latter case, the characters are in utf8 encoding, and the string is internally marked with "the utf8 flag on". For the sake of what follows, let's just assume that Perl does it right, always knows the encoding of its internal string values, and converts from one to the other appropriately whenever there is a need for it.
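[The chr(0x100) point above can be demonstrated in a few lines. This is only an illustrative sketch: it shows that the string is created utf8-flagged regardless of locale, and that this particular modification (wrapping in quotes) happens to preserve the flag, since concatenation with ASCII upgrades the result.]

```perl
use strict;
use warnings;
use Encode qw(is_utf8);

# chr(0x100) is LATIN CAPITAL LETTER A WITH MACRON: it cannot fit in a
# single byte, so Perl must store the string as characters.
my $utf8_str = chr(0x100) . 'dam';
print is_utf8($utf8_str) ? "flagged\n" : "not flagged\n";   # flagged

# Concatenating with a plain ASCII literal upgrades the result, so the
# flag survives this particular modification.
my $quoted = '"' . $utf8_str . '"';
print is_utf8($quoted) ? "flagged\n" : "not flagged\n";     # flagged
print length($quoted), "\n";                                # 6
```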
For text data that Perl reads in from the outside, Perl will convert (or not) the external data to one of its internal representations, according to the "I/O layer" specified for the corresponding filehandle (like open(FH,'<:utf8',$file)). In this case, the correct layer must be specified by the programmer, because Perl has no way to know the encoding of the external data. In other words, if you tell Perl that the external source is data in UTF-8, then Perl will treat all data received from there as UTF-8, convert it or not while reading it, and mark it accordingly in the internal variables into which it is read.

In the case of data read from Pg into Perl:
- Perl similarly always needs to know the encoding of any data which it reads, so as to be able to convert it correctly into its internal representation (*).
- The encoding of the incoming Pg data should be fixed by the encoding in which this data is stored in the Pg database.
- Unless I am mistaken, Pg always "knows" the encoding of the data stored in the columns of any given database, and DBD::Pg can find out by itself and set up the appropriate conversion I/O layer if needed. (This applies not only to UTF8, but to any encoding supported by Pg.)

On the Perl-to-Pg side: the Perl script has some data in one of its internal representations, which it wishes to write into a Pg table via DBD::Pg. If the Pg database is defined as being in some encoding, then writing data to the db using another encoding would always be invalid. If Perl knows the correct encoding to use for the database, it can convert any of its internal strings to the appropriate encoding for that Pg database. DBD::Pg can find out which is the appropriate encoding for the target database. Ergo, there should be no need for any separate attribute to tell Perl or DBD::Pg which conversion to use on the way from Perl to Pg.
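[The conversion being argued for can be sketched with the Encode module directly. The database handle is out of scope here; this only shows the character-to-octet step a driver would have to perform for a UTF8-encoded database, and the reverse on the way back. The sample string is hypothetical.]

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "Schutzger\x{E4}t";     # internal character string

# Perl -> Pg direction: convert internal characters to the octets a
# UTF8-encoded database expects on the wire.
my $octets = encode('UTF-8', $chars);
print length($chars),  "\n";   # 11 characters
print length($octets), "\n";   # 12 octets: the "ä" becomes two bytes

# Pg -> Perl direction: decode the octets back into characters.
my $back = decode('UTF-8', $octets);
print $back eq $chars ? "round-trip ok\n" : "mismatch\n";
```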
Just about the only reason I can think of for an attribute or setting is if, for some unusual reason, one wanted to tell DBD::Pg to *not* do any character translations at all and consider all the data as bytes (the equivalent of Perl's ":raw" layer). Maybe this could be two attributes:
- perl_to_pg_raw
- pg_to_perl_raw
(as 'in' or 'out' can be confusing to mere mortals like me), so as to be able to turn them on or off separately. I could see this as useful if one wanted to do some "bulk load" or "bulk unload" of data to/from Pg and knew in advance that no translation is needed either way, to avoid the translation overhead.

(*) The main question in my mind with the "pg_enable_utf8" attribute is the following: suppose I create a Perl script that accesses a Pg database and reads rows from it. Suppose the database has the UTF8 encoding (**), but I don't know this, so I don't set the "pg_enable_utf8" attribute (**). What happens when I read data from the database? In fact, why should I have to know the database encoding, unless my needs are really special? (I mean, considering that Perl and DBD::Pg could easily figure it out by themselves and do the appropriate conversions.)

(**) or any other encoding-related setting

I can easily imagine that things might not be as simple as setting up the filehandles to Pg with the appropriate encoding layers. There may also be a host of unintended interferences with DBI, or Proxy modules e.g. Maybe an initial step could be:
- to provide an easy call in DBD::Pg to get the internal database encoding (with possibly a translation between the Pg encoding codes and the corresponding Perl Encoding charset names)
- to provide explicit function calls to set the encoding layer to use (or attributes to the connect()).

Best regards,
André
Subject: Re: [rt.cpan.org #32479] DBD-Pg 2.0.0_4, utf8 array data
Date: Tue, 22 Jan 2008 09:36:10 -0800
To: bug-DBD-Pg [...] rt.cpan.org
From: "David E. Wheeler" <david [...] kineticode.com>
On Jan 22, 2008, at 08:28, André Warnier via RT wrote:
>> Thanks for the detailed report. This should be fixed as of r10631. There
>> are also some new tests in t/09array.t: it would be much appreciated if
>> someone good with utf-8 (e.g. not me) would look them over.
I added a few more tests, but it looks good to me. Best, David
CC: dbd-pg [...] perl.org
Subject: Re: [rt.cpan.org #32479] DBD-Pg 2.0.0_4, utf8 array data
Date: Wed, 23 Jan 2008 00:36:41 -0000
To: bug-DBD-Pg [...] rt.cpan.org
From: "Greg Sabino Mullane" <greg [...] turnstep.com>
> Many thanks for the quick fix.
You're welcome.
> The way to check if the resulting string is still utf8-encoded (or,
> rather, marked as such), is to test it with "Encode::is_utf8($string)".
David applied those checks, thanks to you both for that.
> I find that there is some ambiguity regarding the 'pg_enable_utf8'
> attribute, and even wonder if it should be there at all.
Yeah, I'm not sure either, but it may be due to the fact that we have to support Perl 5.6 as well. ...snip lots of good stuff...
> I could see this as useful if one wanted to do some "bulk load"
> or "bulk unload" of data to/from Pg, and knew in advance that no
> translation is needed either way, to avoid the translation overhead.
Yes, could be that too.
> - to provide an easy call in DBD::Pg to get the internal database
> encoding (with possibly a translation between the Pg encoding codes to
> the corresponding Perl Encoding charset names)
Yes, that should be easy enough.
> - to provide explicit function calls to set the encoding layer to use
> (or attributes to the connect()).
Likewise. Thanks. I'm hesitant to do anything more than fix bugs at this point, but we definitely need to hash out the whole utf-8 business in a future release.

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200801221933
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
Subject: Re: [rt.cpan.org #32479] DBD-Pg 2.0.0_4, utf8 array data
Date: Tue, 22 Jan 2008 16:51:25 -0800
To: bug-DBD-Pg [...] rt.cpan.org
From: "David E. Wheeler" <dwheeler [...] cpan.org>
On Jan 22, 2008, at 16:35, Greg Sabino Mullane via RT wrote:
>> I find that there is some ambiguity regarding the 'pg_enable_utf8'
>> attribute, and even wonder if it should be there at all.
>
> Yeah, I'm not sure either, but it may be due to the fact that we have
> to support Perl 5.6 as well.
I think that it's there for those of us who only use UTF-8 in the database, and so just want the flag to always be turned on, without doing the decoding.
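[What "just turning the flag on, without doing the decoding" amounts to can be shown with core utf8::decode, which validates already-valid UTF-8 octets and flips the flag in place rather than copying through Encode. This is a conceptual analogy with made-up data, not a claim about DBD::Pg's actual code path.]

```perl
use strict;
use warnings;
use Encode qw(is_utf8);

# Octets that happen to be valid UTF-8, as they might arrive from a
# UTF8-encoded database; \xC3\xBC is the UTF-8 encoding of "ü".
my $octets = "Gruppeneinteilung \xC3\xBC";
print is_utf8($octets) ? "flagged\n" : "not flagged\n";   # not flagged

# utf8::decode validates the octets and flips the flag in place:
# afterwards the same scalar is treated as characters.
utf8::decode($octets) or die "not valid UTF-8";
print is_utf8($octets) ? "flagged\n" : "not flagged\n";   # flagged
print length($octets), "\n";   # 19 characters (was 20 octets)
```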
> ...snip lots of good stuff...
Yes, I agree that it would be ideal if the driver could tell what encoding to use for the database. I discussed the utf8 issue with Tim Bunce on the DBI developers list a few years ago, and he seemed interested in having some sort of flag that would indicate whether or not data was utf8, similar to the pg_enable_utf8 flag (but smarter). This idea of yours, André, is an even better one, though. It might be worth bringing up on dbi-dev as a proposal for DBI in general, since all of the drivers have this problem, and all of the databases know their encodings.

Best,

David
I'm going to resolve this particular bug, and open a new "wishlist" item about better utf8 support in general.