Subject: | DBD-Pg 2.0.0_4, utf8 array data |
Date: | Mon, 21 Jan 2008 16:57:06 +0100 |
To: | bug-DBD-Pg [...] rt.cpan.org |
From: | André Warnier <aw [...] ice-sa.com> |
Hi.
There seems to be a problem in the handling of INSERT for utf-8 text[]
array data.
The platform :
# DBI Version 1.601
# DBD::Pg Version 2.0.0_4
# Perl Version 5.8.8
# OS linux
platform : Linux arthur 2.6.18-4-686 #1 SMP Wed May 9 23:03:12 UTC 2007
i686 GNU/Linux (Debian Etch)
Postgres version() :
PostgreSQL 8.1.11 on i486-pc-linux-gnu, compiled by GCC cc (GCC) 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)
The error message :
[2008/01/21-13:18:40] [warning] DBD::Pg::st execute failed: ERROR:
invalid byte sequence for encoding "UTF8": 0xfe
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
This occurs when executing a prepared statement, loading some text data
into a text[] column, when the text data of one or more of the array
contains non-US-ASCII characters.
The target Pg database was created with an encoding of UTF8.
The process loading the data via DBD-Pg has LANG and LC_CTYPE set to
"en_GB.utf8". (using another locale does not change the problem).
I have also tried :
1) to set the attribute { pg_enable_utf8 => 1 } in the DBI->connect() call
2) after the connect call, to do :
binmode($dbh->{pg_socket},':utf8');
but none of those seem to make a difference.
Description of program :
The program reads an iso-8859-1 text data file, and loads this data
first into a Perl Hashtable, and from there into a Pg table via a
prepared and executed INSERT statement. The original data, and the
columns of the Pg table are type 'text', or text[] arrays.
The data is read into Perl via a filhandle opened with the proper I/O
layer, like : open(IN,'<:encoding('iso-8859-1',$file)
The data loading works fine, as long as the elements of the text arrays
do not contain multi-byte utf8 data (chr codes below 128 decimal). As
soon as one element contains a multi-byte utf8 character however, the
message above appears and the execute() fails.
The offending data looks originally like this :
[sbed element] = Arbeitsschutz
[sbed element] = Atemschutz
[sbed element] = Gruppeneinteilung
[sbed element] = Kenntlichmachung
[sbed element] = Markierung
[sbed element] = Schutzausrüstung
[sbed element] = Schutzgerät
[sbee element] = Classification
[sbee element] = Marking
[sbee element] = Markings
[sbee element] = Occupational safety
[sbee element] = Protective equipment
[sbee element] = Respiratory protections
[sbef element] = Sécurité industrielle
[sbef element] = Hygiène du travail
[sbef element] = Sécurité du travail
[sbef element] = Protection respiratoire
[sbef element] = Marquage
[sbef element] = Equipement de protection
If, prior to loading the data into Pg, I test it in my program with the
Encode::is_utf8() function, I can confirm that as far as Perl is
concerned, this is properly utf8-marked data.
If, prior to loading this data into Pg, I modify the strings above with
a regexp to replace the various "diacritic" characters by for example a
"-" (minus) sign, then the data loads fine.
The execute statment is as follows :
$sth->execute(@values_list)
where @values_list is an array contain elements that are either
- single scalar text variables
or
- references to arrays whose element contain individual text strings.
The problem above does not seem to appear for non-array 'text' columns,
even when the data contains utf8 multi-byte characters, and only appears
for array-type data and columns.
I am quite experienced at using utf8 data in Perl, and converting
to/from various encodings.
Best regards,
André