Subject: FW: DBD::DB2 UTF-8 Incompatibility / Bug
Date: Fri, 26 Jun 2009 18:06:21 +0200
To: <bug-DBD-DB2 [...] rt.cpan.org>
From: GÜHRING Philipp <Philipp-Michael.Guehring [...] unicreditgroup.at>
Hi,
I discovered an incompatibility between DBD::DB2 1.71
(with DB2-Connect V9.5 on Ubuntu Linux) and DB2 v9CM on z/OS.
When I SELECT a CHAR field whose content includes characters that
are multi-byte in UTF-8, DBD::DB2 takes the field's length in
characters (SQL_DESC_DISPLAY_SIZE) as the buffer size in bytes,
retrieves only that many bytes from DB2,
and cuts off the rest of the field.
Example:
A field has the content "Gühring" and is defined as CHAR(7).
The ü is a multi-byte character in UTF-8 (2 bytes), therefore the
string is 8 bytes long in UTF-8.
DBD::DB2 allocates only 7 bytes, discards the 8th byte,
and returns "Gührin" to my application, which breaks the application.
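The truncation can be reproduced outside the driver. The following is only a minimal sketch in plain C, not code from dbdimp.c: utf8_char_count and fetch_into are hypothetical helper names, and fetch_into merely imitates a fetch into a buffer whose size was taken from the character count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count UTF-8 characters: every byte that is not a continuation
 * byte (bit pattern 10xxxxxx) starts a new character. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical stand-in for a fetch: copy at most buf_size bytes of
 * src into dst and NUL-terminate. dst must hold buf_size + 1 bytes.
 * Returns the number of bytes copied. */
size_t fetch_into(char *dst, size_t buf_size, const char *src)
{
    size_t len, n;
    assert(dst != NULL && src != NULL);
    len = strlen(src);
    n = len < buf_size ? len : buf_size;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

With "Gühring" encoded as UTF-8, utf8_char_count reports 7 characters while strlen reports 8 bytes, so a 7-byte buffer loses the final "g".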
There are several issues:
* To determine how much memory is needed, SQL_DESC_OCTET_LENGTH
should be used instead of SQL_DESC_DISPLAY_SIZE, I guess.
* Because UTF-8 is a variable-width encoding, the same CHAR field
can have a different byte length in every row.
The current code pre-allocates the fields with the field length,
which would work with single-byte code pages, but it does not work
with multi-byte code pages.
If you want to continue pre-allocating the needed memory, you have
to allocate at least 4 bytes per character.
If you want to allocate it dynamically for every row individually
(like the BLOB handling), you can't pre-allocate it for all rows.
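The per-row alternative could look roughly like the sketch below. It is an assumption about the general approach, not the BLOB code in DBD::DB2: col_buffer and col_buffer_reserve are hypothetical names, and the byte length needed for each row would have to come from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-column buffer that is grown on demand. */
typedef struct {
    char  *data;      /* heap buffer, NULL until first use */
    size_t capacity;  /* bytes currently allocated */
} col_buffer;

/* Make sure the buffer can hold `needed` bytes for the current row.
 * The buffer only grows, so it is reused across rows.
 * Returns 0 on success, -1 if the allocation fails. */
int col_buffer_reserve(col_buffer *b, size_t needed)
{
    char *p;
    assert(b != NULL);
    if (needed <= b->capacity)
        return 0;               /* already large enough */
    p = realloc(b->data, needed);
    if (p == NULL)
        return -1;              /* old buffer is left untouched */
    b->data = p;
    b->capacity = needed;
    return 0;
}
```

Because the buffer never shrinks, a result set with rows of varying byte length only triggers a reallocation when a row is longer than any row seen before.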
A workaround that helps a bit is to do fbh->dsize *= 4;
on line 1199 of dbdimp.c, but that is not the whole solution yet.
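For illustration, the worst-case sizing behind that factor of 4 can be written as a small standalone function. This is a sketch under the assumption that the column length is given in characters; utf8_worst_case_size is a hypothetical name and is not part of dbdimp.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A UTF-8 encoded character occupies at most 4 bytes. */
#define UTF8_MAX_BYTES_PER_CHAR 4

/* Compute the worst-case byte size for a column of char_len
 * characters (without a terminating NUL) and store it in *out.
 * Returns 0 on success, -1 if the multiplication would overflow. */
int utf8_worst_case_size(size_t char_len, size_t *out)
{
    assert(out != NULL);
    if (char_len > SIZE_MAX / UTF8_MAX_BYTES_PER_CHAR)
        return -1;
    *out = char_len * UTF8_MAX_BYTES_PER_CHAR;
    return 0;
}
```

For the CHAR(7) field from the example this yields 28 bytes, which is enough for "Gühring" (8 bytes) but over-allocates for most real data.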
Best regards,
Philipp Gühring