Bug #47429 for DBD-DB2: DB2-Perl - fieldlength error because of Unicode Multibyte Character

Mon Jun 29 05:28:28 2009 christian.miksch [...] at.ibm.com - Ticket created

CC:	Philipp-Michael.Guehring [...] unicreditgroup.at,

I discovered an incompatibility between DBD::DB2 1.71 (with DB2-Connect V9.5) and DB2 v9 on z/OS. When I SELECT a CHAR field that includes characters that are multi-byte characters in UTF-8, then DBD::DB2 only allocates the number of characters that the field has in general (SQL_DESC_DISPLAY_SIZE) as the amount of bytes, retrieves that many bytes from DB2, and cuts off the rest of the field. Example: A field has the content "Gühring" and is defined as CHAR(7). The ü is a multi-byte character in UTF-8, therefore the string is 8 Bytes long in UTF-8. DBD::DB2 allocates only 7 bytes, discards the 8th byte, and returns "Gührin" to my application, which completely breaks the application.. There are several issues: · For querying, how much memory is needed, you should use SQL_DESC_OCTET_LENGTH instead of SQL_DESC_DISPLAY_SIZE, I guess. · Due to UTF-8 being dynamically multi-byte, the same CHAR field can have various different lengthes for every row, so if you want to allocate the needed memory dynamically, you have to allocate it on every row individually (like BLOB´s I guess), you can´t pre-allocate it for all rows. A workaround that helps a bit is to do fbh->dsize*=4; on line 1199 in the dbdimp.c , but that´s not the whole solution yet.

Mon Jun 29 23:54:58 2009 opendev [...] us.ibm.com - Status changed from 'new' to 'open'

Mon Jun 29 23:55:03 2009 opendev [...] us.ibm.com - Taken

Tue Jun 30 02:36:48 2009 opendev [...] us.ibm.com - Correspondence added

RT-Send-CC:

opendev [...] us.ibm.com

Hi Philipp I would like to request you the following information to help me out reproduce this problem. 1. Test case that helps me insert special characters into the table. 2. Sample create database statement with the correct codepage example for Japanese Characters:- db2 CREATE DATABASE SJISDB USING CODESET IBM-943 TERRITORY JP On Mon Jun 29 05:28:28 2009, Chris27x wrote: Show quoted text

> > I discovered an incompatibility between DBD::DB2 1.71 (with DB2-Connect > V9.5) and DB2 v9 on z/OS. > > When I SELECT a CHAR field that includes characters that are multi-byte > characters in UTF-8, > > then DBD::DB2 only allocates the number of characters that the field has > in general (SQL_DESC_DISPLAY_SIZE) > > as the amount of bytes, retrieves that many bytes from DB2, and cuts off > the rest of the field. > > > Example: > > > A field has the content "Gühring" and is defined as CHAR(7). > > > The ü is a multi-byte character in UTF-8, therefore the string is 8 Bytes > long in UTF-8. > > > DBD::DB2 allocates only 7 bytes, discards the 8th byte, and returns > "Gührin" to my application, which completely breaks the application.. > > > There are several issues: > > · For querying, how much memory is needed, you should use > SQL_DESC_OCTET_LENGTH instead of SQL_DESC_DISPLAY_SIZE, I guess. > > > · Due to UTF-8 being dynamically multi-byte, the same CHAR field can > have various different lengthes for every row, so if you want to allocate > the needed memory dynamically, you have to allocate it on every row > individually (like BLOB´s I guess), you can´t pre-allocate it for all

rows. Show quoted text

> > > A workaround that helps a bit is to do fbh->dsize*=4; on line 1199 in the > dbdimp.c , but that´s not the whole solution yet.

-- Thanks Tarun Pasrija IBM OpenSource Application Development Team India Software Labs, Bangalore (India)

Tue Jun 30 08:07:06 2009 christian.miksch [...] at.ibm.com - Fixed in 1.71 deleted

Tue Jun 30 08:07:06 2009 christian.miksch [...] at.ibm.com - Broken in 1.71 added

Tue Jun 30 08:09:34 2009 christian.miksch [...] at.ibm.com - Subject changed from (no value) to 'DB2-Perl - fieldlength error because of Unicode Multibyte Character'

Tue Jun 30 10:48:00 2009 Philipp-Michael.Guehring [...] unicreditgroup.at - Correspondence added

Subject:	RE: [rt.cpan.org #47429]
Date:	Tue, 30 Jun 2009 16:48:41 +0200
To:	<bug-DBD-DB2 [...] rt.cpan.org>, <christian.miksch [...] at.ibm.com>
From:	GÜHRING Philipp <Philipp-Michael.Guehring [...] unicreditgroup.at>

Hi Tarun, The table is defined as a normal EBCDIC table on z/OS, with the default values for Austria/Germany. I guess that you should also be able to reproduce it with a Japanese database. To insert data into the table, you can try this: INSERT INTO EMPLOYEE ( EMP_ID, LASTNAME) VALUES ( 10035, 'Gühring-ÄÖÜß'); Try to set the locale on the client computer to: export LC_ALL=de_DE.UTF-8 export DB2CODEPAGE=1208 to reproduce the problem. I previously forgot to add the error messages I get from DBD::DB2: Warning: Data in column 1 has been truncated to 8 bytes. A maximum of 13 bytes are available at ims.pl line 25. Warning: Data in column 4 has been truncated to 8 bytes. A maximum of 12 bytes are available at ims.pl line 25. Warning: Data in column 0 has been truncated to 4018 bytes. A maximum of 4049 bytes are available at ims.pl line 41. Warning: Data in column 0 has been truncated to 4018 bytes. A maximum of 4024 bytes are available at ims.pl line 48. Warning: Data in column 0 has been truncated to 4018 bytes. A maximum of 4019 bytes are available at ims.pl line 48. Warning: Data in column 0 has been truncated to 4018 bytes. A maximum of 4020 bytes are available at ims.pl line 48. I will send you additional traces via email directly. Best regards, Philipp Gühring Show quoted text

-----Original Message----- From: IBM OpenDev via RT [mailto:bug-DBD-DB2@rt.cpan.org] Sent: Dienstag, 30. Juni 2009 08:37 To: christian.miksch@at.ibm.com Cc: GÜHRING Philipp Subject: [rt.cpan.org #47429] <URL: https://rt.cpan.org/Ticket/Display.html?id=47429 > Hi Philipp I would like to request you the following information to help me out reproduce this problem. 1. Test case that helps me insert special characters into the table. 2. Sample create database statement with the correct codepage example for Japanese Characters:- db2 CREATE DATABASE SJISDB USING CODESET IBM-943 TERRITORY JP On Mon Jun 29 05:28:28 2009, Chris27x wrote:

> > I discovered an incompatibility between DBD::DB2 1.71 (with DB2-Connect > V9.5) and DB2 v9 on z/OS. > > When I SELECT a CHAR field that includes characters that are multi-byte > characters in UTF-8, > > then DBD::DB2 only allocates the number of characters that the field has > in general (SQL_DESC_DISPLAY_SIZE) > > as the amount of bytes, retrieves that many bytes from DB2, and cuts off > the rest of the field. > > > Example: > > > A field has the content "Gühring" and is defined as CHAR(7). > > > The ü is a multi-byte character in UTF-8, therefore the string is 8 Bytes > long in UTF-8. > > > DBD::DB2 allocates only 7 bytes, discards the 8th byte, and returns > "Gührin" to my application, which completely breaks the application.. > > > There are several issues: > > · For querying, how much memory is needed, you should use > SQL_DESC_OCTET_LENGTH instead of SQL_DESC_DISPLAY_SIZE, I guess. > > > · Due to UTF-8 being dynamically multi-byte, the same CHAR field can > have various different lengthes for every row, so if you want to allocate > the needed memory dynamically, you have to allocate it on every row > individually (like BLOB´s I guess), you can´t pre-allocate it for all

rows.

> > > A workaround that helps a bit is to do fbh->dsize*=4; on line 1199 in the > dbdimp.c , but that´s not the whole solution yet.

-- Thanks Tarun Pasrija IBM OpenSource Application Development Team India Software Labs, Bangalore (India)

Tue Jun 30 11:54:55 2009 Philipp-Michael.Guehring [...] unicreditgroup.at - Correspondence added

Subject:	RE: [rt.cpan.org #47429]
Date:	Tue, 30 Jun 2009 17:55:33 +0200
To:	<bug-DBD-DB2 [...] rt.cpan.org>, <christian.miksch [...] at.ibm.com>
From:	GÜHRING Philipp <Philipp-Michael.Guehring [...] unicreditgroup.at>

Hi Tarun, I tested it on Windows now, and got the following results: I tried it with the default configuration of DBD::DB2 1.71 on ActivePerl 5.10.0, with DB2-Connect v9.1.401.444. Running it worked, and it gave a correct result in Windows - ANSI (1252) codepage. Nothing has been cut off. Then I set the environment variable: SET DB2CODEPAGE=1208 When I run it again, it gave lots of errors like "Warning: Data in column 1 has been truncated to 4018 bytes. A maximum of 4033 bytes are available at ims.pl line 47", and buffer has been cut off again. So you should be able to reproduce it by setting DB2CODEPAGE=1208 and having German Umlauts or Japanese characters in your table. Best regards, Philipp Gühring

Thu Jul 02 08:44:46 2009 opendev [...] us.ibm.com - Correspondence added

RT-Send-CC:

opendev [...] us.ibm.com

Hi Philipp Sorry for little late reply. I was unable to reproduce the problem and thus was tied up in trying the scenario you mentioned but finally succeeded. :) Well, I also noticed that you mentioned when you run windows with DB2Codepage set to 1252, it gives correct results. In our driver, we try to get the entire information from the CLI itself and let it handle the encodings. Thus, it is important for the client codepage to be correct. If we see the Infocenter, http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/topic/com.ibm.db2.luw.admin.nls.doc/doc/r0004565.html we notice that for the Language DE and platform windows your client codepage needs to be 1252 only and for Linux it needs to be 819 for locale de_DE. I tried changing the DB2Codepage to 819 and this works fine. Changing the DISPLAY_SIZE to octet length does not help since there is a mismatch in the codepages of the client and server and still truncation would happen. Thus, while converting from EBCDIC server database to UTF-8 on client truncation happens on incompatible codepages. I believe creating the database mentioning the codeset and territory and keeping the same configuration on the client too helps in case of 1208 and I have tried it too. Hope this helps. On Tue Jun 30 11:54:55 2009, Philipp-Michael.Guehring@unicreditgroup.at wrote: Show quoted text

> Hi Tarun, > > I tested it on Windows now, and got the following results: > > I tried it with the default configuration of DBD::DB2 1.71 on > ActivePerl 5.10.0, with DB2-Connect v9.1.401.444. > > Running it worked, and it gave a correct result in Windows - ANSI > (1252) codepage. Nothing has been cut off. > > Then I set the environment variable: > SET DB2CODEPAGE=1208 > > When I run it again, it gave lots of errors like "Warning: Data in > column 1 has been truncated to 4018 bytes. A maximum of 4033 bytes are > available at ims.pl line 47", and buffer has been cut off again. > > So you should be able to reproduce it by setting DB2CODEPAGE=1208 and > having German Umlauts or Japanese characters in your table. > > Best regards, > Philipp Gühring

-- Thanks Tarun Pasrija IBM OpenSource Application Development Team India Software Labs, Bangalore (India)

Fri Jul 03 01:28:59 2009 opendev [...] us.ibm.com - Priority changed from (no value) to '2'

Wed Jul 10 07:36:49 2013 JACOB [...] cpan.org - Correspondence added

RT-Send-CC:

opendev [...] us.ibm.com

Hi, if ( app_codepage != db_codepage) { fbh->rlen = bufferSizeRequired = (4*fbh->dsize)+1;/* +1: STRING null terminator */ } else { fbh->rlen = bufferSizeRequired = fbh->dsize+1;/* +1: STRING null terminator */ } I think this is the faulty logic here. It has evolved since the original report, but there's still one case where it fails: When application and database both use the same MBCS, e.g. 1208 (UTF-8).

Wed Jul 10 08:03:59 2013 JACOB [...] cpan.org - Broken in 1.84 added

Wed Jul 10 08:03:59 2013 JACOB [...] cpan.org - Broken in 1.85 added