Bug #95214 for DBD-Pg: DBD::Pg 3.x BYTEA issues with UTF-8 encoded database

Wed Apr 30 12:49:07 2014 michael.graziano [...] premierheart.com - Ticket created

Subject:	DBD::Pg 3.x BYTEA issues with UTF-8 encoded database
Date:	Wed, 30 Apr 2014 12:48:51 -0400
To:	bug-DBD-Pg [...] rt.cpan.org
From:	Michael Graziano <michael.graziano [...] premierheart.com>

We’ve encountered a regression in DBD::Pg 3.0 which I believe is related to the new unicode handling - bytea strings being passed to the database are being mangled when inserting into a utf-8 encoded database. The affected perl code worked on both SQL_ASCII and utf-8 encoded databases under the 2.19.3 DBD::Pg driver. Under the 3.x driver the bytea data is corrupted when we insert into a utf-8 encoded database (the inserted data is longer than the contents being inserted), but the code continues to work normally on a SQL_ASCII encoded database. The affected code (extract) is: ####### BEGIN EXCERPT ####### # Generate File1 $file1_contents = pack ( "a13a4a2a17a14a13a17a16a2LSaaaaaaa20a12aaa94Sa11LS”, $fn, $a, $s, $n, 0, $h, $cid, "0000000", "N", $date, $hid, "", "", "", "", "", "", "", "", "", "", "NNNNNNNNNN", 0, "", &UnixDate($db, “%s"), 0x6E ); # Generate File2 from File1 and raw data ## raw_data is from sysread() $file2_contents = pack( “a256a32776”, $file1_contents, $raw_data ); $stmt = $db->prepare(“INSERT INTO $table ( field1, field2) VALUES ( ?, ? ) “ ); if (! defined($stmt) ) { show_error( "Unable to prepare INSERT $table : $DBI::errstr" ); $db->disconnect(); exit -3; } $stmt->bind_param( 1, $file1_contents, { pg_type => DBD::Pg::PG_BYTEA } ); $stmt->bind_param( 2, $file2_contents, { pg_type => DBD::Pg::PG_BYTEA } ); if (! $stmt->execute() ) { show_error( "Unable to INSERT $table : $DBI::errstr" ); $stmt->finish; $db->disconnect(); exit -4; } $stmt->finish; ######## END EXCERPT ######## It appears this is still the correct way to handle bytea inserts and should “Just Work”, so I’m raising this as a bug - if we’re screwing up something obvious please let me know. (Unfortunately recoding to SQL_ASCII is absolutely not an option for us - in-database encoding enforcement is needed.) OS Information: FreeBSD dbi01.dev.premierheart.com 9.1-RELEASE-p11 FreeBSD 9.1-RELEASE-p11 #2 r264302: Wed Apr 9 16:36:12 EDT 2014 root@SYSBUILDER.dev.premierheart.com:/usr/obj/usr/src/sys/ph-amd64 amd64 Database Information: PostgreSQL 9.3.1 Perl Information: This is perl 5, version 18, subversion 2 (v5.18.2) built for amd64-freebsd-thread-multi Tested broken on: DBD::Pg 3.0.0 ; DBD::Pg 3.1.1 Using DBI 1.631

Download signature.asc
application/pgp-signature 203b

Message body not shown because it is not plain text.

Wed Apr 30 14:13:59 2014 greg [...] turnstep.com - Taken

Wed Apr 30 14:29:27 2014 greg [...] turnstep.com - Correspondence added

That *should* work, sorry to hear it is not. Can you make the test case a little more self-contained so I can duplicate it (e.g. I don't know what $fn, $a, $s, $n, 0, $h, $cid) are set to. A slightly modified version worked for me, so ideally I would use the exact same data and table. Also, what error are you seeing / how do you know it's the wrong length?

Wed Apr 30 14:29:28 2014 The RT System itself - Status changed from 'new' to 'open'

Wed Apr 30 14:55:40 2014 michael.graziano [...] premierheart.com - Correspondence added

Subject:	Re: [rt.cpan.org #95214] DBD::Pg 3.x BYTEA issues with UTF-8 encoded database
Date:	Wed, 30 Apr 2014 14:55:23 -0400
To:	bug-DBD-Pg [...] rt.cpan.org
From:	Michael Graziano <michael.graziano [...] premierheart.com>

On Apr 30, 2014, at 2:29 PM, Greg Sabino Mullane via RT <bug-DBD-Pg@rt.cpan.org> wrote: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=95214 > > > That *should* work, sorry to hear it is not. Can you make the test case a little more self-contained so I can duplicate it (e.g. I don't know what $fn, $a, $s, $n, 0, $h, $cid) are set to. A slightly modified version worked for me, so ideally I would use the exact same data and table. Also, what error are you seeing / how do you know it's the wrong length?

I should be able to gimmick up a more friendly test case pretty easily - We initially went down a rabbit hole of “Uh, did perl break pack()?” so I’ve got half the test case already written and just need to tack on some live data and a test table (if I can’t get to it today I’ll bang on it tonight and should have something externally reproducible tomorrow). What we’re seeing in terms of corruption is definitely strange — we’re not seeing any errors on insert, but when we retrieve “file2” from the DB it’s coming back several bytes longer than it “should be” based on the pack() format (the thing that consumes this data is some grotty legacy code that among its sins uses file length to verify correct format, so it sees the file is oversized and bails). I’ve managed to rule out problems with the retrieval script (ruby) because data inserted with the 2.19 driver comes out OK so my current suspect is some strangeness in Perl’s internal utf-8 handling that’s bleeding over onto the bytea data (since we’re not doing anything special for utf-8 mode in our script my understanding is the 2.x driver would have just claimed ignorance and assumed the DB would sort it out).

Download signature.asc
application/pgp-signature 203b

Message body not shown because it is not plain text.

Wed Apr 30 16:01:04 2014 michael.graziano [...] premierheart.com - Correspondence added

Subject:	Re: [rt.cpan.org #95214] DBD::Pg 3.x BYTEA issues with UTF-8 encoded database
Date:	Wed, 30 Apr 2014 16:00:42 -0400
To:	bug-DBD-Pg [...] rt.cpan.org
From:	Michael Graziano <michael.graziano [...] premierheart.com>

Quicker than I thought — reproducibly broken test script and testing database. I tracked the issue in a little deeper: It seems to have to do with having the UTF-8 bit set on strings that came out of the database which we then wind up pack()ing and re-inserting in the bytea fields. (Perl seems to be tagging the resulting packed values as utf-8 strings — which makes sense to Perl, but not to Postgres in this context since we’re expecting to send it “raw binary (bytea) data” and perl is passing it along as “cooked UTF-8 string”).

Download p95214.sql
application/octet-stream 1.4k

Message body not shown because it is not plain text.

Message body is not shown because sender requested not to inline it.

On Apr 30, 2014, at 2:55 PM, Michael Graziano <michael.graziano@premierheart.com> wrote: Show quoted text

> > On Apr 30, 2014, at 2:29 PM, Greg Sabino Mullane via RT <bug-DBD-Pg@rt.cpan.org> wrote: >

>> <URL: https://rt.cpan.org/Ticket/Display.html?id=95214 > >> >> That *should* work, sorry to hear it is not. Can you make the test case a little more self-contained so I can duplicate it (e.g. I don't know what $fn, $a, $s, $n, 0, $h, $cid) are set to. A slightly modified version worked for me, so ideally I would use the exact same data and table. Also, what error are you seeing / how do you know it's the wrong length?

> > I should be able to gimmick up a more friendly test case pretty easily - We initially went down a rabbit hole of “Uh, did perl break pack()?” so I’ve got half the test case already written and just need to tack on some live data and a test table (if I can’t get to it today I’ll bang on it tonight and should have something externally reproducible tomorrow). > > What we’re seeing in terms of corruption is definitely strange — we’re not seeing any errors on insert, but when we retrieve “file2” from the DB it’s coming back several bytes longer than it “should be” based on the pack() format (the thing that consumes this data is some grotty legacy code that among its sins uses file length to verify correct format, so it sees the file is oversized and bails). > > I’ve managed to rule out problems with the retrieval script (ruby) because data inserted with the 2.19 driver comes out OK so my current suspect is some strangeness in Perl’s internal utf-8 handling that’s bleeding over onto the bytea data (since we’re not doing anything special for utf-8 mode in our script my understanding is the 2.x driver would have just claimed ignorance and assumed the DB would sort it out). >

Download signature.asc
application/pgp-signature 203b

Message body not shown because it is not plain text.

Fri May 02 19:29:30 2014 greg [...] turnstep.com - Correspondence added

DBD::Pg should definitely not be marking bytea fields as UTF-8. It's quite possible it is when it ought not to be. Unfortunately, I'm not sure what your test script is supposed to demonstrate. The only output is: File1 Size: 257 File2 Size: 33032

Mon May 05 15:39:49 2014 michael.graziano [...] premierheart.com - Correspondence added

Subject:	Re: [rt.cpan.org #95214] DBD::Pg 3.x BYTEA issues with UTF-8 encoded database
Date:	Mon, 5 May 2014 15:39:22 -0400
To:	bug-DBD-Pg [...] rt.cpan.org
From:	Michael Graziano <michael.graziano [...] premierheart.com>

Oops - left out an important bit that I was doing manually (getting the inserted sizes from the database) - Updated version which pulls the data lengths (octet_size()) from Postgres attached. Note also that there are two blocks for setting $pname (the string that seems to cause the misbehavior): The default operating mode (lines 24-29) gets this value out of the database (utf-8 encoded), so it results in a UTF-8 string that seems to upset things - you should get different sizes out of the database, similar to: File1 Size: 257 ; DB Size: 262 File2 Size: 33032 ; DB Size: 41034 If you comment out that block and uncomment line 35 (setting the value to an ASCII string), the code behaves as expected (File sizes and DB sizes match). (The script also behaves as expected if the database is created with SQL_ASCII encoding instead of UTF-8 - presumably because the $pname string isn’t getting tagged as utf-8 in that case). -MG

Message body is not shown because sender requested not to inline it.

On May 2, 2014, at 7:29 PM, Greg Sabino Mullane via RT <bug-DBD-Pg@rt.cpan.org> wrote: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=95214 > > > DBD::Pg should definitely not be marking bytea fields as UTF-8. It's quite possible it is when it ought not to be. Unfortunately, I'm not sure what your test script is supposed to demonstrate. The only output is: > > File1 Size: 257 > File2 Size: 33032 > >

Download signature.asc
application/pgp-signature 203b

Message body not shown because it is not plain text.

Thu May 08 11:27:36 2014 greg [...] turnstep.com - Correspondence added

I think I've whittled this down to a better test case. I suspect the problem lies in comparing len($string) with octet_length(string). The former is character based, the latter is byte based. Try this: use Data::Peek; $pname = "ABC"; utf8::upgrade($pname); $file1_contents = pack("aS", ## unsigned short $pname, 65535); ... print DPeek($file1_contents); print "\n"; print "File1 Size: $len ; DB Size: $res[0]\n"; This should be the same as your test case: ultimately it's the first string being UTF-8 plus the unsigned short packing that creates a string that is different. The output: PV("A\303\277\303\277"\0) [UTF8 "A\x{ff}\x{ff}"] File1 Size: 3 ; DB Size: 5 So both are correct: this is a three character string, but it has five bytes. If I'm understanding something wrong, please let me know; see if you can verify that the test case above still represents the problem you are seeing.

Tue May 27 10:29:20 2014 ilmari+cpan [...] ilmari.org - Correspondence added

On 2014-05-08 16:27:36, TURNSTEP wrote: Show quoted text

> I think I've whittled this down to a better test case. I suspect the > problem lies in comparing len($string) with octet_length(string). The > former is character based, the latter is byte based. Try this:

length() in perl and octet_length() in Postgres should be the same, since in Postgres they are the same for bytea, and in perl the code points are the only thing that matters semantically. Show quoted text

> use Data::Peek; > $pname = "ABC"; > utf8::upgrade($pname); > $file1_contents = pack("aS", ## unsigned short > $pname, 65535); > ... > print DPeek($file1_contents); print "\n"; > print "File1 Size: $len ; DB Size: $res[0]\n"; > > This should be the same as your test case: ultimately it's the first > string being UTF-8 plus the unsigned short packing that creates a > string that is different. The output: > PV("A\303\277\303\277"\0) [UTF8 "A\x{ff}\x{ff}"] > File1 Size: 3 ; DB Size: 5 > > So both are correct: this is a three character string, but it has five > bytes. > > If I'm understanding something wrong, please let me know; see if you > can verify that the test case above still represents the problem you > are seeing.

The problem is that DBD:Pg 3.[0-2] is sending the internal representation of the upgraded string ("A\303\277\303\277") to Pg, not the code points it represents ("A\x{ff}\x{ff}"). This is fixed in git master, by making sure values bound as BYTEA are properly downgraded: https://github.com/bucardo/dbdpg/commit/1212e3e47339dd4cf21d83f6ef71dbd9e6be3364

Wed May 28 06:45:32 2014 greg [...] turnstep.com - Correspondence added

Show quoted text

> This is fixed in git master, by making sure values bound as BYTEA are > properly downgraded: > https://github.com/bucardo/dbdpg/commit/1212e3e47339dd4cf21d83f6ef71dbd9e6be3364

Thanks, marking as patched.

Wed May 28 06:45:35 2014 greg [...] turnstep.com - Status changed from 'open' to 'patched'

Thu Jun 26 10:28:16 2014 greg [...] turnstep.com - Status changed from 'patched' to 'resolved'

Thu Jun 26 10:28:16 2014 greg [...] turnstep.com - Severity Critical added

Thu Jun 26 10:28:17 2014 greg [...] turnstep.com - Fixed in 3.3.0 added