Bug #40199 for DBD-Pg: Identify Other Types as UTF-8?

Mon Oct 20 12:38:29 2008 dwheeler [...] cpan.org - Ticket created

Subject:	Identify Other Types as UTF-8?
Date:	Mon, 20 Oct 2008 09:37:43 -0700
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

Howdy,

I need to be able to identify types other than the core text, varchar, and char types as UTF-8. For example, the citext module, due to be in contrib in 8.4, is a case-insensitive text type. I plan to use it a lot in my apps. Is it possible to tell DBD::Pg that it's UTF-8, and thus subject to whatever handling pg_enable_utf8 does?

If it's not, I could really use that feature -- I'd likely use it for domains and enums, too. Maybe something like this?

    pg_enable_utf8 => [qw(citext seasons)],

Here I'm passing in the names of the types to consider to be UTF-8: citext and an enum representing seasons of the year.

Thoughts?

Thanks,

David

Mon Oct 27 21:07:08 2008 greg [...] turnstep.com - Correspondence added

> pg_enable_utf8 => [qw(citext seasons)],
>
> Here I'm passing in the names of the types to consider to be UTF-8:
> citext and an enum representing seasons of the year.
>
> Thoughts?

That might be doable, but I'd like to see a more generic solution. Can we maybe assume that things are a string unless we know otherwise? Or just test everything for utf8ness regardless? Or just switch the utf8 flag on if the data is coming from a utf8 compatible database? Other thoughts welcome...

Mon Oct 27 21:07:09 2008 The RT System itself - Status changed from 'new' to 'open'

Mon Oct 27 21:17:10 2008 dwheeler [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Mon, 27 Oct 2008 18:16:56 -0700
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Oct 27, 2008, at 18:07, Greg Sabino Mullane via RT wrote:

> That might be doable, but I'd like to see a more generic solution. Can
> we maybe assume that things are a string unless we know otherwise?

Maybe.

> Or just test everything for utf8ness regardless?

No, you can have binary data that looks like utf8 but isn't.

> Or just switch the utf8 flag on if the data is coming from a utf8
> compatible database?

Yes, but again, it has to be known data types.

> Other thoughts welcome...

Can you tell when a data type is in the string category? That would help. Tom added the ability for types to declare their categories, and citext does that in 8.4:

http://www.archivum.info/pgsql.committers/2008-07/msg00333.html

Best,

David

Mon Oct 27 21:54:27 2008 greg [...] turnstep.com - Correspondence added

> > Or just test everything for utf8ness regardless?

> No, you can have binary data that looks like utf8 but isn't.

Right, but that should be a single known exception.

> Can you tell when a data type is in the string category? That would
> help. Tom added the ability for types to declare their categories,
> and citext does that in 8.4:

It's not clear if there is a system-catalog-level interface for that info, I'll check into it.

Mon Oct 27 23:14:06 2008 dwheeler [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Mon, 27 Oct 2008 20:13:53 -0700
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Oct 27, 2008, at 18:54, Greg Sabino Mullane via RT wrote:

>> No, you can have binary data that looks like utf8 but isn't.

> Right, but that should be a single known exception.

>> Can you tell when a data type is in the string category? That would
>> help. Tom added the ability for types to declare their categories,
>> and citext does that in 8.4:

> It's not clear if there is a system-catalog-level interface for that
> info, I'll check into it.

There's got to be a way for the database to tell the client what collation things are in. And I don't mean client_encoding, I mean the encoding for each piece of data that comes back.

If there isn't, they're going to have to add it for table- and column-level encoding and collation support, I should think.

Tue Dec 09 08:51:28 2008 andrew [...] tao11.riddles.org.uk - Correspondence added

On Mon Oct 27 23:14:06 2008, DWHEELER wrote:

> On Oct 27, 2008, at 18:54, Greg Sabino Mullane via RT wrote:

> >> No, you can have binary data that looks like utf8 but isn't.

> > Right, but that should be a single known exception.

> >> Can you tell when a data type is in the string category? That would
> >> help. Tom added the ability for types to declare their categories,
> >> and citext does that in 8.4:

> > It's not clear if there is a system-catalog-level interface for that
> > info, I'll check into it.

> There's got to be a way for the database to tell the client what
> collation things are in. And I don't mean client_encoding, I mean the
> encoding for each piece of data that comes back.
>
> If there isn't, they're going to have to add it for table- and column-level
> encoding and collation support, I should think.

You're completely over-thinking this.

Since you're not requesting binary-format results, then what you get back from the server is in text format.

If the result is in text format, AND the value of client_encoding is 'UTF8', then the content is expected to be UTF8, REGARDLESS OF TYPE (yes, even for bytea, since it's escaped when you get it from the server). So the utf8 flag should be set in ALL such cases except when you've applied type-specific transformations to the result (such as in dequote_bytea).

The code should absolutely not make assumptions about types.

(If column-level encoding ever gets added, which is not happening in the foreseeable future, then there'll have to be a function added somewhere that returns the column encoding, which you'd check instead of client_encoding.)
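The rule described in this reply can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `should_flag_utf8` and its three parameters are hypothetical names, not part of DBD::Pg, and the exact string comparison against "UTF8" assumes the encoding name is reported in that spelling.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not DBD::Pg's actual code: decide whether a fetched
 * value should get Perl's utf8 flag under the rule described above. */
bool should_flag_utf8(const char *client_encoding,
                      bool text_format,
                      bool type_transformed)
{
    assert(client_encoding != NULL);

    /* Binary-format results are not covered by client_encoding. */
    if (!text_format)
        return false;

    /* A type-specific transformation (for example unescaping bytea) may
     * have produced arbitrary bytes, so the value is no longer known to
     * be UTF-8. */
    if (type_transformed)
        return false;

    /* Otherwise the column type is irrelevant: only the encoding matters. */
    return strcmp(client_encoding, "UTF8") == 0;
}
```

Note that the function never looks at a type OID; the only type-dependent input is whether the driver itself already rewrote the bytes.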

Tue Dec 09 17:32:51 2008 david [...] kineticode.com - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 9 Dec 2008 23:32:29 +0100
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <david [...] kineticode.com>

On Dec 9, 2008, at 2:51 PM, Andrew P. J. Gierth via RT wrote:

> You're completely over-thinking this.
>
> Since you're not requesting binary-format results, then what you get
> back from the server is in text format.
>
> If the result is in text format, AND the value of client_encoding is
> 'UTF8', then the content is expected to be UTF8, REGARDLESS OF TYPE
> (yes, even for bytea, since it's escaped when you get it from the
> server). So the utf8 flag should be set in ALL such cases except when
> you've applied type-specific transformations to the result (such as in
> dequote_bytea).

Well, this convinces me that I should ensure that client_encoding is always utf8. But what about for those clients where it's not? Maybe DBD::Pg should enforce that when pg_utf8 is true?

Thanks for the details, Andrew.

Best,

David

Tue Dec 09 19:52:11 2008 andrew [...] tao11.riddles.org.uk - Correspondence added

On Tue Dec 09 17:32:51 2008, david@kineticode.com wrote:

> On Dec 9, 2008, at 2:51 PM, Andrew P. J. Gierth via RT wrote:

> > You're completely over-thinking this.
> >
> > Since you're not requesting binary-format results, then what you get
> > back from the server is in text format.
> >
> > If the result is in text format, AND the value of client_encoding is
> > 'UTF8', then the content is expected to be UTF8, REGARDLESS OF TYPE
> > (yes, even for bytea, since it's escaped when you get it from the
> > server). So the utf8 flag should be set in ALL such cases except when
> > you've applied type-specific transformations to the result (such as in
> > dequote_bytea).

> Well, this convinces me that I should ensure that client_encoding is
> always utf8. But what about for those clients where it's not? Maybe
> DBD::Pg should enforce that when pg_utf8 is true?

Two options:

1) if client_encoding is not UTF8, then pg_enable_utf8 should do nothing

2) if pg_enable_utf8 is set in the options to connect(), then perhaps it should specifically _request_ client_encoding=UTF8? (The only case where this would be undesirable is if the server_encoding is SQL_ASCII; the code would have to detect that.)
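Option 2 comes down to one decision at connect time, which might look something like the C sketch below. The function name `should_request_utf8` and its parameters are hypothetical and used only for illustration; how the driver would learn server_encoding, and how it would then issue the request, are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not DBD::Pg code: should connect() ask the server
 * for client_encoding=UTF8? */
bool should_request_utf8(bool pg_enable_utf8, const char *server_encoding)
{
    assert(server_encoding != NULL);

    /* The user did not ask for UTF-8 handling: leave client_encoding alone. */
    if (!pg_enable_utf8)
        return false;

    /* A SQL_ASCII server is the one case called out above as undesirable,
     * so the request is skipped there. */
    return strcmp(server_encoding, "SQL_ASCII") != 0;
}
```

Under option 1, by contrast, nothing would be requested and the flag logic would simply be skipped whenever client_encoding is not UTF8.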

Wed Dec 10 01:12:16 2008 david [...] kineticode.com - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Wed, 10 Dec 2008 07:11:42 +0100
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <david [...] kineticode.com>

On Dec 10, 2008, at 1:52 AM, Andrew Gierth via RT wrote:

>> Well, this convinces me that I should ensure that client_encoding is
>> always utf8. But what about for those clients where it's not? Maybe
>> DBD::Pg should enforce that when pg_utf8 is true?

> Two options:
>
> 1) if client_encoding is not UTF8, then pg_enable_utf8 should do
> nothing
>
> 2) if pg_enable_utf8 is set in the options to connect(), then perhaps it
> should specifically _request_ client_encoding=UTF8 ? (the only case
> where this would be undesirable is if the server_encoding is SQL_ASCII;
> the code would have to detect that.)

Almost sounds as if we don't need pg_enable_utf8 at all…

Best,

David

Wed Dec 10 01:40:10 2008 andrew [...] tao11.riddles.org.uk - Correspondence added

On Wed Dec 10 01:12:16 2008, david@kineticode.com wrote:

> Almost sounds as if we don't need pg_enable_utf8 at all…

I'm not at all sure it _is_ needed.

Sun Jan 25 19:36:06 2009 greg [...] turnstep.com - Severity Important added

Sat Sep 26 05:32:22 2009 yorhel [...] cpan.org - Correspondence added

Just wanted to note that pg_enable_utf8 doesn't work for the xml data type added in the 8.3 core, either, even though, to my knowledge, it works exactly the same as the other text data types.

Fri Oct 09 13:45:23 2009 rod.taylor [...] gmail.com - Correspondence added

On Wed Dec 10 01:40:10 2008, AGIERTH wrote:

> On Wed Dec 10 01:12:16 2008, david@kineticode.com wrote:

> > Almost sounds as if we don't need pg_enable_utf8 at all…

> I'm not at all sure it _is_ needed.

Basing it on the database encoding (client_encoding) and applying to all strings from the database works for me.

I'm now running with the attached patch which essentially removes the switch around application of the SvUTF8_on flag.

My database includes database arrays, lots of bytea (html pages, pdfs, jpegs, sound clips, etc.). Everything seems to work okay.

I did this because converting a few fields to citext broke in the display as this flag was not being enabled.

*** dbdimp.c.orig Fri Oct 9 13:32:54 2009
--- dbdimp.c Fri Oct 9 13:33:25 2009
***************
*** 3410,3426 ****
  #ifdef is_utf8_string
      if (imp_dbh->pg_enable_utf8 && type_info) {
          SvUTF8_off(sv);
!         switch (type_info->type_id) {
!         case PG_CHAR:
!         case PG_TEXT:
!         case PG_BPCHAR:
!         case PG_VARCHAR:
!             if (is_high_bit_set(aTHX_ value, value_len) && is_utf8_string((unsigned char*)value, value_len)) {
!                 SvUTF8_on(sv);
!             }
!             break;
!         default:
!             break;
          }
      }
  #endif
--- 3410,3417 ----
  #ifdef is_utf8_string
      if (imp_dbh->pg_enable_utf8 && type_info) {
          SvUTF8_off(sv);
!         if (is_high_bit_set(aTHX_ value, value_len) && is_utf8_string((unsigned char*)value, value_len)) {
!             SvUTF8_on(sv);
          }
      }
  #endif
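For readers unfamiliar with the two checks the patch keeps, here is a standalone C sketch of the same idea: only flag a value when it contains at least one byte above 0x7F and the whole buffer is well-formed UTF-8. The functions `has_high_bit`, `is_valid_utf8`, and `looks_like_utf8_text` are hypothetical stand-ins written for this illustration; they are not the `is_high_bit_set` or `is_utf8_string` used in dbdimp.c, and this validator is assumed to be stricter than Perl's (it rejects surrogates and code points above U+10FFFF).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if any byte is >= 0x80, i.e. the data is not pure ASCII. */
bool has_high_bit(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        if (s[i] & 0x80)
            return true;
    return false;
}

/* Strict well-formedness check: rejects stray continuation bytes,
 * truncated sequences, overlong forms, surrogates, and > U+10FFFF. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;          /* number of continuation bytes expected */
        unsigned long cp;  /* code point being decoded */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
        else return false;                  /* invalid lead byte */

        if (len - i - 1 < n)
            return false;                   /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;               /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000))
            return false;                   /* overlong encoding */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                   /* out of range or surrogate */
        i += n + 1;
    }
    return true;
}

/* Combined test, mirroring the condition in the patch above. */
bool looks_like_utf8_text(const unsigned char *s, size_t len)
{
    return has_high_bit(s, len) && is_valid_utf8(s, len);
}
```

A check like this cannot tell text from binary data that happens to be well-formed UTF-8, which is the objection raised earlier in the thread about testing everything for utf8ness.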

Fri Oct 09 13:52:23 2009 dwheeler [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Fri, 9 Oct 2009 10:51:59 -0700
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Oct 9, 2009, at 10:45 AM, Rod Taylor via RT wrote:

> Basing it on the database encoding (client_encoding) and applying to all
> strings from the database works for me.

+1

> I'm now running with the attached patch which essentially removes the
> switch around application of the SvUTF8_on flag.
>
> My database includes database arrays, lots of bytea (html pages, pdfs,
> jpegs, sound clips, etc.). Everything seems to work okay.
>
> I did this because converting a few fields to citext broke in the
> display as this flag was not being enabled.

Well, whatever comes back from the database should be converted from client_encoding to Perl's internal format (utf8, not to be confused with UTF-8). That way you get proper character semantics when doing things to multibyte strings (e.g., split).

Your patch works if your database is UTF-8, but lots of folks use other database formats and/or client_encodings. DBD::Pg should just do the right thing by them, IMHO, which means passing everything through Encode::decode() (before unescaping binary data, I guess).

Best,

David

Fri Dec 18 08:54:54 2009 http://mawic.myopenid.com/ - Correspondence added

Please don't return utf8-flagged data to applications that haven't asked for it. Apps that do their own (unconditional) decoding will break if the utf8 flag is turned on unexpectedly and strings get garbled by double-decoding.

Sun Sep 05 11:28:04 2010 DROLSKY [...] cpan.org - Correspondence added

This problem is biting me now too, because I'm working on an app that uses the citext type (Silki on CPAN). I really like the idea of simply enabling utf-8 on all high-bit data (except for BYTEA columns) if the client encoding has been set to UTF8. For now, I'm going to have to do some app-level hacks, which is quite annoying.

Tue Sep 14 11:12:48 2010 DROLSKY [...] cpan.org - Cc DROLSKY added

Tue Nov 23 16:03:37 2010 dwheeler [...] cpan.org - Given to TURNSTEP

Tue Nov 23 16:12:18 2010 greg [...] turnstep.com - Correspondence added

Put some quick hacks into svn: please try it out!

Tue Nov 23 16:28:55 2010 DROLSKY [...] cpan.org - Correspondence added

On Tue Nov 23 16:12:18 2010, greg@turnstep.com wrote:

> Put some quick hacks into svn: please try it out!

I haven't tried the code but it doesn't look right. I don't think you want to turn on the utf8 flag for BYTEA columns _ever_, and you probably shouldn't turn it on for columns that just contain ASCII data (no high bit).

Tue Nov 23 16:32:16 2010 david [...] kineticode.com - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 23 Nov 2010 13:32:03 -0800
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <david [...] kineticode.com>

On Nov 23, 2010, at 1:28 PM, Dave Rolsky via RT wrote:

>> Put some quick hacks into svn: please try it out!

> I haven't tried the code but it doesn't look right. I don't think you
> want to turn on the utf8 flag for BYTEA columns _ever_, and you probably
> shouldn't turn it on for columns that just contain ASCII data (no high bit).

And this would explain the current test failures.

David

Tue Nov 30 11:59:11 2010 greg [...] turnstep.com - Correspondence added

On Tue Nov 23 16:28:55 2010, DROLSKY wrote:

> On Tue Nov 23 16:12:18 2010, greg@turnstep.com wrote:

> > Put some quick hacks into svn: please try it out!

> I haven't tried the code but it doesn't look right. I don't think you
> want to turn on the utf8 flag for BYTEA columns _ever_, and you probably
> shouldn't turn it on for columns that just contain ASCII data (no high bit).

See the comments from Andrew above on why this should be okay for bytea columns. I'm also not sure of the harm in having it on for all data returned from a UTF-8 database.

Tue Nov 30 12:09:36 2010 dwheeler [...] cpan.org - Correspondence added

CC:	DROLSKY [...] cpan.org
Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 30 Nov 2010 09:09:27 -0800
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Nov 30, 2010, at 8:59 AM, Greg Sabino Mullane via RT wrote:

> See the comments from Andrew above on why this should be okay for bytea
> columns. I'm also not sure of the harm in having it on for all data
> returned from a UTF-8 database.

You should be okay treating bytea as utf-8 when you get it back from the server, but a bytea value should *not* have the utf8 flag set.

Best,

David

Tue Nov 30 12:11:27 2010 DROLSKY [...] cpan.org - Correspondence added

On Tue Nov 30 11:59:11 2010, greg@turnstep.com wrote:

> On Tue Nov 23 16:28:55 2010, DROLSKY wrote:

> > On Tue Nov 23 16:12:18 2010, greg@turnstep.com wrote:

> > > Put some quick hacks into svn: please try it out!

> > I haven't tried the code but it doesn't look right. I don't think you
> > want to turn on the utf8 flag for BYTEA columns _ever_, and you probably
> > shouldn't turn it on for columns that just contain ASCII data (no high bit).
>
> See the comments from Andrew above on why this should be okay for bytea
> columns. I'm also not sure of the harm in having it on for all data
> returned from a UTF-8 database.

To add to what David Wheeler said, imagine that I'm storing the binary representation of an image in a BYTEA column. If that's returned from the database in a marked-as-utf8 string, that means that to save it to the filesystem, I have to encode the output, which just makes no sense. Binary data should be binary. At the Perl level, that means _not_ marking it as utf8. Are people storing text in BYTEA columns? That doesn't make much sense when we have the TEXT type.

Tue Nov 30 12:23:02 2010 greg [...] turnstep.com - Correspondence added

Right, I think we're all in sync, just not communicating perfectly :). Strings, *once decoded*, should be marked as utf-8 only if not bytea. That part is not written yet, of course.

Tue Nov 30 12:26:33 2010 dwheeler [...] cpan.org - Correspondence added

CC:	DROLSKY [...] cpan.org
Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 30 Nov 2010 09:26:25 -0800
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Nov 30, 2010, at 9:23 AM, Greg Sabino Mullane via RT wrote:

> Right, I think we're all in sync, just not communicating perfectly :).
> Strings, *once decoded*, should be marked as utf-8 only if not bytea.
> That part is not written yet, of course.

Okay, good, that sounds right.

Thanks,

David

Tue Nov 30 12:28:13 2010 autarch [...] urth.org - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 30 Nov 2010 11:28:05 -0600 (CST)
To:	Greg Sabino Mullane via RT <bug-DBD-Pg [...] rt.cpan.org>
From:	Dave Rolsky <autarch [...] urth.org>

On Tue, 30 Nov 2010, Greg Sabino Mullane via RT wrote:

> Right, I think we're all in sync, just not communicating perfectly :).
> Strings, *once decoded*, should be marked as utf-8 only if not bytea.
> That part is not written yet, of course.

Yes, that makes sense.

-dave

Tue Nov 30 14:05:52 2010 david [...] kineticode.com - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 30 Nov 2010 11:02:48 -0800
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <david [...] kineticode.com>

Looks better with this patch. I now get only one test failure:

    not ok 52 - ASCII text returned from database does not have utf8 bit set

    # Failed test 'ASCII text returned from database does not have utf8 bit set'
    # at t/02attribs.t line 445.

Not sure there's any harm in leaving it on for ASCII strings, though. Thoughts?

David

Begin forwarded message:

> From: turnstep@cvs.perl.org
> Date: November 30, 2010 10:47:24 AM PST
> To: svn-commit-modules-DBD-Pg@perl.org
> Subject: [svn:DBD-Pg] r14556 - DBD-Pg/trunk
>
> Author: turnstep
> Date: Tue Nov 30 10:47:24 2010
> New Revision: 14556
>
> Modified:
>    DBD-Pg/trunk/dbdimp.c
>
> Log:
> Quick attempt at refinining the utf8 logic.
> May make no sense: I have a head cold and am lacking sleep. :)
>
>
> Modified: DBD-Pg/trunk/dbdimp.c
> ==============================================================================
> --- DBD-Pg/trunk/dbdimp.c (original)
> +++ DBD-Pg/trunk/dbdimp.c Tue Nov 30 10:47:24 2010
> @@ -3370,15 +3370,20 @@
>
>      for (i = 0; i < num_fields; ++i) {
>          SV *sv;
> +        int can_be_utf8;
>
>          if (TRACE5)
>              TRC(DBILOGFP, "%sFetching field #%d\n", THEADER, i);
>
>          sv = AvARRAY(av)[i];
>
> +        /* Only mark as utf8 if the type supports it (or is unknown) */
> +        can_be_utf8 = DBDPG_TRUE;
> +
>          TRACE_PQGETISNULL;
>          if (PQgetisnull(imp_sth->result, imp_sth->cur_tuple, i)!=0) {
>              SvROK(sv) ? (void)sv_unref(sv) : (void)SvOK_off(sv);
> +            can_be_utf8 = DBDPG_FALSE;
>          }
>          else {
>              TRACE_PQGETVALUE;
> @@ -3397,6 +3402,7 @@
>                  /* For certain types, we can cast to non-string Perlish values */
>                  switch (type_info->type_id) {
>                  case PG_BOOL:
> +                    can_be_utf8 = DBDPG_FALSE;
>                      if (imp_dbh->pg_bool_tf) {
>                          *value = ('1' == *value) ? 't' : 'f';
>                          sv_setpvn(sv, (char *)value, value_len);
> @@ -3407,10 +3413,15 @@
>                  case PG_OID:
>                  case PG_INT4:
>                  case PG_INT2:
> +                    can_be_utf8 = DBDPG_FALSE;
>                      sv_setiv(sv, atol((char *)value));
>                      break;
> +                case PG_BYTEA:
> +                    /* Here solely to ensure it does not get set to utf8 */
> +                    can_be_utf8 = DBDPG_FALSE;
>                  default:
>                      sv_setpvn(sv, (char *)value, value_len);
> +                    /* None of the above need to be utf8 */
>                  }
>              }
>              else {
> @@ -3430,7 +3441,7 @@
>                  }
>              }
>  #ifdef is_utf8_string
> -            if (imp_dbh->is_utf8) {
> +            if (imp_dbh->is_utf8 && can_be_utf8) {
>                  SvUTF8_on(sv);
>              }
>  #endif

Tue Nov 30 14:13:30 2010 autarch [...] urth.org - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 30 Nov 2010 13:13:15 -0600 (CST)
To:	David Wheeler via RT <bug-DBD-Pg [...] rt.cpan.org>
From:	Dave Rolsky <autarch [...] urth.org>

On Tue, 30 Nov 2010, David Wheeler via RT wrote:

> <URL: http://rt.cpan.org/Ticket/Display.html?id=40199 >
>
> Looks better with this patch. I now get only one test failure:
>
> not ok 52 - ASCII text returned from database does not have utf8 bit set
>
> # Failed test 'ASCII text returned from database does not have utf8 bit set'
> # at t/02attribs.t line 445.
>
> Not sure there's any harm in leaving it on for ASCII strings, though. Thoughts?

In theory, this shouldn't really matter. I think the only real issue is for Latin-X characters that _aren't_ in ASCII, where turning on the utf8 flag changes their representation so they're multibyte chars, unlike the Latin-X representation.

Of course, the other issue is that Perl's utf8 flag can end up leaking out by causing operations that combine strings to upgrade them all to utf8 if any one of them is.

So all that said, it'd be more conservative to not turn on the flag for pure ASCII data, but I'm pretty sure any sane program won't be affected one way or the other.

-dave

/*============================================================
http://VegGuide.org               http://blog.urth.org
Your guide to all that's veg      House Absolute(ly Pointless)
============================================================*/

Tue Nov 30 14:18:21 2010 david [...] kineticode.com - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 30 Nov 2010 11:18:12 -0800
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <david [...] kineticode.com>

On Nov 30, 2010, at 11:13 AM, autarch@urth.org via RT wrote:

> In theory, this shouldn't really matter. I think the only real issue is
> for Latin-X characters that _aren't_ in ASCII, where turning on the utf8
> flag changes their representation so they're multibyte chars, unlike the
> Latin-X representation.

No such characters should come back from PostgreSQL anyway, if its encoding is utf-8. SQL-ASCII might mess things up, though, not sure.

> Of course, the other issue is that Perl's utf8 flag can end up leaking out
> by causing operations that combine strings to upgrade them all to utf8 if
> any one of them is.
>
> So all that said, it'd be more conservative to not turn on the flag for
> pure ASCII data, but I'm pretty sure any sane program won't be affected
> one way or the other.

I wonder if anyone has benchmarked it. Because otherwise we'd have to check to see if a string has only ascii characters and if so, *not* turn it on. Right? Maybe Perl has a C function or macro that does that?

David

Tue Nov 30 14:21:56 2010 autarch [...] urth.org - Correspondence added

CC:	dwheeler [...] cpan.org, DROLSKY [...] cpan.org
Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 30 Nov 2010 13:21:48 -0600 (CST)
To:	David Wheeler via RT <bug-DBD-Pg [...] rt.cpan.org>
From:	Dave Rolsky <autarch [...] urth.org>

On Tue, 30 Nov 2010, David Wheeler via RT wrote:

>> Of course, the other issue is that Perl's utf8 flag can end up leaking out
>> by causing operations that combine strings to upgrade them all to utf8 if
>> any one of them is.
>>
>> So all that said, it'd be more conservative to not turn on the flag for
>> pure ASCII data, but I'm pretty sure any sane program won't be affected
>> one way or the other.

> I wonder if anyone has benchmarked it. Because otherwise we'd have to
> check to see if a string has only ascii characters and if so, *not* turn
> it on. Right? Maybe Perl has a C function or macro that does that?

I don't think this is a performance issue, it's a behavior issue. The way Perl can upgrade a string "behind one's back" can be confusing, although in theory this should be totally transparent.

I think this only arises when someone has a string containing utf8 that is _not_ marked with the utf8 flag, and then trying to upgrade it causes a mess. That's not a "sane program" in my definition, since if you're working with utf8 data you need to make sure Perl knows that you're doing so, or you risk exactly this problem.

-dave

/*============================================================
http://VegGuide.org               http://blog.urth.org
Your guide to all that's veg      House Absolute(ly Pointless)
============================================================*/

Tue Nov 30 15:49:29 2010 greg [...] turnstep.com - Correspondence added

Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 30 Nov 2010 20:49:21 -0000
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"Greg Sabino Mullane" <greg [...] turnstep.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

> In theory, this shouldn't really matter. I think the only real issue is
> for Latin-X characters that _aren't_ in ASCII, where turning on the utf8
> flag changes their representation so they're multibyte chars, unlike the
> Latin-X representation.

That won't matter anyway, as Postgres will expect things to be utf8 as well.

> Of course, the other issue is that Perl's utf8 flag can end up leaking out
> by causing operations that combine strings to upgrade them all to utf8 if
> any one of them is.

I'm not very concerned about this. If you query a UTF8 database, you should be prepared to deal with UTF8 output. :)

> So all that said, it'd be more conservative to not turn on the flag for
> pure ASCII data, but I'm pretty sure any sane program won't be affected
> one way or the other.

Eh, I'd rather not have to test each string as it comes out of the database. It's extra code to maintain, it's inefficient, it's potentially error prone, and it should not even be needed because we can simply trust the database to determine if the output is potentially utf8 or not.

- --
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201011301548
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkz1YzcACgkQvJuQZxSWSsiiJQCgxWRzz2AQ5QNeKgmpA73WpP4M
RhUAniMAQZJSBkgpfFWRBRTBHPybF1+m
=t5Ie
-----END PGP SIGNATURE-----

Thu Feb 17 04:40:45 2011 beldmit [...] gmail.com - Ticket #65819: Ticket created

Subject:	Missing hstore support for DBD::Pg
Date:	Thu, 17 Feb 2011 12:40:29 +0300
To:	bug-DBD-Pg [...] rt.cpan.org
From:	Dmitry Belyavsky <beldmit [...] gmail.com>

Greetings!

There are problems using DBD::Pg when working with the HSTORE datatype in PostgreSQL 8.x/9.0. The database has the utf8 character set, but the hstore data doesn't have the utf8 flag on it.

Also, there is no native support for the HSTORE type similar to that for the array type.

Thank you!

--
SY, Dmitry Belyavsky

Thu Feb 17 11:27:36 2011 dwheeler [...] cpan.org - Ticket #65819: Merged into ticket #40199

Thu Feb 17 11:27:36 2011 dwheeler [...] cpan.org - Merged into ticket #40199

Thu Feb 17 11:29:25 2011 dwheeler [...] cpan.org - Correspondence added

On Thu Feb 17 04:40:45 2011, beldmit@gmail.com wrote:

> There are problems in using DBD::Pg when work with HSTORE datatype in
> PostgreSQL 8.x/9.0.
> The database has the utf8 character set, but the hstore data doesn't
> have the utf8 flag on it.

Hi, I merged this report into the existing report for this issue. A solution is currently in development.

> Also there is no native support for HSTORE type similar to array type.

That won't happen unless and until the HSTORE type becomes a core data type. In the meantime, it's pretty easy to parse an HSTORE value into a hash. See:

http://www.depesz.com/index.php/2008/10/04/deserialization-of-hstore-data-structure-in-perl/

Best,

David

Thu Feb 17 12:52:45 2011 beldmit [...] gmail.com - Correspondence added

CC:	DROLSKY [...] cpan.org
Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Thu, 17 Feb 2011 20:52:35 +0300
To:	bug-DBD-Pg [...] rt.cpan.org
From:	Dmitry Belyavsky <beldmit [...] gmail.com>

Greetings!

On Thu, Feb 17, 2011 at 7:29 PM, David Wheeler via RT <bug-DBD-Pg@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=40199 >
>
> On Thu Feb 17 04:40:45 2011, beldmit@gmail.com wrote:

>> There are problems in using DBD::Pg when work with HSTORE datatype in
>> PostgreSQL 8.x/9.0.
>> The database has the utf8 character set, but the hstore data doesn't
>> have the utf8 flag on it.

> Hi, I merged this report into the existing report for this issue. A solution is currently in
> development

>> Also there is no native support for HSTORE type similar to array type.

> That won't happen unless and until the HSTORE type becomes a core data type. In the
> meantime, It's pretty easy to parse an HSTORE value into a hash. See:
>
> http://www.depesz.com/index.php/2008/10/04/deserialization-of-hstore-data-structure-in-perl/

Well, parsing is not enough; it would also be useful to store an HSTORE field from a hashref. This requires at least correct quoting.

Thank you!

--
SY, Dmitry Belyavsky

Mon Jun 04 09:55:44 2012 nine [...] detonation.org - Correspondence added

Just hit this bug today with enum columns which contain UTF-8 encoded values. I independently came upon the following patch which passes DBD::Pg's test suite and our own and fixes the issue:

diff -aur DBD-Pg-2.19.2/dbdimp.c DBD-Pg-2.19.2.fixed/dbdimp.c
--- DBD-Pg-2.19.2/dbdimp.c 2012-03-12 21:35:33.000000000 +0100
+++ DBD-Pg-2.19.2.fixed/dbdimp.c 2012-06-04 15:53:29.179252348 +0200
@@ -3484,17 +3484,8 @@
 #ifdef is_utf8_string
     if (imp_dbh->pg_enable_utf8 && type_info) {
         SvUTF8_off(sv);
-        switch (type_info->type_id) {
-        case PG_CHAR:
-        case PG_TEXT:
-        case PG_BPCHAR:
-        case PG_VARCHAR:
-            if (is_high_bit_set(aTHX_ value, value_len) && is_utf8_string((unsigned char*)value, value_len)) {
-                SvUTF8_on(sv);
-            }
-            break;
-        default:
-            break;
+        if (is_high_bit_set(aTHX_ value, value_len) && is_utf8_string((unsigned char*)value, value_len)) {
+            SvUTF8_on(sv);
         }
     }
 #endif

Mon Jun 04 09:56:45 2012 nine [...] detonation.org - Requestor NINE added

Tue Aug 21 13:22:09 2012 dwheeler [...] cpan.org - Correspondence added

I know quite a bit of work had been done to improve the encoding support in DBD::Pg, as well as a ton of discussion.

http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html

I'm just wondering what the status is? Have we agreed to an approach? What will it take to get it done?

Thanks!

David

Tue Aug 21 13:26:42 2012 greg [...] turnstep.com - Correspondence added

I think we have a rough consensus, I (we?) just need some tuits. I think we are going with a fairly simple approach. First, we always check the server encoding. If it's SQL_ASCII, we don't do much of anything. If it's (most) anything else, we flip the utf-8 bit on for our returned strings, unless people explicitly tell us not to. I think one sticking point was what, if anything, we should do to the data on the way in. Hmmmm, I think I need to revisit the code and discussion for a refresh before going any further... :)

Tue Aug 21 13:36:38 2012 dwheeler [...] cpan.org - Correspondence added

CC:	"Martin J. Evans" <martin.evans [...] easysoft.com>
Subject:	Re: [rt.cpan.org #40199] Identify Other Types as UTF-8?
Date:	Tue, 21 Aug 2012 10:36:27 -0700
To:	bug-DBD-Pg [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Aug 21, 2012, at 10:26 AM, Greg Sabino Mullane via RT wrote:

> Hmmmm, I think I need to revisit the code and discussion for a refresh
> before going any further... :)

Yeah. There was also the discussion on the DBI list.

http://www.nntp.perl.org/group/perl.dbi.dev/2011/09/msg6635.html

I think Tim decided what he wanted to do, but I am still not sure what it is. Martin, IIRC you talked to Tim about it quite a bit. Can you spell it out?

Thanks,

David

Tue Jul 02 00:44:26 2013 greg [...] turnstep.com - Correspondence added

Another attempt at d18362ff49c8be9c8e50c681844a58ef53ad4868

This tries to do the right thing by default (client_encoding UTF8 means the utf8 flag flipped on), and uses the old pg_enable_utf8 to force things on or off, with a recommendation not to use it at all.
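The behaviour described in this comment amounts to a three-way setting, which could be sketched in C as below. This is an illustration only: the `utf8_mode` enum, its member names, and `flag_strings_as_utf8` are hypothetical and do not reflect how the commit stores or names the setting. The sketch also leaves out the bytea exclusion agreed on earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical representation of the pg_enable_utf8 setting. */
typedef enum {
    UTF8_AUTO,       /* attribute not set: follow client_encoding */
    UTF8_FORCE_OFF,  /* attribute used to force the flag off */
    UTF8_FORCE_ON    /* attribute used to force the flag on */
} utf8_mode;

/* Decide whether returned strings get the utf8 flag. */
bool flag_strings_as_utf8(utf8_mode mode, const char *client_encoding)
{
    assert(client_encoding != NULL);

    switch (mode) {
    case UTF8_FORCE_ON:
        return true;
    case UTF8_FORCE_OFF:
        return false;
    case UTF8_AUTO:
        break;
    }
    /* Default: a UTF8 client_encoding means the flag is flipped on. */
    return strcmp(client_encoding, "UTF8") == 0;
}
```

With the recommended usage (attribute left unset), only the last line matters, so the result depends entirely on client_encoding.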

Mon Sep 30 17:15:52 2013 dwheeler [...] cpan.org - Status changed from 'open' to 'patched'

Wed Feb 05 22:20:48 2014 greg [...] turnstep.com - Status changed from 'patched' to 'resolved'

Wed Feb 05 22:20:49 2014 greg [...] turnstep.com - Fixed in 3.0.0 added