Bug #70819 for DBD-Oracle: AL32UTF8 Strings Should be Decoded, not just Flagged utf8

Thu Sep 08 16:31:08 2011 dwheeler [...] cpan.org - Ticket created

Subject:	AL32UTF8 Strings Should be Decoded, not just Flagged utf8
Date:	Thu, 8 Sep 2011 13:30:55 -0700
To:	bug-DBD-Oracle [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

The DBD::Oracle docs state: Show quoted text

> Any data returned from Oracle to DBD::Oracle in the AL32UTF8 character set will be marked as UTF-8 to ensure correct handling by Perl. > > For Oracle to return data in the AL32UTF8 character set the NLS_LANG or NLS_NCHAR environment variable must be set as described in the previous section.

I've recently written an app that connects to an Oracle 9 server with code like this: BEGIN { $ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8'; } my $dbh = DBI->connect("dbi:Oracle:$sid", $username, $password, { RaiseError => 1, PrintError => 0, AutoCommit => 0, }); With this pattern, I text data fetched from the database does in fact have the utf8 flag enabled on the appropriate values. However, not all text data returned from the database is in fact UTF-8. So yes, I've told DBD::Oracle that the data should be UTF-8, but Oracle itself sometimes has other opinions. The upshot is that I can end up with text values with the utf8 flag enabled but that contain invalid UTF-8. I suspect that this is because DBD::Oracle simply turns on that flag, rather than using Perl's Unicode API to *decode* the text. In other words, I think that DBD::Oracle does the equivalent of this in C: Encode::_utf8_on($value); When what it should be doing is something like this: $value = Encode::decode('UTF-8', $value, Encode::FB_CROAK); This would ensure that the data was, in fact, UTF-8, and would die if it was not. It also would be better because Perl's internal representation, known as "utf8", is not the same as UTF-8 (it's more relaxed about what it considers UTF-8). More details on Unicode gotchas are cataloged here: http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129 FWIW, I'm currently working around this bug by turning off the utf8 flag and decoding myself: my @encodings = map { find_encoding $_ } qw(utf-8 cp1252 iso-8859-1); if (defined $value) { # DBD::Oracle turns on utf8 without validating. Encode::_utf8_off($value); # We need to decode the value. my $found = 0; for my $enc (@encodings) { local $@; eval { $value = $enc->decode($value, Encode::FB_CROAK) }; next if $@; $found = 1; last; } die "Cannot decode $value" unless $found; } Fortunately our strings are broken in particular ways (encoded as CP1252 or latin-1 when not properly UTF-8) so this is fine for me. But I do think that naïvely turning on the utf8 flag is not the proper way to handle UTF-8 values fetched from the database. Best, David

Thu Sep 08 17:09:19 2011 champoux [...] pythian.com - Correspondence added

Hi David, On Thu Sep 08 16:31:08 2011, DWHEELER wrote: [...] Show quoted text

> In other words, I think that DBD::Oracle does the equivalent of this > in C: > > Encode::_utf8_on($value); > > When what it should be doing is something like this: > > $value = Encode::decode('UTF-8', $value, Encode::FB_CROAK);

Gotcha. At first read, it makes sense. I began to look at the code that seems to be doing that (around line 454 of Oracle.xs), but I'll need more coffee and sacrifices to the dark gods before I have a clear view of what to do. I'll try to keep the CR ticket updated with my findings. Wish me luck. :-) `/anick

Thu Sep 08 17:09:20 2011 The RT System itself - Status changed from 'new' to 'open'

Fri Sep 09 04:29:47 2011 Tim.Bunce [...] pobox.com - Correspondence added

Subject:	Re: [rt.cpan.org #70819] AL32UTF8 Strings Should be Decoded, not just Flagged utf8
Date:	Fri, 9 Sep 2011 09:29:23 +0100
To:	Pythian Remote DBA via RT <bug-DBD-Oracle [...] rt.cpan.org>
From:	Tim Bunce <Tim.Bunce [...] pobox.com>

On Thu, Sep 08, 2011 at 05:09:19PM -0400, Pythian Remote DBA via RT wrote: Show quoted text

> On Thu Sep 08 16:31:08 2011, DWHEELER wrote: > [...]

> > In other words, I think that DBD::Oracle does the equivalent of this > > in C: > > Encode::_utf8_on($value); > > When what it should be doing is something like this: > > $value = Encode::decode('UTF-8', $value, Encode::FB_CROAK);

> > Gotcha. At first read, it makes sense. I began to look at the code that > seems to be doing that (around line 454 of Oracle.xs), but I'll need > more coffee and sacrifices to the dark gods before I have a clear view > of what to do. I'll try to keep the CR ticket updated with my findings. > Wish me luck. :-)

I'm a little skeptical about this issue. Firstly the performance impact will be very high. Secondly the bytes in a AL32UTF8 character string stored in Oracle should be valid and if it's not then it's Oracle's fault. It seems like you're asking DBD::Oracle, and all it's users, to do checking that Oracle should have done itself. I'd like to see a test case that reproduces the problem first. Tim.

Fri Sep 09 06:15:20 2011 bohica [...] ntlworld.com - Correspondence added

On Thu Sep 08 17:09:19 2011, PYTHIAN wrote: Show quoted text

> Hi David, > > On Thu Sep 08 16:31:08 2011, DWHEELER wrote: > [...]

> > In other words, I think that DBD::Oracle does the equivalent of this > > in C: > > > > Encode::_utf8_on($value); > > > > When what it should be doing is something like this: > > > > $value = Encode::decode('UTF-8', $value, Encode::FB_CROAK);

> > Gotcha. At first read, it makes sense. I began to look at the code

that Show quoted text

> seems to be doing that (around line 454 of Oracle.xs), but I'll need > more coffee and sacrifices to the dark gods before I have a clear view > of what to do. I'll try to keep the CR ticket updated with my

findings. Show quoted text

> Wish me luck. :-) > > `/anick

Yanick, Just for your information, DBD::Oracle already decodes the data from Oracle in some cases as in oci8.c (example from oci_error_err): if (CSFORM_IMPLIES_UTF8(SQLCS_IMPLICIT)) { #ifdef sv_utf8_decode sv_utf8_decode(errstr_sv); #else SvUTF8_on(errstr_sv); #endif } Without the above, I could not get unicode error messages back from Oracle. However, as you can see, CSFORM_IMPLIES_UTF8 needs to be true and Perl needs to be new enough to have sv_utf8_decode or it falls back on the old behaviour of SvUTF8_on. Perhaps some data is retrieved which is missing the above. e.g., fetch_func_varfield, ora_blob_read_mb_piece, ora_blob_read_piece,fetch_log etc seem to only call SvUTF8_on. Martin -- Martin J. Evans Wetherby, UK

Fri Sep 09 09:50:02 2011 champoux [...] pythian.com - Correspondence added

Subject:	Re: [rt.cpan.org #70819] AL32UTF8 Strings Should be Decoded, not just Flagged utf8
Date:	Fri, 09 Sep 2011 09:49:50 -0400
To:	bug-DBD-Oracle [...] rt.cpan.org
From:	Yanick Champoux <champoux [...] pythian.com>

Gentlemen, Martin => thanks for the pointers. Tim => thanks for the inputs. And yeah, testcases will be written before the code is changed. And everything will be passed through the dbi-dev mailing list for feedback as well, as this has all the hallmarks of a joyful an of worm. :-) Joy, `/anick -- The best compliment you could give Pythian for our service is a referral.

Fri Sep 09 11:17:17 2011 Tim.Bunce [...] pobox.com - Correspondence added

Subject:	Re: [rt.cpan.org #70819] AL32UTF8 Strings Should be Decoded, not just Flagged utf8
Date:	Fri, 9 Sep 2011 16:16:53 +0100
To:	Pythian Remote DBA via RT <bug-DBD-Oracle [...] rt.cpan.org>
From:	Tim Bunce <Tim.Bunce [...] pobox.com>

On Fri, Sep 09, 2011 at 09:50:03AM -0400, Pythian Remote DBA via RT wrote: Show quoted text

> Tim => thanks for the inputs. And yeah, testcases will be written > before the code is changed.

Just to clarify, I'm looking for Dave to demonstrate a way to get an invalid UTF-8 encoding into Oracle that isn't an Oracle bug. Without that I don't think this is a DBD::Oracle 'bug'. Even with it, I'd view the request as "provide a way to optionally enable expensive encoding checks on everything read from the db just in case the db gets it wrong sometimes". Tim.

Fri Sep 09 12:07:52 2011 dwheeler [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #70819] AL32UTF8 Strings Should be Decoded, not just Flagged utf8
Date:	Fri, 9 Sep 2011 09:07:39 -0700
To:	bug-DBD-Oracle [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Sep 9, 2011, at 8:17 AM, Tim Bunce via RT wrote: Show quoted text

> Just to clarify, I'm looking for Dave to demonstrate a way to get an > invalid UTF-8 encoding into Oracle that isn't an Oracle bug.

Who's Dave? Show quoted text

> Without that I don't think this is a DBD::Oracle 'bug'.

I agree that Oracle should be responsible for verifying the character encoding. The fact that I have access to a database with broken data in it at $work has created no end of annoyances. Alas, I don't know what combination of poor code, encoding selections, and database upgrades led to this sad state of affairs. I assure you, that such is the case here. Show quoted text

> Even with it, I'd view the request as "provide a way to optionally > enable expensive encoding checks on everything read from the db just in > case the db gets it wrong sometimes".

Well, the other reason I bring this up is because utf8 != UTF-8. IIUC correctly (and that's highly questionable), Perl's internal representation is a more lax version of UTF-8. If so, then ideally when sending data *to* the database, it should be *encoded* as UTF-8 rather than utf8. Truth to tell, though, there's a rather important conversation to be had about encoding and the DBI that should take place on the dbi-dev list. Here's the relevant DBD::Pg thread for context: http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html Ideally we'd get tchrist involved, as he knows better than anyone where the unicode bodies are interred. Should I start a thread on dbi-dev? Best, David

Fri Sep 09 14:38:30 2011 Tim.Bunce [...] pobox.com - Correspondence added

Subject:	Re: [rt.cpan.org #70819] AL32UTF8 Strings Should be Decoded, not just Flagged utf8
Date:	Fri, 9 Sep 2011 19:38:06 +0100
To:	David Wheeler via RT <bug-DBD-Oracle [...] rt.cpan.org>
From:	Tim Bunce <Tim.Bunce [...] pobox.com>

On Fri, Sep 09, 2011 at 12:07:53PM -0400, David Wheeler via RT wrote: Show quoted text

> Who's Dave?

Oops. Sorry David. Show quoted text

> Well, the other reason I bring this up is because utf8 != UTF-8. IIUC > correctly (and that's highly questionable), Perl's internal > representation is a more lax version of UTF-8. If so, then ideally > when sending data *to* the database, it should be *encoded* as UTF-8 > rather than utf8.

Creating a codepoint that's invalid UTF-8 is pretty unlikely. Certainly not worth the cost of encoding *by default*. Show quoted text

> Truth to tell, though, there's a rather important conversation to be > had about encoding and the DBI that should take place on the dbi-dev > list. Here's the relevant DBD::Pg thread for context: > > http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html

I've skimmed it. Thanks. Show quoted text

> Ideally we'd get tchrist involved, as he knows better than anyone > where the unicode bodies are interred.

Let's not. I don't see any need at the moment. Show quoted text

> Should I start a thread on dbi-dev?

Yeap! That would be great. Thanks. Tim.

Fri Sep 09 14:43:44 2011 dwheeler [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #70819] AL32UTF8 Strings Should be Decoded, not just Flagged utf8
Date:	Fri, 9 Sep 2011 11:43:33 -0700
To:	bug-DBD-Oracle [...] rt.cpan.org
From:	"David E. Wheeler" <dwheeler [...] cpan.org>

On Sep 9, 2011, at 11:38 AM, Tim Bunce via RT wrote: Show quoted text

>> Who's Dave?

> > Oops. Sorry David.

Oh, okay. I didn't think Rolsky was on this thread, but one never knows (he's on a big DBD::Pg Unicode bug thread). Show quoted text

> Creating a codepoint that's invalid UTF-8 is pretty unlikely. > Certainly not worth the cost of encoding *by default*.

And yet that's exactly what I ran into. I can send some example strings pulled directly from Oracle. Show quoted text

>> Ideally we'd get tchrist involved, as he knows better than anyone >> where the unicode bodies are interred.

> > Let's not. I don't see any need at the moment.

Okay. Maybe if questions of details came up. Show quoted text

>> Should I start a thread on dbi-dev?

> > Yeap! That would be great. Thanks.

Okay. Best, David

Wed Dec 03 12:33:40 2014 bergner [...] cs.umu.se - Correspondence added

From:

bergner [...] cs.umu.se

I'm curious to know what became of this. I'm seeing the following strange UTF-8 behaviour on Oracle 11.2 and 12 on Linux x86_64, terminal encoding is set to UTF-8. Setup is pretty much equivalent on 11.2 vs. 12 except version number in Oracle paths. LC_CTYPE=en_US.UTF-8 LANG=en_US.UTF-8 NLS_LANG=.AL32UTF8 ORACLE_SID=ORADB1 ORACLE_BASE=/opt/oracle/app/oracle ORACLE_HOME=/opt/oracle/app/oracle/product/12cR1 LD_LIBRARY_PATH=/opt/oracle/app/oracle/product/12cR1/lib I've tried it with the following Perl, DBI, DBD::Oracle combination. Same results on all of them. Perl 5.10.1 + DBI 1.609 + DBD::Oracle 1.56 Perl 5.18.4 + DBI 1.630 + DBD::Oracle 1.68 Perl 5.18.4 + DBI 1.630 + DBD::Oracle 1.75_2 Database NLS parameters as follows: NLS_CHARACTERSET AL32UTF8 NLS_NCHAR_CHARACTERSET UTF8 On the Oracle 12c system NLS_NCHAR_CHARACTERSET is set to AL16UTF16 but it really shouldn't matter since I'm only using regular VARCHAR2. Create a simple table in sqlplus and insert one row. I do not have any corrupt data in the database. The value here is some cut'n'paste from http://en.wikipedia.org/wiki/Thai_alphabet in case it gets mangled here. CREATE TABLE utf8test (id INTEGER PRIMARY KEY NOT NULL, str VARCHAR2(100)); INSERT INTO utf8test VALUES (1, 'อักษรไทย'); COMMIT; Now back to Perl where I can use the following trivial script (lets call it dbiselect.pl): #!/usr/bin/perl use DBI; #use Encode; $dbh=DBI->connect("dbi:Oracle:ORADB1", "someuser", "somepassword", { RaiseError=>1 }); foreach my $col ($dbh->selectrow_array($ARGV[0])) { # Encode::_utf8_off($col); print utf8::is_utf8($col)?"utf\n":"bin\n"; print $col; print "\n"; } $dbh->disconnect(); Running this with ./dbiselect.pl "select str from utf8test where id=1" yields: utf Wide character in print at ./dbiselect line 9. อักษรไทย The "Wide character in print" warning is obviously not good. It will happen if UTF variables are serialzed without encoding them properly first to some binary representation (like a UTF-8 octet stream). I'm not using binmode(STDOUT, ':utf8') to autmatically encode, hence it will complain about wide characters in print. If I explicitly do a print Encode::encode("UTF-8", $col); or use binmode in the above script I end up with double UTF-8 encoded data and broken output. If I uncomment the use Encode; and Encode::_utf8_off($col) lines I get correct output. bin อักษรไทย For comparison purposes, the same test script on DB2 + DBD::DB2 produces the final result with having to fudge the UTF flag with Encode::_utf8_off($col). My conclusion is that the data I get back from DBD::Oracle has an internal representation that is a UTF-8 encoded octet stream, but the Perl UTF flag is set. This is an erroneous state. It should either be a UTF-8 encoded octet stream and flagged as binary, or be a UTF-X string and flagged as UTF such that when encoded to UTF-8 produces the correct octet stream. Please advice and let me know if you find flaws in my reasoning. Kind regards, Marcus

Mon Jul 22 10:37:11 2019 bohica [...] ntlworld.com - Correspondence added

On Wed Dec 03 12:33:40 2014, bergner wrote: Show quoted text

> I'm curious to know what became of this. I'm seeing the following > strange UTF-8 behaviour on Oracle 11.2 and 12 on Linux x86_64, > terminal encoding is set to UTF-8. Setup is pretty much equivalent on > 11.2 vs. 12 except version number in Oracle paths. > > LC_CTYPE=en_US.UTF-8 > LANG=en_US.UTF-8 > NLS_LANG=.AL32UTF8 > ORACLE_SID=ORADB1 > ORACLE_BASE=/opt/oracle/app/oracle > ORACLE_HOME=/opt/oracle/app/oracle/product/12cR1 > LD_LIBRARY_PATH=/opt/oracle/app/oracle/product/12cR1/lib > > I've tried it with the following Perl, DBI, DBD::Oracle combination. > Same results on all of them. > > Perl 5.10.1 + DBI 1.609 + DBD::Oracle 1.56 > Perl 5.18.4 + DBI 1.630 + DBD::Oracle 1.68 > Perl 5.18.4 + DBI 1.630 + DBD::Oracle 1.75_2 > > Database NLS parameters as follows: > > NLS_CHARACTERSET AL32UTF8 > NLS_NCHAR_CHARACTERSET UTF8 > > On the Oracle 12c system NLS_NCHAR_CHARACTERSET is set to AL16UTF16 > but it really shouldn't matter since I'm only using regular VARCHAR2. > > Create a simple table in sqlplus and insert one row. I do not have any > corrupt data in the database. The value here is some cut'n'paste from > http://en.wikipedia.org/wiki/Thai_alphabet in case it gets mangled > here. > > CREATE TABLE utf8test (id INTEGER PRIMARY KEY NOT NULL, str > VARCHAR2(100)); > INSERT INTO utf8test VALUES (1, 'อักษรไทย'); > COMMIT; > > Now back to Perl where I can use the following trivial script (lets > call it dbiselect.pl): > > #!/usr/bin/perl > use DBI; > #use Encode; > $dbh=DBI->connect("dbi:Oracle:ORADB1", "someuser", "somepassword", { > RaiseError=>1 }); > foreach my $col ($dbh->selectrow_array($ARGV[0])) { > # Encode::_utf8_off($col); > print utf8::is_utf8($col)?"utf\n":"bin\n"; > print $col; > print "\n"; > } > $dbh->disconnect(); > > Running this with ./dbiselect.pl "select str from utf8test where id=1" > yields: > > utf > Wide character in print at ./dbiselect line 9. > อักษรไทย > > The "Wide character in print" warning is obviously not good. It will > happen if UTF variables are serialzed without encoding them properly > first to some binary representation (like a UTF-8 octet stream). I'm > not using binmode(STDOUT, ':utf8') to autmatically encode, hence it > will complain about wide characters in print. If I explicitly do a > print Encode::encode("UTF-8", $col); or use binmode in the above > script I end up with double UTF-8 encoded data and broken output. > > If I uncomment the use Encode; and Encode::_utf8_off($col) lines I get > correct output. > > bin > อักษรไทย > > For comparison purposes, the same test script on DB2 + DBD::DB2 > produces the final result with having to fudge the UTF flag with > Encode::_utf8_off($col). > > My conclusion is that the data I get back from DBD::Oracle has an > internal representation that is a UTF-8 encoded octet stream, but the > Perl UTF flag is set. This is an erroneous state. It should either be > a UTF-8 encoded octet stream and flagged as binary, or be a UTF-X > string and flagged as UTF such that when encoded to UTF-8 produces the > correct octet stream. > > Please advice and let me know if you find flaws in my reasoning. > > Kind regards, > > Marcus

This works perfectly for me: #!/usr/bin/perl use DBI; binmode(STDOUT, ":encoding(UTF-8)"); # <------------ you need this to avoid wide chr warnings #use Encode; # see ora_charset and ora_ncharset I added to the connect: $dbh=DBI->connect("dbi:Oracle:bet01d.bet.org:1521/bet01d", "bet", "b3t", { RaiseError=>1, ora_charset => 'AL32UTF8', ora_ncharset => 'AL32UTF8'}); foreach my $col ($dbh->selectrow_array('select str from utf8test where id=1')) { # Encode::_utf8_off($col); print utf8::is_utf8($col)?"utf\n":"bin\n"; print $col; print "\n"; } $dbh->disconnect(); $ perl xx.pl "select str from utf8test where id=1" utf อักษรไทย Martin -- Martin J. Evans Wetherby, UK

Mon Jul 22 10:42:08 2019 bohica [...] ntlworld.com - Correspondence added

I'm closing this for multiple reasons: 1. no effective input in 8 years. 2. it was always questionable as to whether it was a bug or not 3. we've moved to github. By all means open an issue on hithub if it is still a problem. Martin -- Martin J. Evans Wetherby, UK

Mon Jul 22 10:42:09 2019 bohica [...] ntlworld.com - Status changed from 'open' to 'resolved'

Bug #70819 for DBD-Oracle: AL32UTF8 Strings Should be Decoded, not just Flagged utf8

Preferred bug tracker