Subject: | AL32UTF8 Strings Should be Decoded, not just Flagged utf8 |
Date: | Thu, 8 Sep 2011 13:30:55 -0700 |
To: | bug-DBD-Oracle [...] rt.cpan.org |
From: | "David E. Wheeler" <dwheeler [...] cpan.org> |
The DBD::Oracle docs state:
Show quoted text
> Any data returned from Oracle to DBD::Oracle in the AL32UTF8 character set will be marked as UTF-8 to ensure correct handling by Perl.
>
> For Oracle to return data in the AL32UTF8 character set the NLS_LANG or NLS_NCHAR environment variable must be set as described in the previous section.
I've recently written an app that connects to an Oracle 9 server with code like this:
BEGIN {
$ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8';
}
my $dbh = DBI->connect("dbi:Oracle:$sid", $username, $password, {
RaiseError => 1,
PrintError => 0,
AutoCommit => 0,
});
With this pattern, I text data fetched from the database does in fact have the utf8 flag enabled on the appropriate values. However, not all text data returned from the database is in fact UTF-8. So yes, I've told DBD::Oracle that the data should be UTF-8, but Oracle itself sometimes has other opinions.
The upshot is that I can end up with text values with the utf8 flag enabled but that contain invalid UTF-8. I suspect that this is because DBD::Oracle simply turns on that flag, rather than using Perl's Unicode API to *decode* the text.
In other words, I think that DBD::Oracle does the equivalent of this in C:
Encode::_utf8_on($value);
When what it should be doing is something like this:
$value = Encode::decode('UTF-8', $value, Encode::FB_CROAK);
This would ensure that the data was, in fact, UTF-8, and would die if it was not. It also would be better because Perl's internal representation, known as "utf8", is not the same as UTF-8 (it's more relaxed about what it considers UTF-8).
More details on Unicode gotchas are cataloged here:
http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129
FWIW, I'm currently working around this bug by turning off the utf8 flag and decoding myself:
my @encodings = map { find_encoding $_ } qw(utf-8 cp1252 iso-8859-1);
if (defined $value) {
# DBD::Oracle turns on utf8 without validating.
Encode::_utf8_off($value);
# We need to decode the value.
my $found = 0;
for my $enc (@encodings) {
local $@;
eval { $value = $enc->decode($value, Encode::FB_CROAK) };
next if $@;
$found = 1;
last;
}
die "Cannot decode $value" unless $found;
}
Fortunately our strings are broken in particular ways (encoded as CP1252 or latin-1 when not properly UTF-8) so this is fine for me. But I do think that naïvely turning on the utf8 flag is not the proper way to handle UTF-8 values fetched from the database.
Best,
David