CC: slaven [...] rezic.de, Tony Cook via RT <perlbug-followup [...] perl.org>
Subject: database UTF-8 documentation (was: [perl #126849] Wide character in subroutine entry, DB_File)
Date: Wed, 9 Dec 2015 15:42:12 -0800
To: bug-DB_File [...] rt.cpan.org, bug-BerkeleyDB [...] rt.cpan.org
From: frederik [...] ofb.net
Thank you Tony and Slaven for your replies.
I'm sending to bug-DB_File@rt.cpan.org and bug-BerkeleyDB@rt.cpan.org
as per instructions at rt.cpan.org.
The bug, to summarize what's below, is really a request for the
documentation of the DB_File and BerkeleyDB packages to explain the
situation with respect to UTF-8 support: namely the lack of special
support, how to interpret the "Wide character in subroutine entry"
message, and how to put filters on a database object so that UTF-8
data round-trips correctly.
I don't think any changes to the code are necessary, given what's been
said by Tony and Slaven.
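As a concrete illustration of the filter approach, here is a minimal
sketch of the kind of example the documentation might include. It
assumes DB_File is built against a working Berkeley DB library. The
temporary directory, the file name example.db, and the $round_trip_ok
variable are placeholders of my own; the filter_* methods are the ones
Slaven uses in his reply quoted below.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use DB_File;
use Fcntl;                         # O_CREAT, O_RDWR
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Hypothetical scratch location, chosen only for this illustration.
my $dir = tempdir(CLEANUP => 1);
my $dbf = "$dir/example.db";

my %h;
my $db = tie %h, 'DB_File', $dbf, O_CREAT|O_RDWR, 0666, $DB_BTREE
    or die "Cannot tie $dbf: $!";

# Encode characters to UTF-8 octets on the way in ...
$db->filter_store_key  (sub { $_ = encode('UTF-8', $_) });
$db->filter_store_value(sub { $_ = encode('UTF-8', $_) });
# ... and decode octets back to characters on the way out.
$db->filter_fetch_key  (sub { $_ = decode('UTF-8', $_) });
$db->filter_fetch_value(sub { $_ = decode('UTF-8', $_) });

my $key = 'œ';
$h{$key} = 'value for œ';

# The key read back from the database should equal the key we stored.
my ($stored_key) = keys %h;
my $round_trip_ok =
    ($stored_key eq $key && $h{$key} eq 'value for œ') ? 1 : 0;
print $round_trip_ok ? "match\n" : "no match\n";

undef $db;                         # drop the object reference before untie
untie %h;
```

With the four filters installed, the store no longer triggers the
"Wide character" error, because DB_File only ever sees octets. Removing
the two store filters should bring the error back for the same key.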
Thanks again!
On Tue, Dec 08, 2015 at 07:50:07PM -0800, Tony Cook via RT wrote:
> On Tue Dec 08 14:33:46 2015, frederik@ofb.net wrote:
On Wed, Dec 09, 2015 at 12:52:43PM -0800, slaven@rezic.de via RT wrote:
> > The following program produces the error "Wide character in subroutine
> > entry at ./bug-example line 23.". I guess it means that DB_File does
> > not support UTF-8. I notice that when using BerkeleyDB, it works. I
> > had some trouble debugging this and wanted to suggest some
> > improvements:
>
> BerkeleyDB simply isn't warning about the lack of UTF-8 support.
>
> If I add the following to the end of your code:
>
> my @keys = keys %h;
> print $keys[0] eq $ents[0] ? "match" : "no match";
>
> and uncomment the BerkeleyDB tie, you'll see that the key you supplied
> doesn't match the key that the database is storing.
>
> Luckily both BerkeleyDB and DB_File have a mechanism to automatically process
> both keys and values, for DB_File:
>
> use DBM_Filter;
> my $db = tie %h, 'DB_File', $dbf, O_CREAT|O_RDWR, 0666, $DB_BTREE;
> $db->Filter_Key_Push('utf8');
>
> for BerkeleyDB:
>
> my $db = tie %h, "BerkeleyDB::Btree", -Filename=>$dbf, -Flags=>DB_CREATE;
> $db->filter_store_key(sub { utf8::encode($_) });
> $db->filter_fetch_key(sub { utf8::decode($_) });
>
> Here I'm only processing the keys; see the documentation on processing the values instead (or as well).
>
> (perldoc DBM_Filter claims to support BerkeleyDB, but doesn't appear to.)
>
> > 1. perldiag mentions "Wide character in %s" but not "Wide character in
> > subroutine entry". The description for the former talks about
> > filehandles and binmode, while "Wide character in subroutine entry"
> > seems to demand a use of encode(...). Perhaps the "subroutine entry"
> > version of the message should be described specially or separately in
> > perldiag.
>
> That warning is caused by the XS code for DB_File calling SvPVbyte(), and it
> happens that the entersub ("subroutine entry") op used to call the XS code
> is active at that point.
>
> I'm not sure explaining that would be useful to a normal user reading the documentation.
>
> > 2. I guess DB_File is a bit old, but I chose it because I don't need
> > any of the BerkeleyDB features like cursors, and I value backwards
> > compatibility. Perhaps the man page should mention that it doesn't
> > work with UTF-8, which would have changed my decision. Or the man page
> > could even mention that one needs to encode("utf-8", $_) on keys.
>
> > 3. Then again, DB_File could be updated to support UTF-8.
>
> DB_File is CPAN upstream and is maintained by the same author as BerkeleyDB.
>
> CPAN upstream issues should be reported upstream, see https://rt.cpan.org/Public/Dist/Display.html?Name=DB_File
>
> Tony
>
> On Tue 08 Dec 2015, 14:33:46, frederik@ofb.net wrote:
On Wed, Dec 09, 2015 at 01:13:31PM -0800, slaven@rezic.de via RT wrote:
> >
> > This is a bug report for perl from frederik@ofb.net,
> > generated with the help of perlbug 1.40 running under perl 5.22.0.
> >
> >
> > -----------------------------------------------------------------
> > [Please describe your issue here]
> >
> > The following program produces the error "Wide character in subroutine
> > entry at ./bug-example line 23.". I guess it means that DB_File does
> > not support UTF-8. I notice that when using BerkeleyDB, it works. I
> > had some trouble debugging this and wanted to suggest some
> > improvements:
> >
> > 1. perldiag mentions "Wide character in %s" but not "Wide character in
> > subroutine entry". The description for the former talks about
> > filehandles and binmode, while "Wide character in subroutine entry"
> > > seems to demand a use of encode(...). Perhaps the "subroutine entry"
> > version of the message should be described specially or separately in
> > perldiag.
> >
> > 2. I guess DB_File is a bit old, but I chose it because I don't need
> > any of the BerkeleyDB features like cursors, and I value backwards
> > compatibility. Perhaps the man page should mention that it doesn't
> > work with UTF-8, which would have changed my decision. Or the man page
> > could even mention that one needs to encode("utf-8", $_) on keys.
> >
> > 3. Then again, DB_File could be updated to support UTF-8.
> >
> > Thanks so much for a great programming language.
> >
> > #!/bin/perl
> >
> > use strict;
> > use utf8;
> > use BerkeleyDB;
> > use DB_File;
> > use Encode;
> >
> > $\ = "\n";
> >
> > my $dbf = "xx.db";
> > unlink $dbf;
> >
> > my %h;
> >
> > # tie %h, "BerkeleyDB::Btree", -Filename=>$dbf, -Flags=>DB_CREATE;
> > tie %h, 'DB_File', $dbf, O_CREAT|O_RDWR, 0666, $DB_BTREE;
> >
> > my @ents;
> > # @ents = map {decode("utf-8", $_)} @ARGV;
> > @ents = decode("utf-8", encode("utf-8",'œ'));
> >
> > for(@ents) { $h{$_} = 1; }
> >
> > print join("\n", keys %h);
> >
> >
> >
> > [Please do not change anything below this line]
> > -----------------------------------------------------------------
> > ---
> > Flags:
> > category=core
> > severity=low
> > ---
> > Site configuration information for perl 5.22.0:
> >
> > Configured by builduser at Tue Jun 2 09:45:08 CEST 2015.
> >
> > Summary of my perl5 (revision 5 version 22 subversion 0)
> > configuration:
> >
> > Platform:
> > osname=linux, osvers=4.0.4-2-arch, archname=x86_64-linux-thread-
> > multi
> > uname='linux flo-64 4.0.4-2-arch #1 smp preempt fri may 22 03:05:23
> > utc 2015 x86_64 gnulinux '
> > config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64
> > -mtune=generic -O2 -pipe -fstack-protector-strong --param=ssp-buffer-
> > size=4 -Dprefix=/usr -Dvendorprefix=/usr
> > -Dprivlib=/usr/share/perl5/core_perl
> > -Darchlib=/usr/lib/perl5/core_perl
> > -Dsitelib=/usr/share/perl5/site_perl
> > -Dsitearch=/usr/lib/perl5/site_perl
> > -Dvendorlib=/usr/share/perl5/vendor_perl
> > -Dvendorarch=/usr/lib/perl5/vendor_perl -Dscriptdir=/usr/bin/core_perl
> > -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl
> > -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl
> > -Dcccdlflags='-fPIC' -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-
> > needed,-z,relro -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro'
> > hint=recommended, useposix=true, d_sigaction=define
> > useithreads=define, usemultiplicity=define
> > use64bitint=define, use64bitall=define, uselongdouble=undef
> > usemymalloc=n, bincompat5005=undef
> > Compiler:
> > cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-
> > aliasing -pipe -fstack-protector-strong -I/usr/local/include
> > -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
> > optimize='-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-
> > strong --param=ssp-buffer-size=4',
> > cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing
> > -pipe -fstack-protector-strong -I/usr/local/include'
> > ccversion='', gccversion='5.1.0', gccosandvers=''
> > intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678,
> > doublekind=3
> > d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16,
> > longdblkind=3
> > ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
> > lseeksize=8
> > alignbytes=8, prototype=define
> > Linker and Libraries:
> > ld='cc', ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro
> > -fstack-protector-strong -L/usr/local/lib'
> > libpth=/usr/local/lib /usr/lib/gcc/x86_64-unknown-linux-
> > gnu/5.1.0/include-fixed /usr/lib /lib/../lib /usr/lib/../lib /lib
> > /lib64 /usr/lib64
> > libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
> > -lgdbm_compat
> > perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
> > libc=libc-2.21.so, so=so, useshrplib=true, libperl=libperl.so
> > gnulibc_version='2.21'
> > Dynamic Linking:
> > dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E
> > -Wl,-rpath,/usr/lib/perl5/core_perl/CORE'
> > cccdlflags='-fPIC', lddlflags='-shared -Wl,-O1,--sort-common,--as-
> > needed,-z,relro -L/usr/local/lib -fstack-protector-strong'
> >
> >
> > ---
> > @INC for perl 5.22.0:
> > /home/frederik/scripts-misc/perl
> > /home/frederik/.local/lib/perl5/x86_64-linux-thread-multi
> > /home/frederik/.local/lib/perl5
> > /usr/lib/perl5/site_perl
> > /usr/share/perl5/site_perl
> > /usr/lib/perl5/vendor_perl
> > /usr/share/perl5/vendor_perl
> > /usr/lib/perl5/core_perl
> > /usr/share/perl5/core_perl
> > .
> >
> > ---
> > Environment for perl 5.22.0:
> > HOME=/home/frederik
> > LANG=en_US.UTF-8
> > LANGUAGE (unset)
> > LD_LIBRARY_PATH=/home/frederik/.local/arch/x86_64/lib:/home/frederik/.local/lib:/usr/local/lib
> > LOGDIR (unset)
> > PATH=/home/frederik/.local/bin:/home/frederik/projects/mailproc:/home/frederik/scripts-
> > misc:/home/frederik/.local/arch/x86_64/bin:/usr/bin/core_perl:/usr/bin/vendor_perl:/usr/bin/site_perl:/usr/local/bin:/usr/local/sbin:/usr/bin
> > PERL5LIB=/home/frederik/scripts-
> > misc/perl:/home/frederik/.local/lib/perl5
> > PERL_BADLANG (unset)
> > PERL_LOCAL_LIB_ROOT=/home/frederik/.local/:/home/frederik/.local/:/home/frederik/.local/:/home/frederik/.local/
> > PERL_MB_OPT=--install_base "/home/frederik/.local/"
> > PERL_MM_OPT=INSTALL_BASE=/home/frederik/.local/
> > SHELL=/bin/zsh
>
> DB_File (and the underlying Berkeley DB engine, I guess) can handle only binary (octet, or Latin-1) data. There's no way to specify an encoding, in particular for "wide characters". But if you know that you have to store data in the UTF-8 encoding, then you can define "DBM filters" which automatically translate wide characters into octets and vice versa:
>
> for my $filter (qw(filter_store_key filter_store_value)) {
>     (tied %h)->$filter(sub { $_ = encode('utf-8', $_) });
> }
> for my $filter (qw(filter_fetch_key filter_fetch_value)) {
>     (tied %h)->$filter(sub { $_ = decode('utf-8', $_) });
> }
>
> Maybe something like this could be added to the DB_File documentation.
>
> Maybe there's also room for a tiny (CPAN) module, say DB_File::utf8, which does something like this automatically.
>
> Regards,
> Slaven
>
> On Wed 09 Dec 2015, 12:52:43, slaven@rezic.de wrote:
> > On Tue 08 Dec 2015, 14:33:46, frederik@ofb.net wrote:
> [...]
> > Maybe something like this could be added to the DB_File documentation.
> >
> > Maybe there's also room for a tiny (CPAN) module, say DB_File::utf8,
> > which does something like this automatically.
>
> Missed Tony's answer, and of course, DBM_Filter::utf8 is there and good enough.
>
> Regards,
> Slaven
>