Subject: Problems processing Hebrew texts
Hello,
I would like to report a possible bug in the processing of UTF-8 encoded Hebrew text.
Please consider the following usage (you might need Hebrew fonts in order to read the Hebrew text):
use utf8;
use Text::Ngrams;
my $ng = Text::Ngrams->new( type => 'utf8' );
$ng->process_text('שלום עולם!');
print $ng->to_string;
Unfortunately, the output suggests that the input contained no text (or only an empty string), as follows:
BEGIN OUTPUT BY Text::Ngrams version 1.7
1-GRAMS (total count: 0)
FIRST N-GRAM:
LAST N-GRAM:
------------------------
2-GRAMS (total count: 0)
FIRST N-GRAM:
LAST N-GRAM:
------------------------
3-GRAMS (total count: 0)
FIRST N-GRAM:
LAST N-GRAM:
------------------------
END OUTPUT BY Text::Ngrams
I'm using Perl 5.8.6:
Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
Platform:
osname=linux, osvers=2.2.17, archname=i686-linux-thread-multi
uname='linux gimlet 2.2.17 #1 sun jun 25 09:24:41 est 2000 i686 unknown '
config_args='-ders -Dcc=gcc -Accflags=-DNO_HASH_SEED -Dusethreads -Duseithreads -Ud_sigsetjmp -Uinstallusrbinperl -Ulocincpth= -Uloclibpth= -Duselargefiles -Uusemallocwrap -Dinc_version_list=5.8.5/$archname 5.8.5 5.8.4/$archname 5.8.4 5.8.3/$archname 5.8.3 5.8.2/$archname 5.8.2 5.8.1/$archname 5.8.1 5.8.0/$archname 5.8.0 -Duseshrplib -Dprefix=/usr/local/ActivePerl-5.8 -Dcf_by=ActiveState -Dcf_email=support@ActiveState.com'
hint=recommended, useposix=true, d_sigaction=define
usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DNO_HASH_SEED -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DNO_HASH_SEED -fno-strict-aliasing -pipe'
ccversion='', gccversion='2.95.2 20000220 (Debian GNU/Linux)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='gcc', ldflags =''
libpth=/lib /usr/lib /usr/local/lib
libs=-lnsl -lndbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lposix
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc -lposix
libc=/lib/libc-2.1.3.so, so=so, useshrplib=true, libperl=libperl.so
gnulibc_version='2.1.3'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/local/ActivePerl-5.8/lib/5.8.6/i686-linux-thread-multi/CORE'
cccdlflags='-fpic', lddlflags='-shared'
Characteristics of this binary (from libperl):
Compile-time options: MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
Locally applied patches:
ActivePerl Build 811
21540 Fix backward-compatibility issues in if.pm
23565 Wrong MANIFEST.SKIP
Built under linux
Compiled at Dec 5 2004 07:09:45
@INC:
/usr/local/ActivePerl-5.8/lib/5.8.6/i686-linux-thread-multi
/usr/local/ActivePerl-5.8/lib/5.8.6
/usr/local/ActivePerl-5.8/lib/site_perl/5.8.6/i686-linux-thread-multi
/usr/local/ActivePerl-5.8/lib/site_perl/5.8.6
/usr/local/ActivePerl-5.8/lib/site_perl
.