Bug #54424 for Regexp-Grammars: Regexp::Grammars is very slow

Mon Feb 08 18:36:23 2010 alexeiz [...] gmail.com - Ticket created

Subject:	Regexp::Grammars is very slow

I'm trying to parse "{}" blocks in a C++ header file (10K in size) using the following regular expression:

    qr{
        <curly_block>

        <rule: curly_block>
            \{ (?: <curly_block> | [^{}] )* \}
    }xms;

Unfortunately it takes way too long. Here's the breakdown with dprofpp:

    Total Elapsed Time = 25.65577 Seconds
      User+System Time = 25.38577 Seconds
    Exclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c  Name
     98.6   25.05 25.151      3   8.3505 8.3837  Converter::__ANON__
     0.32   0.082  0.082   6142   0.0000 0.0000  Regexp::Grammars::_open_log
     0.16   0.040  0.049     10   0.0040 0.0049  Converter::BEGIN
     0.12   0.030  0.039      6   0.0050 0.0065  FindBin::BEGIN
     0.08   0.020  0.030      5   0.0040 0.0060  Scalar::Util::PP::BEGIN
     0.08   0.020  0.079      5   0.0040 0.0158  Error::BEGIN
     0.08   0.020  0.255     11   0.0018 0.0232  main::BEGIN
     0.07   0.019  0.019      2   0.0097 0.0093  Regexp::Grammars::_translate_subrule_calls
     0.04   0.010  0.010      1   0.0100 0.0100  B::bootstrap
     0.04   0.010  0.010      1   0.0100 0.0100  File::Spec::Unix::path
     0.04   0.010  0.010      1   0.0100 0.0100  main::find_comp_files
     0.04   0.010  0.010      3   0.0033 0.0033  vars::BEGIN
     0.04   0.010  0.010      3   0.0033 0.0033  Config::BEGIN
     0.04   0.010  0.020      2   0.0050 0.0100  lib::BEGIN
     0.04   0.010  0.010      6   0.0017 0.0017  Exporter::as_heavy

The equivalent plain vanilla 5.10 regexp is shown below and it produces the result much faster.

    qr{
        (?<text> (?&curly_block))

        (?(DEFINE)
            (?<curly_block>
                \{ (?: (?&curly_block) | [^{}] )* \} )
        )
    }xms;

    Total Elapsed Time = 0.493133 Seconds
      User+System Time = 0.243133 Seconds
    Exclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c  Name
     16.4   0.040  0.049     11   0.0036 0.0045  Converter::BEGIN
     16.4   0.040  0.079      5   0.0080 0.0158  Error::BEGIN
     16.4   0.040  0.255     11   0.0036 0.0232  main::BEGIN
     8.23   0.020  0.039      6   0.0033 0.0065  FindBin::BEGIN
     8.23   0.020  0.029     28   0.0007 0.0010  Fatal::BEGIN
     4.11   0.010  0.010      5   0.0020 0.0020  DynaLoader::dl_load_file
     4.11   0.010  0.010      3   0.0033 0.0033  vars::BEGIN
     4.11   0.010  0.010      4   0.0025 0.0025  Data::Dumper::BEGIN
     4.11   0.010  0.010      3   0.0033 0.0033  Config::BEGIN
     4.11   0.010  0.018      3   0.0033 0.0062  Converter::__ANON__
     4.11   0.010  0.010      2   0.0050 0.0050  Regexp::Grammars::_translate_subrule_call
     4.11   0.010  0.010      9   0.0011 0.0011  Tie::RefHash::BEGIN
     4.11   0.010  0.030      4   0.0025 0.0074  XSLoader::load
     4.11   0.010  0.019      1   0.0099 0.0192  FindBin::init
     4.11   0.010  0.010     66   0.0001 0.0001  File::Spec::Unix::canonpath

25 seconds with Regexp::Grammars vs 0.5 seconds with a stock Perl regex. And this is on a pretty simple regular expression. Anything more complex than this and Regexp::Grammars doesn't even terminate. This is a showstopper for me.

perl -V:

    Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
      Platform:
        osname=solaris, osvers=2.9, archname=sun4-solaris-64int
        uname='sunos sundev32 5.9 generic_118558-18 sun4u sparc sunw,sun-fire '
        config_args=''
        hint=recommended, useposix=true, d_sigaction=define
        useithreads=undef, usemultiplicity=undef
        useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
        use64bitint=define, use64bitall=undef, uselongdouble=undef
        usemymalloc=n, bincompat5005=undef
      Compiler:
        cc='/opt/SUNWspro/bin/cc', ccflags ='-I/bbs/opt/include -I/opt/swt/include -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
        optimize='-O',
        cppflags='-I/bbs/opt/include -I/opt/swt/include -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
        ccversion='Sun C 5.5 Patch 112760-09 2004/03/31', gccversion='', gccosandvers=''
        intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=87654321
        d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
        ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
        alignbytes=8, prototype=define
      Linker and Libraries:
        ld='/opt/SUNWspro/bin/cc', ldflags ='-L/bbs/opt/lib -L/opt/swt/lib -L/usr/lib -L/usr/ccs/lib -L/bb/util/common/studio8-v3/SUNWspro/prod/lib -L/usr/local/lib '
        libpth=/bbs/opt/lib /opt/swt/lib /usr/lib /usr/ccs/lib /bb/util/common/studio8-v3/SUNWspro/prod/lib /usr/local/lib
        libs=-lsocket -lnsl -lgdbm -ldl -lm -lc
        perllibs=-lsocket -lnsl -ldl -lm -lc
        libc=/lib/libc.so, so=so, useshrplib=false, libperl=libperl.a
        gnulibc_version=''
      Dynamic Linking:
        dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
        cccdlflags='-KPIC', lddlflags='-G -L/bbs/opt/lib -L/opt/swt/lib -L/usr/lib -L/usr/ccs/lib -L/bb/util/common/studio8-v3/SUNWspro/prod/lib -L/usr/local/lib'

    Characteristics of this binary (from libperl):
      Compile-time options: PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
                            PERL_USE_SAFE_PUTENV USE_64_BIT_INT
                            USE_LARGE_FILES USE_PERLIO
      Built under solaris
      Compiled at Jun 17 2008 17:56:25
      %ENV:
        PERL5LIB="/bb/util/common/perlmod/lib/site_perl"
      @INC:
        /bb/util/common/perlmod/lib/site_perl
        /bbs/opt/perl-5.10.0/lib/5.10.0/sun4-solaris-64int
        /bbs/opt/perl-5.10.0/lib/5.10.0
        /bbs/opt/perl-5.10.0/lib/site_perl/5.10.0/sun4-solaris-64int
        /bbs/opt/perl-5.10.0/lib/site_perl/5.10.0
        .

Mon Feb 08 19:14:13 2010 damian [...] conway.org - Correspondence added

Subject:	Re: [rt.cpan.org #54424] Regexp::Grammars is very slow
Date:	Tue, 9 Feb 2010 11:12:32 +1100
To:	bug-Regexp-Grammars [...] rt.cpan.org
From:	Damian Conway <damian [...] conway.org>


> I'm trying to parse "{}" blocks in a C++ header file (10K in size) using
> the following regular expression:
>
> qr{
>     <curly_block>
>
>     <rule: curly_block>
>         \{ (?: <curly_block> | [^{}] )* \}
> }xms;
>
> Unfortunately it takes way too long.

The performance might be improved by writing the curly_block rule more efficiently:

    <token: curly_block>
        \{ (?: [^{}]++ | <.curly_block> )* \}

BTW, the vanilla regex you compared it with is not equivalent to the regex grammar you used. The equivalent regex would be:

    qr{
        (?<text> (?&curly_block))

        (?(DEFINE)
            (?<curly_block>
                \s* \{ \s* (?: (?&curly_block) | [^{}] )* \s* \} )
        )
    }xms;

Hope this helps,

Damian
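
The suggested token form can also be written as a plain 5.10 regex. The following is a minimal sketch using only core Perl, not code from the ticket: the sample string in `$source` is hypothetical, and the pattern simply applies the same possessive `[^{}]++` idea to the vanilla named-rule version.

```perl
use strict;
use warnings;
use 5.010;

# Plain-regex analogue of the suggested <token: curly_block>:
# consume runs of non-brace characters possessively, then recurse
# into nested blocks. No implicit whitespace matching is involved.
my $curly_block = qr{
    (?<text> (?&curly_block) )

    (?(DEFINE)
        (?<curly_block>
            \{ (?: [^{}]++ | (?&curly_block) )* \}
        )
    )
}xms;

# Hypothetical sample input, not taken from the ticket.
my $source = 'struct A { int f() { return 1; } };';

if ($source =~ $curly_block) {
    say $+{text};    # the first balanced {...} block in $source
}
```

Because `[^{}]++` never gives back characters it has consumed, the engine does not retry the same run of text one character at a time when a later part of the match fails.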

Mon Feb 08 19:14:14 2010 The RT System itself - Status changed from 'new' to 'open'

Tue Feb 09 01:36:28 2010 alexeiz [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #54424] Regexp::Grammars is very slow
Date:	Tue, 9 Feb 2010 01:35:11 -0500
To:	bug-Regexp-Grammars [...] rt.cpan.org
From:	Alexei Zakharov <alexeiz [...] gmail.com>

Hi Damian,

Thanks for your quick reply. I tried your suggested improvement, but it didn't make any difference in performance. The dprofpp profile stayed essentially the same as well, with Converter::__ANON__ consuming 98% of the execution time.

The equivalent plain regex that you provided is indeed a little slower than the original one, but not to the point of Regexp::Grammars slowness. Now the plain regex is only 20 times faster than Regexp::Grammars.

I'm open to more suggestions. I can also probably provide you with a repro scenario, although I believe it should be pretty easy to repro. The key is a large input file (10K in my case).

Thanks,
Alexei

On Mon, Feb 8, 2010 at 7:14 PM, damian@conway.org via RT <bug-Regexp-Grammars@rt.cpan.org> wrote:

> <URL: http://rt.cpan.org/Ticket/Display.html?id=54424 >
>
> > I'm trying to parse "{}" blocks in a C++ header file (10K in size) using
> > the following regular expression:
> >
> > qr{
> >     <curly_block>
> >
> >     <rule: curly_block>
> >         \{ (?: <curly_block> | [^{}] )* \}
> > }xms;
> >
> > Unfortunately it takes way too long.
>
> The performance might be improved by writing the curly_block rule more
> efficiently:
>
>     <token: curly_block>
>         \{ (?: [^{}]++ | <.curly_block> )* \}
>
> BTW, the vanilla regex you compared it with is not equivalent to the
> regex grammar you used. The equivalent regex would be:
>
>     qr{
>         (?<text> (?&curly_block))
>
>         (?(DEFINE)
>             (?<curly_block>
>                 \s* \{ \s* (?: (?&curly_block) | [^{}] )* \s* \} )
>         )
>     }xms;
>
> Hope this helps,
>
> Damian
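
A repro along the lines described above might look like the following sketch. It rests on assumptions: the original header file is not attached to the ticket, so `$input` is a synthetic stand-in of roughly the same size, and the variable names are hypothetical. Only the plain regex is timed here; a Regexp::Grammars grammar could be timed the same way in its own lexical scope.

```perl
use strict;
use warnings;
use 5.010;
use Time::HiRes qw(gettimeofday tv_interval);

# Synthetic stand-in for the 10K header (an assumption, the real file
# is not part of the ticket): many small nested blocks in one outer block.
my $body  = join '', map { "void f$_() { if (x) { y(); } }\n" } 1 .. 300;
my $input = "namespace demo {\n$body}\n";

# The plain 5.10 regex from the original report.
my $vanilla = qr{
    (?<text> (?&curly_block) )

    (?(DEFINE)
        (?<curly_block> \{ (?: (?&curly_block) | [^{}] )* \} )
    )
}xms;

# Time a single match over the whole input.
my $start   = [gettimeofday];
my $matched = $input =~ $vanilla ? 1 : 0;
my $elapsed = tv_interval($start);

printf "input: %d bytes, matched: %s, elapsed: %.4f s\n",
    length $input, ($matched ? 'yes' : 'no'), $elapsed;
```

Wall-clock timings from a single run are noisy, so repeating the match several times (or using the core Benchmark module) would give steadier numbers.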

Tue Feb 09 07:45:52 2010 damian [...] conway.org - Correspondence added

Subject:	Re: [rt.cpan.org #54424] Regexp::Grammars is very slow
Date:	Tue, 9 Feb 2010 21:15:15 +1100
To:	bug-Regexp-Grammars [...] rt.cpan.org
From:	Damian Conway <damian [...] conway.org>

Hi Alexei,

> Thanks for your quick reply. I tried your suggested improvement. But it
> didn't make any difference in performance.
>
> I'm open to more suggestions. I can also probably provide you with a repro
> scenario. Although, I believe it should be pretty easy to repro. The key
> is a large input file (10K in my case).

So is there a reason why you need Regexp::Grammars (rather than just using vanilla 5.10 named rules)? After all, your example wasn't actually capturing the individual nested curly blocks it was parsing. Could you just use the 20x faster vanilla regex?

Otherwise, if you really do need to build the full data structure quicker, there may not be anything you can do (except trying Parse::Yapp or Marpa instead). Regexp::Grammars is never going to compete with the pure regex engine, because its internals are implemented in Perl, whilst the regex engine's are implemented in C.

Sorry I couldn't be of more help.

Damian
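
If only the text of each block is needed rather than a full parse tree, the vanilla route might look like the sketch below. This is an illustration under assumptions, not code from the ticket: `$source` and `@blocks` are hypothetical names, and the pattern collects only top-level blocks.

```perl
use strict;
use warnings;
use 5.010;

# Recursive named group: a balanced {...} block, possibly nested.
my $block_re = qr{
    (?<block> \{ (?: [^{}]++ | (?&block) )* \} )
}xms;

# Hypothetical sample input, not taken from the ticket.
my $source = 'class A { void f() { } }; class B { int x; };';

# Collect each top-level block. Captures made inside the recursion
# are discarded, so $+{block} holds the outermost block of each match.
my @blocks;
while ($source =~ /$block_re/g) {
    push @blocks, $+{block};
}

say for @blocks;
```

Nested blocks are matched but not recorded individually. Getting at them would mean applying the same pattern again to the interior of each collected block, which is the kind of structure-building that Regexp::Grammars otherwise does.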