
This queue is for tickets about the Lingua-EN-Sentence CPAN distribution.

Report information
The Basics
Id: 104419
Status: resolved
Priority: 0/
Queue: Lingua-EN-Sentence

People
Owner: Nobody in particular
Requestors: NGLENN [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: 0.28



Subject: [PATCH] make abbreviation processing faster
Currently the code loops over each abbreviation, creating a regex for each and then processing the text. I used about 250 abbreviations for my data (I just went to Wikipedia and grabbed all of the abbreviations commonly used in this language), and processing became very slow. The attached patch creates and caches a regex that processes all of the registered abbreviations at the same time. It improved performance significantly for me.
Subject: FastAbbrv.patch
From aa202b55f5d58fbe8442b64ccad19953d5a26a00 Mon Sep 17 00:00:00 2001
From: Nathan Glenn <nathan.g.glenn@atr-trek.co.jp>
Date: Thu, 14 May 2015 09:55:39 +0900
Subject: [PATCH] modified

---
 Sentence.pm | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/Sentence.pm b/Sentence.pm
index eacd582..d8df532 100644
--- a/Sentence.pm
+++ b/Sentence.pm
@@ -219,7 +219,8 @@ my @MISC = qw(no esp);
 my @LATIN = qw(vs etc al ibid sic);
 our @ABBREVIATIONS = (@PEOPLE, @TITLE_SUFFIXES, @ARMY, @INSTITUTES, @COMPANIES, @PLACES, @MONTHS, @MISC, @LATIN );
-
+my $abbreviation_regex;
+_set_abbreviations_regex();
 #==============================================================================
 #
@@ -250,6 +251,7 @@ sub get_sentences {
 #------------------------------------------------------------------------------
 sub add_acronyms {
     push @ABBREVIATIONS, @_;
+    _set_abbreviations_regex();
 }
 #------------------------------------------------------------------------------
@@ -264,6 +266,7 @@ sub get_acronyms {
 #------------------------------------------------------------------------------
 sub set_acronyms {
     @ABBREVIATIONS=@_;
+    _set_abbreviations_regex();
 }
 #------------------------------------------------------------------------------
@@ -282,7 +285,9 @@ sub set_EOS {
         cluck "Won't set \$EOS to undefined value!\n";
         return $EOS;
     }
-    return $EOS = $new_EOS;
+    $EOS = $new_EOS;
+    _set_abbreviations_regex();
+    return $EOS;
 }
 #------------------------------------------------------------------------------
@@ -334,6 +339,13 @@ sub set_locale {
 #
 #==============================================================================
+# save some time by pre-compiling a regex used for working with abbreviations
+sub _set_abbreviations_regex {
+    my $abbreviations = join '|', @ABBREVIATIONS;
+    $abbreviation_regex = qr[(\b(?:$abbreviations)$PAP\s)$EOS]is;
+    return;
+}
+
 ## Please email me any suggestions for optimizing these RegExps.
 sub remove_false_end_of_sentence {
     my ($marked_segment) = @_;
@@ -351,7 +363,7 @@ sub remove_false_end_of_sentence {
     # fix "." "?" "!"
     $marked_segment=~s/(['"]$P['"]\s+)$EOS/$1/sg;
     ## fix where abbreviations exist
-    foreach (@ABBREVIATIONS) { $marked_segment=~s/(\b$_$PAP\s)$EOS/$1/isg; }
+    $marked_segment=~s/$abbreviation_regex/$1/g;
     # don't break after quote unless its a capital letter.
     $marked_segment=~s/(["']\s*)$EOS(\s*[[:lower:]])/$1$2/sg;
--
1.9.5.msysgit.1
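For readers who want to try the idea outside the module, the following is a minimal standalone sketch of the technique the patch uses: join all abbreviations into one alternation, compile it once with qr//, and run a single substitution. The values of $EOS and $PAP, the abbreviation list, and the helper names (build_abbreviation_regex, strip_per_abbreviation, strip_combined) are assumptions made for illustration and are not taken from Sentence.pm. The quotemeta call is also an addition; the patch interpolates the abbreviations unescaped.

```perl
use strict;
use warnings;

# Assumed stand-ins for the module's internals; the real $EOS and $PAP
# in Sentence.pm may be defined differently.
my $EOS = "\001";        # end-of-sentence marker inserted by the splitter
my $PAP = qr/[.!?]/;     # punctuation that may follow an abbreviation
my @abbreviations = qw(mr mrs dr prof etc);

# Build one alternation covering every abbreviation and compile it once.
# quotemeta is an extra safeguard that the patch itself does not apply.
sub build_abbreviation_regex {
    my @abbr = @_;
    my $alternation = join '|', map { quotemeta } @abbr;
    return qr/(\b(?:$alternation)$PAP\s)$EOS/is;
}

my $abbreviation_regex = build_abbreviation_regex(@abbreviations);

# Original approach: one substitution pass per abbreviation.
sub strip_per_abbreviation {
    my ($text) = @_;
    for my $abbr (@abbreviations) {
        $text =~ s/(\b\Q$abbr\E$PAP\s)$EOS/$1/isg;
    }
    return $text;
}

# Patched approach: a single pass with the cached combined regex.
sub strip_combined {
    my ($text) = @_;
    $text =~ s/$abbreviation_regex/$1/g;
    return $text;
}

my $marked = "Dr. ${EOS}Smith met Prof. ${EOS}Jones. ${EOS}They talked.";
print strip_combined($marked) eq strip_per_abbreviation($marked)
    ? "same\n"
    : "different\n";
```

One edge case worth keeping in mind with this approach: if the abbreviation list is empty, the alternation becomes an empty group that matches anywhere, so the combined regex would strip markers that the per-abbreviation loop would have left alone.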
Patch has been applied, thanks for adding it.
Thanks for responding and applying the patch. I noticed that the slow line that the patch removed is still in the distribution, though. You now have this:

foreach (@ABBREVIATIONS) { $marked_segment=~s/(\b$_$PAP\s)$EOS/$1/isg; }
$marked_segment=~s/$abbreviation_regex/$1/g;

That second line does the same as the first, and was meant to replace it. Now it's doing the work twice!
Subject: Re: [rt.cpan.org #104419] [PATCH] make abbreviation processing faster
Date: Sun, 24 May 2015 10:59:36 +1000
To: bug-Lingua-EN-Sentence [...] rt.cpan.org
From: Kim Ryan <kimryan [...] bigpond.net.au>
OK then. I tried using patch but it rejected most of the changes, so did it manually in the end. Can you check this version? I'm not sure how to test for the slowness.

Regards,
Kim

On 23/05/2015 1:55 PM, Nathan Gary Glenn via RT wrote:
> Queue: Lingua-EN-Sentence
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=104419 >
>
> Thanks for responding and applying the patch.
> I noticed that the slow line that the patch removed is still in the distribution, though. You now have this:
>
> foreach (@ABBREVIATIONS) { $marked_segment=~s/(\b$_$PAP\s)$EOS/$1/isg; }
> $marked_segment=~s/$abbreviation_regex/$1/g;
>
> That second line does the same as the first, and was meant to replace it. Now it's doing the work twice!

Message body is not shown because sender requested not to inline it.
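
On the question of how to test for the slowness: one possible approach, sketched here under stated assumptions, is to time the two substitution strategies against each other with the core Benchmark module. The values of $EOS and $PAP, the 250 synthetic abbreviations, the input text, and the function names are all made up for illustration; real measurements would need the module's own data and code paths.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed stand-ins for the module's internals.
my $EOS = "\001";
my $PAP = qr/[.!?]/;

# About 250 synthetic abbreviations, mirroring the list size in the report.
my @abbreviations = map { "abbr$_" } 1 .. 250;

# Combined regex, compiled once, in the style of the patch.
my $alternation = join '|', map { quotemeta } @abbreviations;
my $combined    = qr/(\b(?:$alternation)$PAP\s)$EOS/is;

# Synthetic marked-up input: each abbreviation is followed by a false
# end-of-sentence marker, then a genuine sentence end.
my $text = join '',
    map { "Word abbr$_. ${EOS}Next sentence. ${EOS}" } 1 .. 250;

# Old strategy: one substitution pass per abbreviation.
sub per_abbreviation {
    my $copy = $text;
    $copy =~ s/(\b\Q$_\E$PAP\s)$EOS/$1/isg for @abbreviations;
    return $copy;
}

# New strategy: one pass with the precompiled alternation.
sub single_regex {
    my $copy = $text;
    $copy =~ s/$combined/$1/g;
    return $copy;
}

# Check that both strategies agree before comparing their speed.
die "outputs differ" unless per_abbreviation() eq single_regex();

# Run each strategy for at least one CPU second and print a rate table.
cmpthese(-1, {
    per_abbreviation => \&per_abbreviation,
    single_regex     => \&single_regex,
});
```

The equality check before the timing run matters as much as the numbers: it confirms that the faster variant removes the same markers as the loop it replaces on this input.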

Subject: Re: [rt.cpan.org #104419] [PATCH] make abbreviation processing faster
Date: Sun, 24 May 2015 16:24:07 +0900
To: bug-Lingua-EN-Sentence [...] rt.cpan.org
From: Nathan Glenn <garfieldnate [...] gmail.com>
Sure, I can check it. You can send it to me or upload a dev release.

On Sun, May 24, 2015 at 9:59 AM, Kim Ryan via RT <bug-Lingua-EN-Sentence@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=104419 >
>
> OK then. I tried using patch but it rejected most of the changes, so
> did it manually in the end. Can you check this version? I'm not sure how
> to test for the slowness.
>
> Regards,
>
> Kim
Subject: Re: [rt.cpan.org #104419] [PATCH] make abbreviation processing faster
Date: Mon, 25 May 2015 11:10:34 +1000
To: bug-Lingua-EN-Sentence [...] rt.cpan.org
From: Kim Ryan <kimryan [...] bigpond.net.au>
The new code is there as an attachment to my last reply.

Regards,
Kim

On 24/05/2015 5:24 PM, garfieldnate@gmail.com via RT wrote:
> Queue: Lingua-EN-Sentence
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=104419 >
>
> Sure, I can check it. You can send it to me or upload a dev release.
Oops, sorry. It works great. Nice and fast! Thank you.

On Sun May 24 21:10:55 2015, kimryan@bigpond.net.au wrote:
> The new code is there as an attachment to my last reply.