Subject: overlapping greedy submatches vulnerable to unlucky data
Date: Tue, 30 Jul 2013 17:31:33 -0400
To: bug-Text-Autoformat [...] rt.cpan.org
From: Michael Hamlin <myrrhlin [...] gmail.com>
howdy,
I ran into a case of Text::Autoformat behaving badly in production, and
tracked it down; this patch (made against latest version 1.669003) fixes it:
463,464c463,465
< $newtext =~ /\s*([^\n]*)$/;
< $widow_okay = $para->{empty} || length($1) >= $args{widow};
---
> (my $widow) = $newtext =~ /([^\n]*)$/;
> $widow =~ s/^\s+//;
> $widow_okay = $para->{empty} || length($widow) >= $args{widow};
the original regex was taking over 9 minutes on a particularly bad email
we received with lots of tabs. we're (sadly) still running perl 5.8.8
on CentOS boxen (e.g. GNU/Linux 2.6.18-194.8.1.el5 #1 SMP
Thu Jul 1 19:04:48 EDT 2010 x86_64).
the regex match m/\s*([^\n]*)$/ is problematic because spaces
and tabs can match either of the greedy submatches. this overlap
means lots of permutations and backtracking for the regex engine.
doing the two bits of logic separately (get the last line, strip off
leading space before determining its length) avoids the issue, at
the expense of an extra lexical.
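as a rough standalone sketch of the two-step approach (the sub name
last_line_stripped and the sample string are made up for illustration,
they are not part of Text::Autoformat):

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Autoformat: return the last
# line of $text with its leading whitespace removed, in two steps so
# that no two adjacent greedy submatches compete for the same characters.
sub last_line_stripped {
    my ($text) = @_;
    my ($widow) = $text =~ /([^\n]*)$/;   # step 1: grab the final line
    $widow =~ s/^\s+//;                    # step 2: strip leading whitespace
    return $widow;
}

my $sample = "first line\n\t\t  last line";
print length(last_line_stripped($sample)), "\n";   # prints 9
```

the length printed here is what would be compared against $args{widow}
in the patched code.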
i hope this report is helpful, and thank you for great tools!
michael