Subject: [PATCH] Reporting URLs that triggered rules
When a URI matching rule fires, I need to know which URI in the message
triggered the match.
In theory, this is simple -- just have multiple URI rules, log the rule
name that fired, and then go from the rule name back to the URI.
In practice, this fails to scale. I have ~ 9000 URI matching rules.
When loaded, Perl + SA takes up ~ 75MB of RSS, and scanning is
significantly slowed.
If I use Regexp::Assemble to turn these URI regexps into one great big
(70KB!) regexp, memory usage drops to around 36MB and scanning time is
halved.
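A minimal sketch of the combining step, for illustration only. The three patterns and the sample URIs are hypothetical stand-ins for the real list, and the fallback branch (plain alternation) is there only so the script runs where Regexp::Assemble is not installed; it does not give the same size or speed benefit.

```perl
use strict;
use warnings;

# Hypothetical sample patterns; the real list has roughly 9000 entries.
my @patterns = (
    'example\.com/buy',
    'example\.com/bill',
    'example\.net/pills',
);

my $combined;
if (eval { require Regexp::Assemble; 1 }) {
    # Regexp::Assemble merges the patterns into one compact regexp.
    my $ra = Regexp::Assemble->new;
    $ra->add($_) for @patterns;
    $combined = $ra->re;
}
else {
    # Fallback, not equivalent in size or speed: plain alternation.
    my $alt = join '|', @patterns;
    $combined = qr/(?:$alt)/;
}

# Hypothetical URIs extracted from a message.
my @uris = (
    'http://example.com/buy?id=1',
    'http://example.org/index.html',
    'http://example.net/pills',
);

# With a single combined rule, the matching URI is the only detail left to log.
my @hits = grep { $_ =~ $combined } @uris;
print "matched: $_\n" for @hits;
```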
But if I do this there's only one URI rule. So when management say "Why
was this message blocked?" I can no longer tell them the details they've
grown accustomed to hearing (i.e., which URL triggered the block). This
also makes it difficult to track down issues where a URL block is
overzealous, and is blocking legitimate messages.
So... attached is a patch that stores each URI matched by a URI rule,
and provides a get_names_of_uris_hit() method to return that list. I
can then use this to log the URLs that triggered my single URI matching
rule. Best of both worlds.
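To make the bookkeeping concrete, here is a small self-contained sketch. MockStatus is a hypothetical stand-in, not the real Mail::SpamAssassin::PerMsgStatus; it models only the fields the patch touches, and it assumes that got_hit marks the rule in tests_already_hit. The rule name and URIs are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the patched PerMsgStatus object.
package MockStatus;

sub new {
    return bless {
        test_names_hit    => [],
        uris_hit          => [],
        tests_already_hit => {},
    }, shift;
}

# Simplified assumption: the real got_hit also deals with scores and reports.
sub got_hit {
    my ($self, $rulename) = @_;
    $self->{tests_already_hit}->{$rulename} = 1;
    push @{ $self->{test_names_hit} }, $rulename;
}

# Same shape as the patched method: skip repeat hits, record the URI, register the hit.
sub got_uri_pattern_hit {
    my ($self, $rulename, $uri) = @_;
    return if defined $self->{tests_already_hit}->{$rulename};
    push @{ $self->{uris_hit} }, $uri;
    $self->got_hit($rulename);
}

sub get_names_of_uris_hit {
    my ($self) = @_;
    return @{ $self->{uris_hit} };
}

package main;

my $status = MockStatus->new;
my $rule   = qr{example\.(?:com/buy|net/pills)};    # hypothetical combined rule

for my $uri ('http://example.org/', 'http://example.com/buy', 'http://example.net/pills') {
    $status->got_uri_pattern_hit('BIG_URI_RULE', $uri) if $uri =~ $rule;
}

my @logged = $status->get_names_of_uris_hit;
print "blocked by URI: $_\n" for @logged;
```

Under that assumption the once-per-mail guard returns before the push, so only the first matching URI is recorded for each rule. If every matching URI should be logged, the push would need to happen before the early return.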
The patch is against 2.63, I'm afraid, since that's what I have handy.
Forward porting to 3.1.1 should be trivial.
Subject: sa-uri.diff.txt
Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
===================================================================
--- lib/Mail/SpamAssassin/PerMsgStatus.pm (revision 14687)
+++ lib/Mail/SpamAssassin/PerMsgStatus.pm (working copy)
@@ -68,6 +68,7 @@
'test_logs' => '',
'test_names_hit' => [ ],
'subtest_names_hit' => [ ],
+ 'uris_hit' => [ ],
'tests_already_hit' => { },
'hdr_cache' => { },
'rule_errors' => 0,
@@ -364,6 +365,21 @@
###########################################################################
+=item @list = $status->get_names_of_uris_hit ()
+
+After a mail message has been checked, this method can be called. It will
+return a list of all the URIs that were hit by rules.
+
+=cut
+
+sub get_names_of_uris_hit {
+ my ($self) = @_;
+
+ return @{$self->{uris_hit}};
+}
+
+###########################################################################
+
=item $list = $status->get_names_of_subtests_hit ()
After a mail message has been checked, this method can be called. It will
@@ -1836,7 +1852,7 @@
foreach ( @_ ) {
'.$self->hash_line_for_rule($rulename).'
if ('.$pat.') {
- $self->got_uri_pattern_hit (q{'.$rulename.'});
+ $self->got_uri_pattern_hit (q{'.$rulename.'}, $_);
'. $self->ran_rule_debug_code ($rulename,"uri test", 4) . '
}
}
@@ -2315,12 +2331,13 @@
}
sub got_uri_pattern_hit {
- my ($self, $rulename) = @_;
+ my ($self, $rulename, $uri) = @_;
# only allow each test to hit once per mail
# TODO: Move this into the rule matcher
return if (defined $self->{tests_already_hit}->{$rulename});
+ push @{$self->{uris_hit}}, $uri;
$self->got_hit ($rulename, 'URI: ');
}