Preferred bug tracker

Please visit the preferred bug tracker to report your issue.

This queue is for tickets about the Mail-SpamAssassin CPAN distribution.

Report information
The Basics
Id: 18405
Status: resolved
Priority: 0/
Queue: Mail-SpamAssassin

People
Owner: Nobody in particular
Requestors: NIKC [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: (no value)



Subject: [PATCH] Reporting URLs that triggered rules
When a URI matching rule fires, I need to know which URI in the message triggered the match.

In theory, this is simple -- just have multiple URI rules, log the rule name that fired, and then go from the rule name back to the URI.

In practice, this fails to scale. I have ~9000 URI matching rules. When loaded, Perl + SA takes up ~75MB of RSS, and scanning is significantly slowed. If I use Regexp::Assemble to turn these URI regexps into one great big (70KB!) regexp, memory usage drops to around 36MB, and scanning time is halved.

But if I do this there's only one URI rule. So when management say "Why was this message blocked?" I can no longer tell them the details they've grown accustomed to hearing (i.e., which URL triggered the block). This also makes it difficult to track down cases where a URL block is overzealous and is blocking legitimate messages.

So... attached is a patch that stores each URI that a rule matches, and provides a get_names_of_uris_hit() method to return this information. I can then use this to log the URLs that triggered my single URI matching rule. Best of both worlds.

The patch is against 2.63, I'm afraid, since that's what I have handy. Forward porting to 3.1.1 should be trivial.
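The trade-off described above can be sketched in a few lines of core Perl. This is only an illustration under stated assumptions: the patterns and URIs are made up, and a plain alternation stands in for Regexp::Assemble (which would build a more compact regexp). The helper name uris_hit() is hypothetical and is not part of SpamAssassin.

```perl
use strict;
use warnings;

# Hypothetical blocklist patterns; the real ~9000 rules are not shown in the ticket.
my @patterns = (
    'badsite\.example',
    'spam-shop\.example/buy',
    'phish\d+\.example',
);

# Naive alternation stands in for Regexp::Assemble here (an assumption made
# to keep the sketch dependency-free).
my $joined   = join '|', map { "(?:$_)" } @patterns;
my $combined = qr/$joined/i;

# Return every URI matched by the single combined pattern, so the caller
# can still log which URI caused the hit.
sub uris_hit {
    my @uris = @_;
    return grep { $_ =~ $combined } @uris;
}

my @hits = uris_hit(
    'http://www.example.org/',
    'http://badsite.example/offer',
    'http://phish42.example/login',
);
print "URI hit: $_\n" for @hits;
```

With a single combined rule the rule name no longer identifies the URI, so the matched URI itself has to be recorded at match time, which is what the attached patch does inside SpamAssassin.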
Subject: sa-uri.diff.txt
Index: lib/Mail/SpamAssassin/PerMsgStatus.pm
===================================================================
--- lib/Mail/SpamAssassin/PerMsgStatus.pm	(revision 14687)
+++ lib/Mail/SpamAssassin/PerMsgStatus.pm	(working copy)
@@ -68,6 +68,7 @@
     'test_logs'         => '',
     'test_names_hit'    => [ ],
     'subtest_names_hit' => [ ],
+    'uris_hit'          => [ ],
     'tests_already_hit' => { },
     'hdr_cache'         => { },
     'rule_errors'       => 0,
@@ -364,6 +365,21 @@
 
 ###########################################################################
 
+=item @list = $status->get_names_of_uris_hit ()
+
+After a mail message has been checked, this method can be called. It will
+return a list of all the URIs that were hit by rules.
+
+=cut
+
+sub get_names_of_uris_hit {
+  my ($self) = @_;
+
+  return @{$self->{uris_hit}};
+}
+
+###########################################################################
+
 =item $list = $status->get_names_of_subtests_hit ()
 
 After a mail message has been checked, this method can be called. It will
@@ -1836,7 +1852,7 @@
       foreach ( @_ ) {
         '.$self->hash_line_for_rule($rulename).'
         if ('.$pat.') {
-          $self->got_uri_pattern_hit (q{'.$rulename.'});
+          $self->got_uri_pattern_hit (q{'.$rulename.'}, $_);
           '. $self->ran_rule_debug_code ($rulename,"uri test", 4) . '
         }
       }
@@ -2315,12 +2331,13 @@
 }
 
 sub got_uri_pattern_hit {
-  my ($self, $rulename) = @_;
+  my ($self, $rulename, $uri) = @_;
 
   # only allow each test to hit once per mail
   # TODO: Move this into the rule matcher
   return if (defined $self->{tests_already_hit}->{$rulename});
 
+  push @{$self->{uris_hit}}, $uri;
   $self->got_hit ($rulename, 'URI: ');
 }
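To make the bookkeeping in the patch easier to follow, here is a minimal stand-alone sketch that mimics it outside SpamAssassin. MockStatus is a hypothetical stand-in class, not Mail::SpamAssassin::PerMsgStatus, and the rule names and URIs are invented. The sketch also assumes that the real got_hit() marks the rule in tests_already_hit; the patch context does not show this.

```perl
use strict;
use warnings;

package MockStatus;    # hypothetical stand-in for the patched status object

sub new {
    my ($class) = @_;
    return bless { uris_hit => [], tests_already_hit => {} }, $class;
}

# Mirrors the patched got_uri_pattern_hit(): remember the URI, then record the hit.
sub got_uri_pattern_hit {
    my ($self, $rulename, $uri) = @_;

    # only allow each test to hit once per mail
    return if defined $self->{tests_already_hit}{$rulename};

    push @{ $self->{uris_hit} }, $uri;

    # Assumption: stands in for got_hit(), which is not shown in the patch.
    $self->{tests_already_hit}{$rulename} = 1;
    return;
}

# Same body as the accessor added by the patch.
sub get_names_of_uris_hit {
    my ($self) = @_;
    return @{ $self->{uris_hit} };
}

package main;

my $status = MockStatus->new;
$status->got_uri_pattern_hit('BIG_URI_RULE', 'http://badsite.example/a');
$status->got_uri_pattern_hit('BIG_URI_RULE', 'http://badsite.example/b');
print "$_\n" for $status->get_names_of_uris_hit;
```

In this sketch, and only under the assumption noted in the comments, the once-per-mail guard means a rule records the first URI that triggers it and ignores later ones in the same message.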
On Tue Mar 28 07:02:46 2006, NIKC wrote:
> When a URI matching rule fires, I need to know which URI in the message
> triggered the match.
Hi,

Thanks for the ticket. However, we don't use the CPAN RT to track issues/requests for the SpamAssassin code. Please open a ticket at http://bugzilla.spamassassin.org/ and we can go from there. :)