Subject: Skips From-line longer than 90 chars starting just before a chunk
Date: Wed, 30 Oct 2019 18:25:09 +0800
To: bug-Mail-Mbox-MessageParser [...] rt.cpan.org
From: Peter Nowee <peter [...] peternowee.com>
Hi,
I noticed that grepmail does not see a mail whose From-line (the line
that marks the start of another email in an mbox) is longer than 90
characters and starts just before a multiple of MessageParser's
`read_chunk_size` (default 20000). I observed this when the affected
mail is the last one in the file; it may also happen at other
positions.
Attached are two anonymized mbox files, mostly identical. They both
contain two emails:
    $ grep -c '^From ' *mbox
    right--last-mail-after-chunk.mbox:2
    wrong--last-mail-before-chunk.mbox:2
The only difference between the two files is the length of the first
email and thereby the start of the last email. It starts at byte 19885
in the 'wrong' mbox and at byte 20048 in the 'right' mbox. This leads
to a wrong count by grepmail:
    $ grepmail -r -e '.' *mbox
    right--last-mail-after-chunk.mbox: 2
    wrong--last-mail-before-chunk.mbox: 1
I found that increasing the value of `$backup_amount` in
`Mail/Mbox/MessageParser/Perl.pm`, line 216 (version 1.5111) solves
the problem for me.
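To illustrate why the starting offset matters, here is a toy model in Python. It is a sketch under stated assumptions, not the module's actual Perl logic: the scanner, its carry-over rule, and the helper names `count_messages` and `build_mbox` are all hypothetical. It assumes the parser reads fixed-size chunks and keeps only the last `backup` bytes of each chunk for the next round. Byte 19885 lies 115 bytes before the 20000 boundary, which is more than the 90-byte lookback, while byte 20048 lies entirely in the second chunk.

```python
import re

# A complete From-line: starts right after a newline and is followed by one.
FROM_LINE = re.compile(rb"\nFrom [^\n]*(?=\n)")


def count_messages(data: bytes, chunk_size: int = 20000, backup: int = 90) -> int:
    """Toy chunked scanner (hypothetical; NOT the module's real algorithm)."""
    count = 0
    carry = b"\n"  # virtual newline so a From-line at offset 0 can match
    for offset in range(0, len(data), chunk_size):
        buf = carry + data[offset:offset + chunk_size]
        last_end = 0
        for match in FROM_LINE.finditer(buf):
            count += 1
            last_end = match.end()
        # Assumption: only the last `backup` bytes survive into the next
        # round (and never a From-line that was already counted).
        carry = buf[max(last_end, len(buf) - backup):]
    return count


def build_mbox(second_from_offset: int, from_line_len: int = 130) -> bytes:
    """Build a synthetic two-mail mbox whose second From-line starts at the
    given byte offset and is from_line_len bytes long (newline included)."""
    first = b"From a@example.com Wed Oct 30 18:25:09 2019\nSubject: one\n\n"
    pad = second_from_offset - len(first)
    filler = (b"x" * 79 + b"\n") * (pad // 80 + 1)
    body = filler[:pad - 1] + b"\n"
    suffix = b"@example.com Wed Oct 30 18:25:09 2019\n"
    local = b"y" * (from_line_len - len(b"From ") - len(suffix))
    second = b"From " + local + suffix + b"Subject: two\n\nbody\n"
    return first + body + second


if __name__ == "__main__":
    print(count_messages(build_mbox(19885)))              # straddles the boundary
    print(count_messages(build_mbox(20048)))              # starts after the boundary
    print(count_messages(build_mbox(19885), backup=998))  # larger lookback
```

In this model the first call misses the second mail, because the From-line is incomplete in the first chunk and its start has already been dropped from the 90-byte carry-over. The other two calls find both mails. The value 998 is only an illustrative larger lookback.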
The value 90 is hardcoded in some other places as well:
Mail/Mbox/MessageParser/Grep.pm:239: # believe the RFC says header lines can be at most 90 characters long.
Mail/Mbox/MessageParser/Grep.pm:241: qr/$Mail::Mbox::MessageParser::Config{'from_pattern'}/m,90))
Mail/Mbox/MessageParser/Perl.pm:215: # believe the RFC says header lines can be at most 90 characters long.
Mail/Mbox/MessageParser/Perl.pm:216: my $backup_amount = 90;
Mail/Mbox/MessageParser/Perl.pm:221: # 90-character lookback, but doesn't indicate the start of the next email.
anonymize_mailbox:74: # RFC says header lines can be at most 90 characters long.
anonymize_mailbox:75: my $search_position = length($READ_BUFFER) - 90;
I could not quickly find where the maximum of 90 could have come from,
but I think that a 998-character limit should apply today:
> 2.1.1. Line Length Limits
> There are two limits that this specification places on the number of
> characters in a line. Each line of characters MUST be no more than
> 998 characters, and SHOULD be no more than 78 characters, excluding
> the CRLF.
-- [RFC 5322, Internet Message Format](https://tools.ietf.org/html/rfc5322)
I think it is not uncommon these days to have emails with a From-line
longer than 78 or 90 characters. For example, some automated mailers
seem to use dynamically generated addresses containing some long,
probably unique, string in the local part of their From-address.
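To check whether a given mbox contains such lines, something like the following Python sketch could be used. The function name `long_from_lines` and its return format are made up for illustration; it simply treats every line beginning with `From ` as a From-line, which is an assumption and looser than the pattern MessageParser uses.

```python
def long_from_lines(data: bytes, limit: int = 90, chunk_size: int = 20000):
    """Return (offset, length, bytes_to_next_boundary) for every line that
    starts with b"From " and is longer than `limit` (line ending excluded)."""
    results = []
    offset = 0
    for line in data.split(b"\n"):
        if line.startswith(b"From "):
            length = len(line.rstrip(b"\r"))
            if length > limit:
                # Distance from the line start to the next multiple of chunk_size.
                to_boundary = chunk_size - offset % chunk_size
                results.append((offset, length, to_boundary))
        offset += len(line) + 1  # +1 for the newline removed by split()
    return results


# Usage (the path is a placeholder):
#     with open("some.mbox", "rb") as handle:
#         for offset, length, to_boundary in long_from_lines(handle.read()):
#             print(offset, length, to_boundary)
```

A From-line whose distance to the next boundary is larger than 90 but smaller than its own length is a candidate for the behaviour described in this report.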
I did not file a pull request, because I cannot fully foresee the
possible side effects of increasing the limit more than tenfold, for
example on very short emails.
Used versions:
- Debian 10.1 buster, Linux 4.19.67-2+deb10u1 on amd64 with:
  - perl 5.28.1-6 (Debian package).
  - libmail-mbox-messageparser-perl 1.5111-2 (Debian package).
  - grepmail 5.3104-1 (Debian package), or
  - grepmail 5.3111 (coppit/grepmail master at commit 3fc994a of
    2018-07-12).
- Debian 9.11 stretch, Linux 4.9.189-3+deb9u1 on amd64 with:
  - perl 5.24.1-3+deb9u5 (Debian package).
  - libmail-mbox-messageparser-perl 1.5105-1 (Debian package).
  - grepmail 5.3033-8 (Debian package).
Hope this helps and thank you for your work!
Best regards,
Peter Nowee