This queue is for tickets about the Mail-Mbox-MessageParser CPAN distribution.

Report information
The Basics
Id: 130862
Status: new
Priority: 0/
Queue: Mail-Mbox-MessageParser

People
Owner: Nobody in particular
Requestors: peter [...] peternowee.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Skips From-line longer than 90 chars starting just before a chunk
Date: Wed, 30 Oct 2019 18:25:09 +0800
To: bug-Mail-Mbox-MessageParser [...] rt.cpan.org
From: Peter Nowee <peter [...] peternowee.com>
Hi,

I noticed that grepmail does not see a mail whose From-line (the line that marks the start of another email in an mbox) is longer than 90 characters and starts just before a multiple of MessageParser's `read_chunk_size` (default 20000). I observed this when the mail is the last one in the file; it may also happen at other positions.

Attached are two anonymized mbox files, mostly identical. They both contain two emails:

    $ grep -c '^From ' *mbox
    right--last-mail-after-chunk.mbox:2
    wrong--last-mail-before-chunk.mbox:2

The only difference between the two files is the length of the first email and thereby the start of the last email. It starts at byte 19885 in the 'wrong' mbox and at byte 20048 in the 'right' mbox. This leads to a wrong count by grepmail:

    $ grepmail -r -e '.' *mbox
    right--last-mail-after-chunk.mbox: 2
    wrong--last-mail-before-chunk.mbox: 1

I found that increasing the value of `$backup_amount` in `Mail/Mbox/MessageParser/Perl.pm`, line 216 (version 1.5111) solves the problem for me. The value 90 is hardcoded in some other places as well:

    Mail/Mbox/MessageParser/Grep.pm:239: # believe the RFC says header lines can be at most 90 characters long.
    Mail/Mbox/MessageParser/Grep.pm:241: qr/$Mail::Mbox::MessageParser::Config{'from_pattern'}/m,90))
    Mail/Mbox/MessageParser/Perl.pm:215: # believe the RFC says header lines can be at most 90 characters long.
    Mail/Mbox/MessageParser/Perl.pm:216: my $backup_amount = 90;
    Mail/Mbox/MessageParser/Perl.pm:221: # 90-character lookback, but doesn't indicate the start of the next email.
    anonymize_mailbox:74: # RFC says header lines can be at most 90 characters long.
    anonymize_mailbox:75: my $search_position = length($READ_BUFFER) - 90;

I could not quickly find where the maximum of 90 could have come from, but I think that a 998-character limit should apply today:
> 2.1.1. Line Length Limits
>
> There are two limits that this specification places on the number of
> characters in a line. Each line of characters MUST be no more than
> 998 characters, and SHOULD be no more than 78 characters, excluding
> the CRLF.
-- [RFC 5322, Internet Message Format](https://tools.ietf.org/html/rfc5322)

I think it is not uncommon these days to have emails with a From-line longer than 78 or 90 characters. For example, some automated mailers seem to use dynamically generated addresses containing some long, probably unique, string in the local part of their From-address.

I did not file a pull request, because I cannot really foresee the possible side effects of increasing the limit more than tenfold, for example for very short emails.

Used versions:

- Debian 10.1 buster, Linux 4.19.67-2+deb10u1 on amd64 with:
  - perl 5.28.1-6 (Debian package).
  - libmail-mbox-messageparser-perl 1.5111-2 (Debian package).
  - grepmail 5.3104-1 (Debian package), or
  - grepmail 5.3111 (coppit/grepmail master at commit 3fc994a of 2018-07-12).
- Debian 9.11 stretch, Linux 4.9.189-3+deb9u1 on amd64 with:
  - perl 5.24.1-3+deb9u5 (Debian package).
  - libmail-mbox-messageparser-perl 1.5105-1 (Debian package).
  - grepmail 5.3033-8 (Debian package).

Hope this helps and thank you for your work!

Best regards,
Peter Nowee
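
The boundary effect described in this report can be reproduced with a small toy model. The sketch below is not the module's Perl code: `count_from_lines`, `build_mbox`, and the `FROM_LINE` regex are hypothetical names, and the regex is a simplified stand-in for the real `from_pattern`. It assumes a reader that appends fixed-size chunks to a buffer and, after each read, resumes its search `backup_amount` bytes before the end of the previously held data. The offsets 19885 and 20048 come from the report; the From-line length of 130 bytes is an assumption chosen so that the line crosses the 20000-byte boundary.

```python
import re

# Simplified stand-in for from_pattern (assumption): one complete line
# that starts with "From ", followed by an address and more text.
FROM_LINE = re.compile(rb"^From \S+ .*\n", re.MULTILINE)


def count_from_lines(data: bytes, chunk_size: int, backup_amount: int) -> int:
    """Toy chunked scanner: count complete From-lines found in data."""
    buffer = b""
    search_pos = 0
    offset = 0
    count = 0
    while True:
        # Find every complete From-line at or after search_pos.
        # '^' only matches at the buffer start or right after a newline,
        # so a search that resumes mid-line cannot see that line.
        for match in FROM_LINE.finditer(buffer, search_pos):
            count += 1
            search_pos = match.end()
        if offset >= len(data):
            return count
        old_len = len(buffer)
        buffer += data[offset:offset + chunk_size]
        offset += chunk_size
        # Resume shortly before the old end of the buffer, in case a
        # From-line straddles the chunk boundary.
        search_pos = max(search_pos, old_len - backup_amount)


def build_mbox(second_from_offset: int, from_line_len: int) -> bytes:
    """Build a two-message mbox whose second From-line starts at a given byte."""
    first = b"From a@example.com Wed Oct 30 18:25:09 2019\n"
    filler = b"x" * (second_from_offset - len(first) - 1) + b"\n"
    suffix = b"@example.com Wed Oct 30 18:25:09 2019\n"
    local_part = b"u" * (from_line_len - len(b"From ") - len(suffix))
    second = b"From " + local_part + suffix
    return first + filler + second + b"Subject: x\n\nbody\n"


wrong = build_mbox(19885, 130)  # long From-line starts just before byte 20000
right = build_mbox(20048, 130)  # same From-line starts just after byte 20000

print(count_from_lines(wrong, 20000, 90))   # 1: second message is missed
print(count_from_lines(wrong, 20000, 400))  # 2
print(count_from_lines(right, 20000, 90))   # 2
```

In this model the From-line at byte 19885 is still incomplete when the first chunk ends, and the resumed search starts at byte 19910, which is already inside that line. A larger lookback lets the search restart before the line begins.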

Message body not shown because it is not plain text.

Subject: Re: [rt.cpan.org #130862] Skips From-line longer than 90 chars starting just before a chunk
Date: Wed, 30 Oct 2019 19:59:05 +0800
To: Bugs in Mail-Mbox-MessageParser via RT <bug-Mail-Mbox-MessageParser [...] rt.cpan.org>
From: Peter Nowee <peter [...] peternowee.com>
Correction on the limit of 998 I mentioned just now:

The From-line that separates emails in an mbox is actually not part of the email itself, so it is probably not governed by RFC 5322 after all.

RFC 4155 (The application/mbox Media Type) defines the mbox format, but does not seem to say anything about the maximum length of the From-line. It does say that the From-line consists of the email address and timestamp:

https://tools.ietf.org/html/rfc4155

Here is some discussion about the maximum length of an email address, which seems to be 254 characters, although others say 320:

https://stackoverflow.com/questions/386294/what-is-the-maximum-length-of-a-valid-email-address

The `from_pattern` in `Mail/Mbox/MessageParser/Config.pm` also mentions a possible additional string in the From-line for smail compatibility. I am not sure how long that can be.

So, maybe the mbox From-line will never reach 998 characters, but a safe choice would still probably be in the hundreds (400 or more).

Regards,
Peter Nowee
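
The estimate above can be checked with a little arithmetic. The sketch below is only a rough bound under stated assumptions: the From-line is taken to be `From `, one address, one space, and a 24-character asctime-style timestamp, with no line terminator and no extra smail-compatibility string (whose length is unknown). `from_line_bound` is a hypothetical helper name.

```python
# Assumed fixed parts of an mbox From-line (excluding the line terminator
# and any trailing smail-compatibility string, whose length is unknown).
FROM_PREFIX_LEN = len("From ")
TIMESTAMP_LEN = len("Wed Oct 30 18:25:09 2019")  # asctime-style, 24 characters
SEPARATOR_LEN = 1  # single space between address and timestamp


def from_line_bound(max_address_len: int) -> int:
    """Rough upper bound on From-line length for a given address limit."""
    return FROM_PREFIX_LEN + max_address_len + SEPARATOR_LEN + TIMESTAMP_LEN


for limit in (254, 320):
    print(limit, from_line_bound(limit))
```

With the two address limits mentioned in the discussion this gives 284 and 350 characters, which is consistent with picking a lookback of 400 or more as long as any extra trailing string stays short.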