Subject: Parser has trouble with Mbox messages which contain lines starting with 'From '
I'm using perl v5.8.4 on Debian sarge. This problem happens with both the 2.055 version in Debian's libmail-box-perl, which notably does not include the C parser, and a fresh 2.059 install from CPAN.
I've been using Mail::Box to convert mbox archives to Maildirs as part of our IMAP migration process. Some users have messages whose body contains a line that starts with 'From ' and ends with ', year-like-number'. These messages are misparsed and seen as two messages: one with the real headers and the body up to the line before the 'From ' line, and a second containing the remainder of the body and no headers.
Here's an example message which will be parsed as two messages:
--------------------------------------------------------------
From announcements@example.org Thu Apr 4 05:50:22 2002
Return-Path: <announcements@example.org>
From: Announcements <announcements@example.org>
To: Someone <someone@example.edu>
Subject: some message subject
Date: Thu, 4 Apr 2002 08:48:28 -0500
From something, 2002:
--------------------------------------------------------------
If that year falls outside a reasonable range (>= 3000), the message is correctly treated as a single message:
--------------------------------------------------------------
From announcements@example.org Thu Apr 4 05:50:22 2002
Return-Path: <announcements@example.org>
From: Announcements <announcements@example.org>
To: Someone <someone@example.edu>
Subject: some message subject
Date: Thu, 4 Apr 2002 08:48:28 -0500
From something, 3002:
--------------------------------------------------------------
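To make the difference between the two examples concrete, here is a minimal sketch of the kind of loose separator test that would produce this behaviour. It is an assumption for illustration only, not Mail::Box's actual code: looks_like_separator is a hypothetical name, and the 1000..2999 year window is simply a guess that fits the two cases above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a loose separator heuristic; NOT taken from Mail::Box.
# A line is treated as a separator if it starts with "From " and contains a
# standalone four-digit number beginning with 1 or 2 (an assumed "sane year").
sub looks_like_separator {
    my ($line) = @_;
    return $line =~ /^From .*\b[12]\d{3}\b/ ? 1 : 0;
}

my @lines = (
    'From announcements@example.org Thu Apr 4 05:50:22 2002',
    'From something, 2002:',
    'From something, 3002:',
);

# Show how each line from the examples would be classified.
for my $line (@lines) {
    printf "%-58s %s\n", $line, looks_like_separator($line) ? 'separator' : 'body';
}
```

Under this assumed heuristic the body line ending in 2002 is indistinguishable from a real separator, while the one ending in 3002 is not, which matches what I see.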
One defense against this might be sanity-checking against the Content-Length or Lines headers; the last such message I encountered had both.
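A rough sketch of what such a Content-Length check could look like follows. The helper split_mbox is hypothetical and not part of Mail::Box. It assumes Content-Length counts the body bytes after the blank line that ends the headers, and it only trusts the header when the declared length lands at end of file or just before another 'From ' line; otherwise it falls back to plain separator scanning.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Mail::Box API): split mbox text into messages,
# using Content-Length when it points at a plausible message boundary.
sub split_mbox {
    my ($text) = @_;
    my @messages;
    my ($pos, $len) = (0, length $text);

    while ($pos < $len) {
        # The header block ends at the first blank line.
        my $hdr_end = index($text, "\n\n", $pos);
        if ($hdr_end < 0) {
            push @messages, substr($text, $pos);   # no body at all
            last;
        }
        my $body_start = $hdr_end + 2;
        my $header     = substr($text, $pos, $hdr_end - $pos);

        my $end;
        if ($header =~ /^Content-Length:\s*(\d+)\s*$/mi) {
            my $candidate = $body_start + $1;
            # Trust the header only if the body is followed by EOF or "From ".
            if ($candidate <= $len
                && substr($text, $candidate) =~ /\A\n*(?:From |\z)/) {
                $end = $candidate;
            }
        }
        unless (defined $end) {
            # Fallback: naive scan for the next line starting with "From ".
            my $next = index($text, "\nFrom ", $body_start - 1);
            $end = $next < 0 ? $len : $next + 1;
        }

        push @messages, substr($text, $pos, $end - $pos);
        $pos = $end;
        $pos++ while $pos < $len && substr($text, $pos, 1) eq "\n";
    }
    return @messages;
}

my $body = "Hello,\nFrom something, 2002:\nBye.\n";
my $mbox = "From announcements\@example.org Thu Apr 4 05:50:22 2002\n"
         . "Subject: some message subject\n"
         . "Content-Length: " . length($body) . "\n"
         . "\n"
         . $body
         . "\n"
         . "From someone\@example.edu Thu Apr 4 06:00:00 2002\n"
         . "Subject: second\n"
         . "\n"
         . "Second body.\n";

my @msgs = split_mbox($mbox);
print scalar(@msgs), " messages\n";   # prints "2 messages"
```

The fallback branch is deliberately naive, so a message without Content-Length is still split at the embedded 'From ' line; a Lines header could be checked the same way by counting newlines instead of bytes.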