Subject: Long Subject lines with accented characters broken
Date: Sun, 16 Jun 2013 22:25:32 +0200
To: bug-Mail-Sender [...] rt.cpan.org
From: Andrew Jones <andrew.jones11235 [...] gmail.com>
(This mail should have been sent with UTF-8 encoding.)
Long, accented subject lines are broken.
Tested on:
Mail::Sender 0.8.21 (Fedora 18) perl v5.16.3
Mail::Sender 0.8.16 (Fedora 16) perl v5.14.3
I have no reason to believe the latest code addresses this bug.
Depending on the number of accents in the text, the treatment of subject lines
fails as the line gets longer. Each utf-8 accented character adds 5 extra
bytes to the header string, plus an overhead of about 12 bytes of utf-8 related
preamble and postamble, so this can break for subject lines that do not look
excessively long.
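The arithmetic above can be checked with a small sketch. The helper name q_encoded_length is hypothetical (it is not part of Mail::Sender), and the set of characters it treats as needing no escape is an assumption based on the RFC 2047 "Q" rules:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: length in bytes of one RFC 2047 "Q" encoded-word
# holding $text, including the '=?charset?Q?' preamble and '?=' postamble.
sub q_encoded_length {
    my ($text, $charset) = @_;
    my $len = 0;
    for my $byte (unpack 'C*', Encode::encode($charset, $text)) {
        # Letters, digits and a few safe symbols take 1 byte, and SPACE
        # becomes '_' (1 byte); every other byte is written as '=XX' (3 bytes).
        $len += (chr($byte) =~ m{[A-Za-z0-9!*+\-/ ]}) ? 1 : 3;
    }
    return length("=?$charset?Q?") + $len + length('?=');
}

my $plain    = q_encoded_length('e',      'utf-8');   # 10 + 1 + 2
my $accented = q_encoded_length("\x{E9}", 'utf-8');   # 10 + 6 + 2
printf "plain=%d accented=%d extra=%d\n", $plain, $accented, $accented - $plain;
```

This prints plain=13 accented=18 extra=5: 12 bytes of fixed overhead, and 5 more bytes for 'é' than for 'e'.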
To demonstrate, here is a subject line sent by Mail::Sender that works, although
it looks wrong:
Subject: =?utf-8?Q?Subject contains utf-8 (=C3=A9) and gets longer123456789
123456789 12345678?=
(In case it was mangled by the mail system: the line wrapped after 'longer123456789'.)
And this is the next one in the series (which is broken):
Subject: =?utf-8?Q?Subject contains utf-8 (=C3=A9) and gets longer123456789
123456789 12345678= 9?=
The final '9' is never displayed, even on later mails in the series.
As a workaround I tried to format a header correctly with the name 'Subject:'
(formatting code is below) and leave the subject blank. This worked perfectly
with my Evolution mail client but not with Outlook, which stored the first
Subject line it saw and silently discarded any later headers identified as
'Subject:'.
So I had to patch Mail::Sender by creating a new parameter for
Mail::Sender::Open etc., which I called fmtsubj:
my $stat = $sender->Open({
    to       => $recipient,
    encoding => $encoding,
    charset  => $charset,
    ctype    => 'text/html',
    fmtsubj  => $formattedSubject,
});
where:
my $formattedSubject = code_header('Subject:', $subjectTextWithAccentsPerlFormatNotUnicode, {charset => $headercharset, encoding => $headercoding });
(The source for code_header appears further down.)
The necessary changes to Sender.pm were:
This -
$self->{'subject'} = "<No subject>" unless defined $self->{'subject'};
print_hdr $s, "Subject" => $self->{'subject'}, $self->{'charset'};
Becomes this -
if (defined $self->{'fmtsubj'}) {
    print $s $self->{'fmtsubj'};
}
else {
    $self->{'subject'} = "<No subject>" unless defined $self->{'subject'};
    print_hdr $s, "Subject" => $self->{'subject'}, $self->{'charset'};
}
These lines occur twice in Sender.pm. I have only tested the patch with my particular
implementation, where it works perfectly. However, I make no guarantees for anyone else.
#----code_header--------------------------------------------------------------------
# Calling syntax: $formattedSubject = code_header('Subject:', $headercontents, {charset => 'iso-8859-1', encoding => 'Quoted-Printable' });
# If not supplied the default values of charset will be utf-8 and encoding Base64
use Encode ();
use MIME::Base64 ();
use MIME::QuotedPrint ();

sub code_header {
    my ($caption, $headerDataRaw, $parptr) = @_;
    my $headercharset  = $parptr->{charset}  // 'utf-8';  # '//' needs perl 5.10 or later http://perldoc.perl.org/perl5100delta.html#Defined-or-operator
    my $headerencoding = $parptr->{encoding} // 'Base64'; # needs perl 5.10 or later
    my @headlines = ();
    my $headerPreamble = ' =?';
    if ($headercharset =~ /utf \-? 8/ix) {
        $headerPreamble .= 'utf-8?';
        $headercharset = 'utf-8';
    }
    elsif ($headercharset =~ /iso \- 8859 \- 1 \Z/ix) {
        $headerPreamble .= 'iso-8859-1?';
        $headercharset = 'iso-8859-1';
    }
    else {
        die "Currently unhandled charset '$headercharset' called for in code_header"; # (Roll your own)
    }
    if ( $headerencoding =~ /Base64/i ) {
        $headerPreamble .= 'B?';
        $headerencoding = 'Base64';
    }
    elsif ( $headerencoding =~ /Quoted \- Printable/ix ) {
        $headerPreamble .= 'Q?';
        $headerencoding = 'Quoted-Printable';
    }
    else {
        die "Currently unhandled encoding '$headerencoding' called for in code_header"; # I only know about Base64 and Quoted-Printable
    }
    my $headerPostamble = '?=';
    my $start = 0; # start of substring of header data
    # RFC 2045 P20 QP: The Quoted-Printable encoding REQUIRES that encoded lines be no more than 76 characters long.
    # RFC 2045 P21 QP: The 76 character limit does not count the trailing CRLF, but counts all other characters, including any equal signs.
    # RFC 2045 P25 B64: The encoded output stream must be represented in lines of no more than 76 characters each.
    # RFC 2047 section 2: each line of a header field that contains encoded-words is limited to 76 characters.
    my $maxBytesPerLine = 76;
    my $headerLine = $caption; # Initialize the first header line (caption should include the ':' but not the following space)
    my $headerLengthChars = length($headerDataRaw); # Length of header in *characters* - not encoded bytes
    my $headerDone = 0;
    while (!$headerDone) {
        my $lineDone = 0;
        $headerLine .= $headerPreamble;
        my $bytesFree = $maxBytesPerLine - length($headerLine) - length($headerPostamble);
        die "Caption '$caption' leaves no room for header data in code_header" if $bytesFree < 4;
        #----Base64
        if ('Base64' eq $headerencoding) {
            my $blocksFree = int ($bytesFree / 4);   # It requires 4 bytes to encode 3 8bit characters
            my $maxCharsToDecode = 3 * $blocksFree;  # although might not be enough space if non-ascii characters present
            my $length = ($start + $maxCharsToDecode < $headerLengthChars) ? $maxCharsToDecode : ($headerLengthChars - $start);
            while (!$lineDone) {
                my $teststringRaw = substr($headerDataRaw, $start, $length);
                my $teststringENC = Encode::encode($headercharset, $teststringRaw);
                my $temp = MIME::Base64::encode($teststringENC);
                chomp $temp;
                if (length($temp) <= $bytesFree) { # the encoded data fits in the space available
                    $headerLine .= ($temp . $headerPostamble);
                    push @headlines, $headerLine;
                    $headerLine = '';
                    $start += $length;
                    $lineDone = 1;
                }
                else { # the encoded data does not fit in the space available
                    $length--; # shorten the substring by 1 *character* and try again
                    die "No room for any header data in code_header" if $length < 1; # avoid looping forever
                }
            } # while (!$lineDone)
            $headerDone = 1 if ($start >= $headerLengthChars); # should never be greater but just in case...
        } # if ('Base64' eq $headerencoding)
        #----Quoted-Printable
        elsif ('Quoted-Printable' eq $headerencoding) {
            my $maxCharsToDecode = $bytesFree;
            my $length = ($start + $maxCharsToDecode < $headerLengthChars) ? $maxCharsToDecode : ($headerLengthChars - $start);
            while (!$lineDone) {
                my $teststringRaw = substr($headerDataRaw, $start, $length);
                my $teststringENC = Encode::encode($headercharset, $teststringRaw);
                my $temp = MIME::QuotedPrint::encode($teststringENC);
                chomp $temp;
                $temp =~ s/=\Z//; # drop the trailing soft line break
                # rfc2047 P6 The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be represented as "_" (underscore, ASCII 95.).
                # A literal '?' or '_' in the text must therefore be escaped before spaces are turned into '_'.
                $temp =~ s/([?_])/sprintf('=%02X', ord $1)/ge;
                $temp =~ s/=20| /_/g;
                if (length($temp) <= $bytesFree) { # the encoded data fits in the space available
                    $headerLine .= ($temp . $headerPostamble);
                    push @headlines, $headerLine;
                    $headerLine = '';
                    $start += $length;
                    $lineDone = 1;
                }
                else {
                    $length--; # shorten the substring by 1 *character* and try again
                    die "No room for any header data in code_header" if $length < 1; # avoid looping forever
                }
            } # while (!$lineDone)
            $headerDone = 1 if ($start >= $headerLengthChars); # should never be greater than but just in case...
        } # elsif ('Quoted-Printable' eq $headerencoding)
        #----
    } # while (!$headerDone)
    return join("\x0d\x0a", @headlines) . "\x0d\x0a";
}
#-----------------------------------------------------------------------
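The space calculation in the Base64 branch can be illustrated on its own. The following is only a sketch with made-up variable names, assuming the first header line starts with 'Subject: =?utf-8?B?':

```perl
use strict;
use warnings;
use MIME::Base64 ();

my $max_line  = 76;
my $preamble  = 'Subject: =?utf-8?B?';   # caption plus ' =?utf-8?B?' on the first line
my $postamble = '?=';

my $bytes_free = $max_line - length($preamble) - length($postamble);
my $blocks     = int($bytes_free / 4);   # whole 4-byte Base64 groups that fit
my $max_raw    = 3 * $blocks;            # raw (already encoded) bytes those groups carry

# The second argument '' stops encode_base64 from appending a newline.
my $encoded = MIME::Base64::encode_base64('x' x $max_raw, '');
printf "free=%d blocks=%d raw=%d encoded=%d\n",
    $bytes_free, $blocks, $max_raw, length $encoded;
```

This prints free=55 blocks=13 raw=39 encoded=52. The limit is in bytes after charset encoding, which is why code_header has to retry with fewer characters when the text contains multi-byte utf-8 characters.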
Examples of subject lines produced by the above follow:
Subject line:
1234567891àçéïñôßü92123àçéïñôßü23456789412345678951234567896123456789712345678981234567899123456789
utf-8 Base64:
Subject: =?utf-8?B?MTIzNDU2Nzg5McOgw6fDqcOvw7HDtMOfw7w5MjEyM8Ogw6fDqcOv?=
 =?utf-8?B?w7HDtMOfw7wyMzQ1Njc4OTQxMjM0NTY3ODk1MTIzNDU2Nzg5NjEyMzQ1Njc4?=
 =?utf-8?B?OTcxMjM0NTY3ODk4MTIzNDU2Nzg5OTEyMzQ1Njc4OQ==?=
iso-8859-1 Base64:
Subject: =?iso-8859-1?B?MTIzNDU2Nzg5MeDn6e/x9N/8OTIxMjPg5+nv8fTf/DIzNDU2?=
 =?iso-8859-1?B?Nzg5NDEyMzQ1Njc4OTUxMjM0NTY3ODk2MTIzNDU2Nzg5NzEyMzQ1Njc4?=
 =?iso-8859-1?B?OTgxMjM0NTY3ODk5MTIzNDU2Nzg5?=
iso-8859-1 Quoted-Printable:
Subject: =?iso-8859-1?Q?1234567891=E0=E7=E9=EF=F1=F4=DF=FC92123=E0=E7=E9?=
 =?iso-8859-1?Q?=EF=F1=F4=DF=FC2345678941234567895123456789612345678971234?=
 =?iso-8859-1?Q?5678981234567899123456789?=
utf-8 Quoted-Printable:
Subject: =?utf-8?Q?1234567891=C3=A0=C3=A7=C3=A9=C3=AF=C3=B1=C3=B4=C3=9F?=
 =?utf-8?Q?=C3=BC92123=C3=A0=C3=A7=C3=A9=C3=AF=C3=B1=C3=B4=C3=9F=C3=BC2345?=
 =?utf-8?Q?678941234567895123456789612345678971234567898123456789912345678?=
 =?utf-8?Q?9?=
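As an independent cross-check (not part of the patch), the 'MIME-Header' encoding that ships with the Encode module can produce and decode headers of this kind. This sketch assumes Encode's implementation is acceptable for comparison; its folding may differ in detail from the output of code_header:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Build the test subject as perl characters (decode first, as in perlunitut).
my $accents = decode('iso-8859-1', "\xE0\xE7\xE9\xEF\xF1\xF4\xDF\xFC");
my $subject = "1234567891${accents}92123${accents}"
            . '23456789412345678951234567896123456789712345678981234567899123456789';

my $header  = encode('MIME-Header', $subject);   # folded encoded-words, ASCII only
my $decoded = decode('MIME-Header', $header);    # back to perl characters

print "$_\n" for split /\r?\n/, $header;
print $decoded eq $subject ? "round trip OK\n" : "round trip FAILED\n";
```

If the round trip succeeds here but a mail client still shows a damaged subject, the problem is more likely in how the header was folded or sent than in the encoded-words themselves.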
I have followed the rule in perlunitut:
I/O flow (the actual 5 minute tutorial)
The typical input/output flow of a program is:
1. Receive and decode
2. Process
3. Encode and output
So for the test code I used for the above, Emacs saved the source file in utf8,
and the first thing I had to do was decode it to perl's internal format:
my $acntstring = '1234567891àçéïñôßü92123àçéïñôßü23456789412345678951234567896123456789712345678981234567899123456789';
$acntstring = decode('utf-8', $acntstring);
I forgot that step at one point and spent a day trying to work out why my subject lines were all
double-encoded UTF-8 and displaying garbage.
I have relied on Mail::Sender for years and I hope I have been able to give back something of value.
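The double-encoding effect is easy to reproduce in isolation. A minimal sketch, using 'é' as the sample character and illustrative variable names:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xC3\xA9";                  # 'é' as the two raw UTF-8 bytes from the source file

my $double = encode('UTF-8', $bytes);    # decode step forgotten: each byte is treated as a character
my $chars  = decode('UTF-8', $bytes);    # correct first step: one character, U+00E9
my $single = encode('UTF-8', $chars);    # encodes back to the original two bytes

printf "double=%s single=%s\n",
    uc unpack('H*', $double), uc unpack('H*', $single);
```

This prints double=C383C2A9 single=C3A9. The four-byte form is what a mail client then displays as two garbage characters in place of 'é'.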
Best regards
Andy Jones