Bug #86195 for Mail-Sender: Long Subject lines with accented characters broken

Sun Jun 16 16:26:12 2013 andrew.jones11235 [...] gmail.com - Ticket created

Subject:	Long Subject lines with accented characters broken
Date:	Sun, 16 Jun 2013 22:25:32 +0200
To:	bug-Mail-Sender [...] rt.cpan.org
From:	Andrew Jones <andrew.jones11235 [...] gmail.com>

(This mail should have been sent with UTF encoding)

Long, accented subject lines are broken.

Tested on:
  Mail::Sender 0.8.21 (Fedora 18) perl v5.16.3
  Mail::Sender 0.8.16 (Fedora 16) perl v5.14.3

I have no reason to believe the latest code addresses this bug.

Depending on the number of accents in the text, the treatment of subject lines fails as the line gets longer. Each utf-8 accented character adds 5 extra bytes to the header string, plus an overhead of about 12 bytes of utf-8 related preamble and postamble, so this can break for subject lines that do not look excessively long.

To demonstrate, here is a subject line sent by Mail::Sender that works although it looks wrong:

  Subject: =?utf-8?Q?Subject contains utf-8 (=C3=A9) and gets longer123456789
   123456789 12345678?=

(in case it was mangled by the mail system: the line wrapped after 'longer123456789')

And this is the next one in the series (which is broken):

  Subject: =?utf-8?Q?Subject contains utf-8 (=C3=A9) and gets longer123456789
   123456789 12345678=
   9?=

The final '9' is never displayed, even on later mails in the series.

As a workaround I tried to format a header correctly with the name 'Subject:' (formatting code is below) and leave the subject blank. This worked perfectly with my Evolution mail client but not with Outlook, which stored the first Subject line it saw and silently discarded any later headers identified as 'Subject:'.

So I had to patch Mail::Sender by creating a new parameter for Mail::Sender::Open etc., which I called fmtsubj:

  my $stat = $sender->Open({
      to       => $recipient,
      encoding => $encoding,
      charset  => $charset,
      ctype    => 'text/html',
      fmtsubj  => $formattedSubject,
  });

where:

  my $formattedSubject = code_header('Subject:', $subjectTextWithAccentsPerlFormatNotUnicode, {charset => $headercharset, encoding => $headercoding });

(source for code_header appears further down)

The necessary changes to Sender.pm were:

This -

  $self->{'subject'} = "<No subject>" unless defined $self->{'subject'};
  print_hdr $s, "Subject" => $self->{'subject'}, $self->{'charset'};

Becomes this -

  if (defined $self->{'fmtsubj'}) {
      print $s $self->{'fmtsubj'};
  }
  else {
      $self->{'subject'} = "<No subject>" unless defined $self->{'subject'};
      print_hdr $s, "Subject" => $self->{'subject'}, $self->{'charset'};
  }

This happens twice in the code. I have only tested it for my particular implementation and it works perfectly. However, I make no guarantees for anyone else.

  #----code_header--------------------------------------------------------------------
  # Calling syntax: $formattedSubject = code_header('Subject:', $headercontents, {charset => 'iso-8859-1', encoding => 'Quoted-Printable' });
  # If not supplied the default values of charset will be utf-8 and encoding Base64
  use Encode ();            # Encode::encode
  use MIME::Base64 ();      # MIME::Base64::encode
  use MIME::QuotedPrint (); # MIME::QuotedPrint::encode

  sub code_header {
      my ($caption, $headerDataRaw, $parptr) = @_;
      my $headercharset = $parptr->{charset} // 'utf-8'; # needs perl 5.10 http://perldoc.perl.org/perl5100delta.html#Defined-or-operator
      my $headerencoding = $parptr->{encoding} // 'Base64'; # needs perl 5.10
      my @headlines = ();
      my $headerPreamble = ' =?';
      if ($headercharset =~ /utf \-? 8/ix) {
          $headerPreamble .= 'utf-8?';
          $headercharset = 'utf-8';
      }
      elsif ($headercharset =~ /iso \- 8859 \- 1 \Z/ix) {
          $headerPreamble .= 'iso-8859-1?';
          $headercharset = 'iso-8859-1';
      }
      else {
          die "Currently unhandled charset '$headercharset' called for in code_header"; # (Roll your own)
      }
      if ( $headerencoding =~ /Base64/i ) {
          $headerPreamble .= 'B?';
          $headerencoding = 'Base64';
      }
      elsif ( $headerencoding =~ /Quoted \- Printable/ix ) {
          $headerPreamble .= 'Q?';
          $headerencoding = 'Quoted-Printable';
      }
      else {
          die "Currently unhandled encoding '$headerencoding' called for in code_header"; # I only know about Base64 and Quoted-Printable
      }
      my $headerPostamble = '?=';
      my $start = 0; # start of substring of header data
      # RFC 2045 P20 QP: The Quoted-Printable encoding REQUIRES that encoded lines be no more than 76 characters long.
      # RFC 2045 P21 QP: The 76 character limit does not count the trailing CRLF, but counts all other characters, including any equal signs.
      # RFC 2045 P25 B64: The encoded output stream must be represented in lines of no more than 76 characters each.
      my $maxBytesPerLine = 76;
      my $headerLine = $caption; # Initialize the first header line (caption should include the ':' but not the following space)
      my $headerLengthChars = length($headerDataRaw); # Length of header in *characters* - not encoded bytes
      my $headerDone = 0;
      while (!$headerDone) {
          my $lineDone = 0;
          $headerLine .= $headerPreamble;
          my $bytesFree = $maxBytesPerLine - length($headerLine) - length($headerPostamble);
          #----Base64
          if ('Base64' eq $headerencoding) {
              my $blocksFree = int ($bytesFree / 4); # It requires 4 bytes to encode 3 8bit characters
              my $maxCharsToDecode = 3 * $blocksFree; # although might not be enough space if non-ascii characters present
              my $length = ($start + $maxCharsToDecode < $headerLengthChars) ? $maxCharsToDecode : ($headerLengthChars - $start);
              while (!$lineDone) {
                  my $teststringRaw = substr($headerDataRaw, $start, $length);
                  my $teststringENC = Encode::encode($headercharset, $teststringRaw);
                  my $temp = MIME::Base64::encode($teststringENC);
                  chomp $temp;
                  if (length($temp) <= $bytesFree) { # the encoded data fits in the space available
                      $headerLine .= ($temp . $headerPostamble);
                      push @headlines, $headerLine;
                      $headerLine = '';
                      $start += $length;
                      $lineDone = 1;
                  }
                  else { # the encoded data does not fit in the space available
                      $length--; # shorten the substring by 1 *character* and try again
                  }
              } # while (!$lineDone)
              $headerDone = 1 if ($start >= $headerLengthChars); # should never be greater but just in case...
          } # if ('Base64' eq $headerencoding)
          #----Quoted-Printable
          elsif ('Quoted-Printable' eq $headerencoding) {
              my $maxCharsToDecode = $bytesFree;
              my $length = ($start + $maxCharsToDecode < $headerLengthChars) ? $maxCharsToDecode : ($headerLengthChars - $start);
              # rfc2047 P6 The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be represented as "_" (underscore, ASCII 95.).
              $headerDataRaw =~ s/ /_/g;
              while (!$lineDone) {
                  my $teststringRaw = substr($headerDataRaw, $start, $length);
                  my $teststringENC = Encode::encode($headercharset, $teststringRaw);
                  my $temp = MIME::QuotedPrint::encode($teststringENC);
                  chomp $temp;
                  $temp =~ s/=\Z//;
                  if (length($temp) <= $bytesFree) { # the encoded data fits in the space available
                      $headerLine .= ($temp . $headerPostamble);
                      push @headlines, $headerLine;
                      $headerLine = '';
                      $start += $length;
                      $lineDone = 1;
                  }
                  else {
                      $length--; # shorten the substring by 1 *character* and try again
                  }
              } # while (!$lineDone)
              $headerDone = 1 if ($start >= $headerLengthChars); # should never be greater than but just in case...
          } # elsif ('Quoted-Printable' eq $headerencoding)
          #----
      } # while (!$headerDone)
      return join("\x0d\x0a", @headlines) . "\x0d\x0a";
  }
  #-----------------------------------------------------------------------

Examples of subject lines produced by the above follow.

Subject line:

  1234567891àçéïñôßü92123àçéïñôßü23456789412345678951234567896123456789712345678981234567899123456789

utf-8 Base64:

  Subject: =?utf-8?B?MTIzNDU2Nzg5McOgw6fDqcOvw7HDtMOfw7w5MjEyM8Ogw6fDqcOv?=
   =?utf-8?B?w7HDtMOfw7wyMzQ1Njc4OTQxMjM0NTY3ODk1MTIzNDU2Nzg5NjEyMzQ1Njc4?=
   =?utf-8?B?OTcxMjM0NTY3ODk4MTIzNDU2Nzg5OTEyMzQ1Njc4OQ==?=

iso-8859-1 Base64:

  Subject: =?iso-8859-1?B?MTIzNDU2Nzg5MeDn6e/x9N/8OTIxMjPg5+nv8fTf/DIzNDU2?=
   =?iso-8859-1?B?Nzg5NDEyMzQ1Njc4OTUxMjM0NTY3ODk2MTIzNDU2Nzg5NzEyMzQ1Njc4?=
   =?iso-8859-1?B?OTgxMjM0NTY3ODk5MTIzNDU2Nzg5?=

iso-8859-1 Quoted-Printable:

  Subject: =?iso-8859-1?Q?1234567891=E0=E7=E9=EF=F1=F4=DF=FC92123=E0=E7=E9?=
   =?iso-8859-1?Q?=EF=F1=F4=DF=FC2345678941234567895123456789612345678971234?=
   =?iso-8859-1?Q?5678981234567899123456789?=

utf-8 Quoted-Printable:

  Subject: =?utf-8?Q?1234567891=C3=A0=C3=A7=C3=A9=C3=AF=C3=B1=C3=B4=C3=9F?=
   =?utf-8?Q?=C3=BC92123=C3=A0=C3=A7=C3=A9=C3=AF=C3=B1=C3=B4=C3=9F=C3=BC2345?=
   =?utf-8?Q?678941234567895123456789612345678971234567898123456789912345678?=
   =?utf-8?Q?9?=

I have followed the rule in perlunitut:

  I/O flow (the actual 5 minute tutorial)
  The typical input/output flow of a program is:
    1. Receive and decode
    2. Process
    3. Encode and output

So for the test code I used for the above, Emacs saved the source file in utf8 and the first thing I had to do was decode it to perl's internal format:

  my $acntstring = '1234567891àçéïñôßü92123àçéïñôßü23456789412345678951234567896123456789712345678981234567899123456789';
  $acntstring = decode('utf-8', $acntstring);

I forgot that step at one point and spent a day trying to work out why my subject lines were all double-encoded utf and displaying garbage.

I have relied on Mail::Sender for years and I hope I have been able to give back something of value.

Best regards
Andy Jones
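The decode-first rule and the byte overhead described in this report can be illustrated with a minimal sketch. It is not part of the reporter's code; the variable names are hypothetical, and it only assumes the core modules Encode and MIME::QuotedPrint.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use MIME::QuotedPrint qw(encode_qp);

# "é" as two raw UTF-8 bytes, the way it arrives from a UTF-8 source
# file or input stream that has not been decoded yet.
my $raw_bytes = "\xC3\xA9";

# 1. Receive and decode: one character in Perl's internal format.
my $char = decode('UTF-8', $raw_bytes);

# 3. Encode and output: two bytes again, six once quoted-printable encoded.
my $good = encode_qp(encode('UTF-8', $char), '');

# Skipping the decode step treats each raw byte as a character of its own,
# so the text ends up UTF-8 encoded twice.
my $double = encode_qp(encode('UTF-8', $raw_bytes), '');

printf "decoded first: %s (%d bytes)\n", $good,   length $good;    # =C3=A9 (6 bytes)
printf "not decoded:   %s (%d bytes)\n", $double, length $double;  # =C3=83=C2=A9 (12 bytes)
```

One displayed character costing six header bytes is the "5 extra bytes" mentioned above; the doubled form is what shows up as garbage in the mail client.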

Thu Nov 07 10:06:33 2013 jpl [...] plosquare.com - Correspondence added

From:	jpl [...] plosquare.com

I just ran into this bug as well: a long subject with non-ASCII characters was delivered as gibberish.

Please consider applying the attached two-line patch to print_hdr instead. The extra parameter for encode_qp, which prevents it from outputting line breaks, has been in MIME::QuotedPrint since 2009 (so it should be backward-compatible enough).
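A short sketch of the difference the extra encode_qp argument makes. This is an illustration rather than code from Mail::Sender: the sample data and variable names are made up, and the trailing-break cleanup is modelled on the `s/=\r?\n$//` line that the patch removes.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(encode_qp);

# 20 two-byte characters: 40 bytes of input, 120 characters once encoded.
my $bytes = "\xC3\xA9" x 20;

# Default call: soft line breaks ("=\n") are inserted to keep lines short.
my $default = encode_qp($bytes);

# Only the trailing soft break is stripped; one in the middle survives.
(my $trimmed = $default) =~ s/=\r?\n$//;

# Empty string as the second argument: no soft line breaks at all.
my $unbroken = encode_qp($bytes, '');

my $breaks_left = () = $trimmed  =~ /=\n/g;
my $newlines    = () = $unbroken =~ /\n/g;
print "soft breaks left after trimming the default output: $breaks_left\n";
print "newlines with '' as the second argument: $newlines\n";
```

A leftover soft break inside the encoded-word is consistent with the `12345678= 9?=` subject shown in the original report.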

Subject:	print_hdr_qp_encoding.patch

--- orig/Mail-Sender-0.8.22/Sender.pm 2012-12-12 18:29:40.000000000 +0100
+++ new/Mail-Sender-0.8.22/Sender.pm 2013-11-07 16:00:05.883773811 +0100
@@ -181,8 +181,7 @@
         my @parts = split /(\s*[,;<>]\s*)/, $str;
         for (@parts) {
             next unless /[^[:ascii:]]/;
-            $_ = encode_qp($_);
-            s/=\r?\n$//;
+            $_ = encode_qp($_, '');
             s/(\s)/'=' . sprintf '%x',ord($1)/ge;
             $_ = "=?$charset?Q?" . $_ . "?=";
         }

Thu Nov 07 10:06:33 2013 The RT System itself - Status changed from 'new' to 'open'

Sat Dec 07 18:29:06 2013 jpl [...] plosquare.com - Correspondence added

From:	jpl [...] plosquare.com

I noticed another problem: some (older?) versions of Outlook seem to adhere strictly to RFC 2047, which calls for a maximum length of 75 characters for an encoded-word. In such versions the subject would appear garbled and be displayed to the recipient in its raw form, whereas other mail clients (mutt, Seamonkey) would present it correctly.

To deal with this I updated the patch so that individual input words rather than the entire header are encoded (it's still not perfect: a very long input word with special characters would cause problems).

Also of interest might be Email::MIME::RFC2047::Encoder, but I abandoned it after quick tests, as it seemed to double-encode umlaut characters for me and I didn't want another dependency.
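The 75-character limit can be checked mechanically. The helper below is a hypothetical sketch, not part of Mail::Sender or of the attached patch; `overlong_encoded_words` and its simplified regular expression are assumptions made for illustration.

```perl
use strict;
use warnings;

# Return every encoded-word in a header value that is longer than $limit
# characters (75 by default, the RFC 2047 maximum mentioned above).
# The pattern is a simplified matcher for =?charset?B-or-Q?text?=.
sub overlong_encoded_words {
    my ($header, $limit) = @_;
    $limit //= 75;
    my @words = $header =~ /(=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=)/g;
    return grep { length($_) > $limit } @words;
}

# One encoded-word for a whole subject: 10 + 120 + 2 = 132 characters.
my $whole_header = '=?utf-8?Q?' . ('=C3=A9' x 20) . '?=';

# One encoded-word per input word, as the updated patch aims for.
my $per_word = '=?utf-8?Q?caf=C3=A9?= au lait';

printf "whole header: %d overlong\n", scalar overlong_encoded_words($whole_header);
printf "per word:     %d overlong\n", scalar overlong_encoded_words($per_word);
```

Encoding word by word keeps most encoded-words short, but a single very long word can still exceed the limit, which is the caveat noted in the message.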

Subject:	print_hdr_qp_encoding.patch

--- orig/Mail-Sender-0.8.22/Sender.pm 2013-12-08 00:18:57.686853067 +0100
+++ new/Mail-Sender-0.8.22/Sender.pm 2013-12-08 00:19:54.046353555 +0100
@@ -178,11 +178,11 @@
     if ($charset && $str =~ /[^[:ascii:]]/) {
         $str = encode( $charset, $str);

-        my @parts = split /(\s*[,;<>]\s*)/, $str;
+        my @parts = split /(\s*[,;<> ]\s*)/, $str;
         for (@parts) {
             next unless /[^[:ascii:]]/;
-            $_ = encode_qp($_);
-            s/=\r?\n$//;
+            $_ = encode_qp($_, '');
+            #s/=\r?\n$//;
             s/(\s)/'=' . sprintf '%x',ord($1)/ge;
             $_ = "=?$charset?Q?" . $_ . "?=";
         }

Mon Dec 09 11:48:53 2013 jpl [...] plosquare.com - Correspondence added

From:	jpl [...] plosquare.com

Next try: the one-character fix in the last patch turned out to be insufficient, as it caused two consecutive encoded-words to appear concatenated when the header was displayed in a mail client (the whitespace between encoded-words apparently doesn't count). So I fixed the patch to include the following whitespace in the encoded-word. I also added encoding for \n, \r and \t characters (they are matched by [:ascii:], but may not appear literally in a header).

Obviously, it's still just dirty hacks rather than a clean RFC 2047-compliant version. Even so, it works better than the original, so I'd apply it in new releases (with some comments for KNOWN BUGS).
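The whitespace behaviour described here can be reproduced on the decoding side. The sketch below uses the 'MIME-Header' encoding that ships with Encode as a stand-in for a mail client; that substitution is an assumption, and the sample strings are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode(STDOUT, ':encoding(UTF-8)');

# Two encoded-words separated by a plain space: a decoder drops the
# space between them, so the words run together.
my $adjacent = '=?UTF-8?Q?caf=C3=A9?= =?UTF-8?Q?=C3=A0?=';

# Same words, but the space is carried inside the first encoded-word
# as =20, which is the approach taken in the revised patch.
my $with_space = '=?UTF-8?Q?caf=C3=A9=20?= =?UTF-8?Q?=C3=A0?=';

my $joined = decode('MIME-Header', $adjacent);
my $spaced = decode('MIME-Header', $with_space);

print "separator outside: [$joined]\n";
print "separator inside:  [$spaced]\n";
```

Only whitespace that is part of an encoded-word survives decoding, which is why the patch appends the following whitespace before encoding.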

Subject:	print_hdr_qp_encoding.patch

--- orig/Mail-Sender-0.8.22/Sender.pm 2012-12-12 18:29:40.000000000 +0100
+++ new/Mail-Sender-0.8.22/Sender.pm 2013-12-09 16:53:29.842935378 +0100
@@ -178,15 +178,20 @@
     if ($charset && $str =~ /[^[:ascii:]]/) {
         $str = encode( $charset, $str);

-        my @parts = split /(\s*[,;<>]\s*)/, $str;
-        for (@parts) {
-            next unless /[^[:ascii:]]/;
-            $_ = encode_qp($_);
-            s/=\r?\n$//;
-            s/(\s)/'=' . sprintf '%x',ord($1)/ge;
-            $_ = "=?$charset?Q?" . $_ . "?=";
+        my @parts = split /(\s*[,;<> ]\s*)/, $str;
+        $str = '';
+        for (my $i = 0; $i < @parts; $i++) {
+            my $part = $parts[$i];
+            $part .= $parts[++$i] if ($i < $#parts && $parts[$i+1] =~ /^\s+$/);
+            if ($part =~ /[^[:ascii:]]/ || $part =~ /[\r\n\t]/) {
+                $part = encode_qp($part, '');
+                $part =~ s/([\s\?])/'=' . sprintf '%02x',ord($1)/ge;
+                $str .= "=?$charset?Q?$part?=";
+            }
+            else {
+                $str .= $part;
+            }
         }
-        $str = join '', @parts;
     }

     $str =~ s/(?:\x0D\x0A?|\x0A)/\x0D\x0A/sg; # \n or \r => \r\n

Tue Jul 15 14:57:06 2014 JENDA [...] cpan.org - Correspondence added

I used the patch from the last message, it seems to work fine for me. Thanks! Patched in 0.8.23

Tue Jul 15 14:57:07 2014 JENDA [...] cpan.org - Status changed from 'open' to 'resolved'
