Bug #40027 for Encode: decode of MIME-Header removes too much whitespace

Tue Oct 14 02:36:21 2008 cpan [...] robm.fastmail.fm - Ticket created

When doing:

    decode('MIME-Header', "a: b\r\n c")

The result is:

    "a: bc"

The folding whitespace is lost, which is incorrect. See http://www.faqs.org/rfcs/rfc2822.html section 2.2.3, which says:

    Unfolding is accomplished by simply removing any CRLF
    that is immediately followed by WSP. Each header field should be
    treated in its unfolded form for further syntactic and semantic
    evaluation.

The culprit is this line:

    # multi-line header to single line
    $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;

I believe it should be:

    $str =~ s/(:?\r|\n|\r\n)(?=[ \t])//gos;
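The difference between the two substitutions in the report can be reproduced without going through Encode at all. The following is a minimal sketch, not the module's code path: it applies both patterns, copied verbatim from the report, to the sample string. The variable names are illustrative only. The leading `(:?` in both patterns looks like a typo for the non-capturing `(?:`; it is kept as quoted and makes no difference for this input.

```perl
use strict;
use warnings;

# Folded header from the report: CRLF followed by a single space.
my $folded = "a: b\r\n c";

# Pattern quoted as the culprit: consumes the line break AND the WSP after it.
(my $current = $folded) =~ s/(:?\r|\n|\r\n)[ \t]//gos;

# Proposed pattern: the lookahead requires the WSP but does not consume it.
(my $proposed = $folded) =~ s/(:?\r|\n|\r\n)(?=[ \t])//gos;

printf "current:  [%s]\n", $current;
printf "proposed: [%s]\n", $proposed;
```

The first substitution yields "a: bc" and the second yields "a: b c", which is the behaviour the report asks for.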

Wed Jan 21 17:39:57 2009 DANKOGAI [...] cpan.org - Correspondence added

On Tue Oct 14 02:36:21 2008, ROBM wrote:

> When doing:
>
> decode('MIME-Header', "a: b\r\n c")
>
> The result is:
>
> "a: bc"
>
> The folding whitespace is lost which is incorrect. See
> http://www.faqs.org/rfcs/rfc2822.html section 2.2.3 which says:
>
> Unfolding is accomplished by simply removing any CRLF
> that is immediately followed by WSP. Each header field should be
> treated in its unfolded form for further syntactic and semantic
> evaluation.

But the previous paragraph also states that:

    Note: Though structured field bodies are defined in such a way that
    folding can take place between many of the lexical tokens (and even
    within some of the lexical tokens), folding SHOULD be limited to
    placing the CRLF at higher-level syntactic breaks. For instance, if
    a field body is defined as comma-separated values, it is recommended
    that folding occur after the comma separating the structured items
    in preference to other places where the field could be folded, even
    if it is allowed elsewhere.

So the folding can occur anywhere, and the following folded line must start with a whitespace. Meaning the current implementation is okay:

    % perl -MEncode -le 'print decode("MIME-Header", "a: b\r\n c")'
    a: bc
    % perl -MEncode -le 'print decode("MIME-Header", "a: b\r\n  c")'
    a: b c

> The culprit is this line:
>
> # multi-line header to single line
> $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;
>
> I believe it should be:
>
> $str =~ s/(:?\r|\n|\r\n)(?=[ \t])//gos;

No. The 1st whitespace must be removed because that is the only sign that the line was folded.

Dan the Encode Maintainer

Wed Jan 21 17:39:57 2009 The RT System itself - Status changed from 'new' to 'open'

Wed Jan 21 17:39:58 2009 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Jan 21 19:40:31 2009 cpan [...] robm.fastmail.fm - Correspondence added

> > The folding whitespace is lost which is incorrect. See
> > http://www.faqs.org/rfcs/rfc2822.html section 2.2.3 which says:
> >
> > Unfolding is accomplished by simply removing any CRLF
> > that is immediately followed by WSP. Each header field should be
> > treated in its unfolded form for further syntactic and semantic
> > evaluation.
>
> But the previous paragraph also states that:
> ...
> So the folding can occur anywhere and the following folded line must
> start with a whitespace.
> Meaning the current implementation is okay

No it isn't. I don't see what's ambiguous about this statement:

> > Unfolding is accomplished by simply removing any CRLF
> > that is immediately followed by WSP.

It says nothing about removing *any* other whitespace. It says to ONLY remove the CRLF.

> > # multi-line header to single line
> > $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;

This removes the CRLF + the space or tab after it. That is clearly WRONG.

> > $str =~ s/(:?\r|\n|\r\n)(?=[ \t])//gos;

That does exactly what the RFC says. Remove the CRLF when it's immediately followed by whitespace.

Rob

Wed Jan 21 19:40:32 2009 The RT System itself - Status changed from 'resolved' to 'open'

Sun Jul 12 21:32:41 2009 DANKOGAI [...] cpan.org - Correspondence added

On Wed Jan 21 19:40:31 2009, ROBM wrote:

> > > The folding whitespace is lost which is incorrect. See
> > > http://www.faqs.org/rfcs/rfc2822.html section 2.2.3 which says:
> > >
> > > Unfolding is accomplished by simply removing any CRLF
> > > that is immediately followed by WSP. Each header field should be
> > > treated in its unfolded form for further syntactic and semantic
> > > evaluation.
> >
> > But the previous paragraph also states that:
> > ...
> > So the folding can occur anywhere and the following folded line must
> > start with a whitespace.
> > Meaning the current implementation is okay
>
> No it isn't. I don't see what's ambiguous about this statement:
>
> > > Unfolding is accomplished by simply removing any CRLF
> > > that is immediately followed by WSP.
>
> It says nothing about removing *any* other whitespace. It says to ONLY
> remove the CRLF.
>
> > > # multi-line header to single line
> > > $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;
>
> This removes the CRLF + the space or tab after it. That is clearly WRONG.
>
> > > $str =~ s/(:?\r|\n|\r\n)(?=[ \t])//gos;
>
> That does exactly what the RFC says. Remove the CRLF when it's
> immediately followed by whitespace.
>
> Rob

Neither of us was right. The correct syntax is:

    --- lib/Encode/MIME/Header.pm 2009/03/25 07:55:57 2.10
    +++ lib/Encode/MIME/Header.pm 2009/07/13 01:22:57
    @@ -44,7 +44,7 @@
         $str =~ s/\?=\s+=\?/\?==\?/gos;
         # multi-line header to single line
    -    $str =~ s/(?:\r\n|[\r\n])[ \t]+//gos;
    +    $str =~ s/(?:\r\n|[\r\n])[ \t]//gos;
         1 while ( $str =~ s/(=\?[-0-9A-Za-z_]+\?[Qq]\?)(.*?)\?=\1(.*?\?=)/$1$2$3/ )

You still have to remove the first white space, which is mandated to be inserted. Otherwise t/mime-header.t fails. The patch will appear in Encode 2.35.

Dan the Encode Maintainer.
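To make the three positions in the thread concrete, here is a small sketch that runs the removed line of the diff, the added line of the diff, and a lookahead-based unfolding against the same input. It assumes a continuation line that begins with two spaces; the variable names are illustrative, and the third pattern is a simplified form of Rob's proposal rather than code from Encode.

```perl
use strict;
use warnings;

# Continuation line starts with TWO spaces after the CRLF.
my $folded = "a: b\r\n  c";

# Removed side of the diff: strips the line break and all leading WSP.
(my $before_patch = $folded) =~ s/(?:\r\n|[\r\n])[ \t]+//gos;

# Added side of the diff: strips the line break and exactly one WSP.
(my $after_patch = $folded) =~ s/(?:\r\n|[\r\n])[ \t]//gos;

# Unfolding as RFC 2822 section 2.2.3 words it: strip only the CRLF.
(my $crlf_only = $folded) =~ s/\r\n(?=[ \t])//g;

printf "%-13s [%s]\n", 'before patch:', $before_patch;
printf "%-13s [%s]\n", 'after patch:',  $after_patch;
printf "%-13s [%s]\n", 'CRLF only:',    $crlf_only;
```

The three results contain zero, one, and two spaces between "b" and "c" respectively, so the patch changes how much whitespace is lost but still removes one character that the sender put there.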

Sun Jul 12 21:32:42 2009 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Mon Jul 13 03:08:22 2009 cpan [...] robm.fastmail.fm - Correspondence added

> You still have to remove the first white space which is mandated to be
> inserted.

This is still *WRONG*. READ THE RFC!

http://www.faqs.org/rfcs/rfc2822.html

> Unfolding is accomplished by simply removing any CRLF
> that is immediately followed by WSP

The bug here is that you're folding lines incorrectly, because you should only be folding at FWS. Again, READ THE RFC!

http://www.faqs.org/rfcs/rfc2822.html

> The general rule is
> that wherever this standard allows for folding white space (not
> simply WSP characters), a CRLF may be inserted before any WSP.

You said:

> Otherwise t/mime-header.t fails.

This is because your folding encoding + folding routines are wrong.

    perl -e 'use Encode; print encode("MIME-Q", "Subject: \x{f3}");'

Outputs:

    Subject:=?UTF-8?Q?=20=C3=B3?=

You're encoding the space character into the encoded value. You should be doing:

    Subject: =?UTF-8?Q?=C3=B3?=

So that for a long line, you can fold that into:

    Subject:
     =?UTF-8?Q?=C3=B3?=

And have it unfold correctly.

So the real problem is that the encoding is swallowing the FWS into the encoded-words, which then means your folding can't fold properly because there is no FWS left to fold on.

Ick, just checking the code, it's inserting arbitrary "\n " lines itself, rather than actually folding at FWS points. That's just completely bogus.

Basically it seems to me that the encode() function is broken, for two reasons:

1. It's absorbing all whitespace into the encoded-word, thus leaving no points for FWS to fold on
2. It's arbitrarily folding at any point and inserting "\n " at the fold point, rather than trying to find FWS and folding there.

Referring back to RFC2822 again.

> The general rule is
> that wherever this standard allows for folding white space (not
> simply WSP characters), a CRLF may be inserted before any WSP.

So you can't just arbitrarily insert a "\n " in your text; you have to insert a "\n" only before a FWS point so the unfolding works correctly.

Rob
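The rule Rob quotes (insert the line break only in front of whitespace that is already there) can be sketched as follows. This is an illustration under stated assumptions, not the Encode implementation: `fold_at_wsp` and `unfold` are hypothetical helper names, the 20-character limit is arbitrary, and the sketch ignores encoded-words and structured-field syntax entirely.

```perl
use strict;
use warnings;

# Hypothetical helper: fold a header line by inserting CRLF in front of an
# existing space or tab. No whitespace is added, so unfolding is lossless.
sub fold_at_wsp {
    my ($line, $max) = @_;
    my $out = '';
    while (length($line) > $max) {
        # Find the last WSP at or before $max, skipping position 0.
        my $cut = -1;
        for my $i (reverse 1 .. $max) {
            if (substr($line, $i, 1) =~ /[ \t]/) { $cut = $i; last; }
        }
        last if $cut < 0;    # nothing to fold on: leave the rest as it is
        $out .= substr($line, 0, $cut) . "\r\n";
        $line = substr($line, $cut);    # the WSP starts the continuation line
    }
    return $out . $line;
}

# Unfolding as RFC 2822 section 2.2.3 describes it: remove only the CRLF.
sub unfold {
    my ($str) = @_;
    $str =~ s/\r\n(?=[ \t])//g;
    return $str;
}

my $header = "Subject: one two three four five six seven eight nine ten";
my $folded = fold_at_wsp($header, 20);

print "$folded\n";
print unfold($folded) eq $header ? "round trip ok\n" : "round trip FAILED\n";
```

Because the folder only ever places CRLF before whitespace that was part of the original value, removing just the CRLF restores the header exactly. A folder that inserts its own "\n " needs the decoder to remove that extra space again, which is the coupling this ticket is arguing about.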

Mon Jul 13 03:08:22 2009 The RT System itself - Status changed from 'resolved' to 'open'

Tue Jul 28 18:22:43 2009 DANKOGAI [...] cpan.org - Correspondence added

On Mon Jul 13 03:08:22 2009, ROBM wrote:

> > You still have to remove the first white space which is mandated to be
> > inserted.
>
> This is still *WRONG*. READ THE RFC!
>
> http://www.faqs.org/rfcs/rfc2822.html

I did. AND RFC2047

http://www.ietf.org/rfc/rfc2047.txt

> > Unfolding is accomplished by simply removing any CRLF
> > that is immediately followed by WSP
>
> The bug here is that you're folding lines incorrectly, because you
> should only be folding at FWS. Again, READ THE RFC!

THAT IS NOT THE ONLY CASE WHEN MIME ENCODING IS CONCERNED.

RFC2047

> 8. Examples
>
> The following are examples of message headers containing 'encoded-
> word's:
>
> From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
> To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
> CC: =?ISO-8859-1?Q?Andr=E9?= Pirard <PIRARD@vm1.ulg.ac.be>
> Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
>     =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

See the last line? It decodes to:

    Subject: If you can read this you understand the example.

And

    =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=

decodes to

    'If you can read this yo'

And

    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

decodes to

    'u understand the example.'

As you see, FWS is STRIPPED. Remember: MIME-(B/Q) encoding is primarily for non-ASCII characters, and you cannot take whitespace for granted as a word delimiter.

Read the RFC? Read the RFCs!

Dan the Encode Maintainer.

Tue Jul 28 18:22:44 2009 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Tue Jul 28 21:15:09 2009 cpan [...] robm.fastmail.fm - Correspondence added

> See the last line? It decodes to:
>
> Subject: If you can read this you understand the example.
>
> And
>
> =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
>
> Decodes to
>
> 'If you can read this yo'
>
> And
>
> =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
>
> Decodes to
>
> 'u understand the example.'
>
> As you see FWS is STRIPPED.

This has *nothing* to do with the folding. This has to do with a special case as documented in RFC2047.

http://www.ietf.org/rfc/rfc2047.txt

    6.2. Display of 'encoded-word's
    ...
    When displaying a particular header field that contains multiple
    'encoded-word's, any 'linear-white-space' that separates a pair of
    adjacent 'encoded-word's is ignored. (This is to allow the use of
    multiple 'encoded-word's to represent long strings of unencoded
    text, without having to separate 'encoded-word's where spaces occur
    in the unencoded text.)
    ...

Again, unfolding of headers is just done by:

> Unfolding is accomplished by simply removing any CRLF
> that is immediately followed by WSP

Rob

Tue Jul 28 21:15:10 2009 The RT System itself - Status changed from 'resolved' to 'open'

Wed Apr 14 21:22:08 2010 JMEHNLE [...] cpan.org - Cc JMEHNLE added

Wed Apr 21 17:35:35 2010 JMEHNLE [...] cpan.org - Correspondence added

Rob is correct. The RFC says only to remove the CRLF, *not* the WSP that is required to follow. Removal of white-space is performed only when joining two adjacent MIME-encoded words after decoding them.

I.e., foo bar unfolds to «foo bar», whereas =?US-ASCII?Q?foo?= =?US-ASCII?Q?bar?= unfolds to «=?US-ASCII?Q?foo?= =?US-ASCII?Q?bar?=», which then decodes to «foobar» per RFC 2047, section 6.2 (start of page 10):

http://tools.ietf.org/html/rfc2047#section-6.2
http://tools.ietf.org/html/rfc2047#page-10

Again, the white-space removal at work here is entirely due to RFC 2047, section 6.2, 3rd paragraph, and NOT due to RFC 2822.

Wed Apr 21 17:43:02 2010 JMEHNLE [...] cpan.org - Correspondence added

Wed Apr 21 17:43:58 2010 JMEHNLE [...] cpan.org - Correspondence added

RT still eating my spaces. F* you, RT. Anyway, you know what I meant.

Fri Jan 28 11:46:52 2011 cpan [...] jibsheet.com - Correspondence added

On Wed Apr 21 17:43:58 2010, JMEHNLE wrote:

> RT still eating my spaces. F* you, RT. Anyway, you know what I meant.

I suspect you want to toggle one of the display preferences on https://rt.cpan.org/Prefs/Other.html since I see correct spacing in the email generated by RT and in the web UI.

If you believe that there is a bug in rt.cpan.org, the correct place to report issues is the rt-cpan-admin address linked above.

Thanks
-kevin

Fri Oct 21 22:31:19 2011 rjbs [...] cpan.org - Correspondence added

To the best of my knowledge, Rob is correct on all counts.

    decode('MIME-Header', "a: b\r\n c")

should be

    "a: b c"

The example you (Dan) provided with regard to =?...?= strings was not on point, for the reasons Rob gave: its behavior was not because the whitespace should be entirely stripped, but because whitespace between encoded-word tokens is ignored.

-- rjbs

Fri Oct 21 22:51:11 2011 DROLSKY [...] cpan.org - Correspondence added

Why is Encode dealing with folding at all? Folding is a separate operation from encoding. Encode should not do folding or unfolding. It should simply encode and decode.

Tue Jul 01 15:46:55 2014 NIKOLAS [...] cpan.org - Correspondence added

For unencoded headers:

    The general rule is that wherever this standard allows for folding
    white space (not simply WSP characters), a CRLF may be inserted
    before any WSP.

For encoded headers:

    An 'encoded-word' may not be more than 75 characters long, including
    'charset', 'encoding', 'encoded-text', and delimiters. If it is
    desirable to encode more text than will fit in an 'encoded-word' of
    75 characters, multiple 'encoded-word's (separated by CRLF SPACE)
    may be used.

So the SPACE is to be deleted only between encoded-words.

Tue Oct 06 10:32:10 2015 Mark.Martinec [...] ijs.si - Correspondence added

Any chance of getting this fixed? Unfolding is still broken as of version 2.72 - for reasons clearly pointed out in the discussion above. When decoding, the proper RFC 5322 unfolding must be done first (without losing space or tab), and only then the whitespace between encoded words can be removed. Other whitespace must not be lost, even at folding points.
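The ordering described here (unfold first, then drop whitespace only between adjacent encoded-words, then decode) can be sketched roughly as below. This is an assumption-laden illustration, not the fix that went into Encode: `decode_header_sketch` is a hypothetical name, the encoded-word pattern is simplified, and there is no error handling for unknown charsets or malformed input. It uses only `Encode::decode` and `MIME::Base64::decode_base64`.

```perl
use strict;
use warnings;
use Encode qw(decode);
use MIME::Base64 qw(decode_base64);

# Hypothetical helper, not part of Encode.
sub decode_header_sketch {
    my ($str) = @_;

    # Step 1: RFC 5322 unfolding. Remove only the CRLF; the WSP stays.
    $str =~ s/\r\n(?=[ \t])//g;

    # Simplified encoded-word pattern: =?charset?B-or-Q?text?=
    my $ew = qr/=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=/;

    # Step 2: RFC 2047 section 6.2. Remove whitespace only where it
    # separates two adjacent encoded-words.
    $str =~ s/($ew)[ \t]+(?=$ew)/$1/g;

    # Step 3: decode each encoded-word in place (B is base64; in Q, "_"
    # stands for a space and =XX for one byte).
    $str =~ s{=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=}{
        my ($charset, $enc, $text) = ($1, uc $2, $3);
        if ($enc eq 'B') {
            $text = decode_base64($text);
        }
        else {
            $text =~ tr/_/ /;
            $text =~ s/=([0-9A-Fa-f]{2})/chr hex $1/ge;
        }
        decode($charset, $text);
    }ge;

    return $str;
}

# Plain folded header from the original report.
my $plain = decode_header_sketch("a: b\r\n c");

# Folded Subject example from RFC 2047 section 8, as quoted in this ticket.
my $rfc = decode_header_sketch(
    "Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\r\n"
  . " =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?="
);

print "[$plain]\n";
print "[$rfc]\n";
```

With this ordering the plain header keeps its space ("a: b c"), while the two encoded-words from the RFC example are still joined without a gap, so both sides' examples in this ticket come out as their authors expect.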

Thu Oct 29 10:12:16 2015 VKHERA [...] cpan.org - Correspondence added

Still busted as of 2.78. Specifically, this From header properly preserves the space between the word "dog" and the emoji character following it when decoded by the GMail web interface and Mail.app on Mac OS 10.11.1.

    From: "The quick brown fox runs over the lazy dog =?UTF-8?Q?=F0=9F=90=BA?=" <wolfy@example.com>

When decoded using Encode::MIME::Header, it eats that space incorrectly. Sample program attached that produces this output:

    Encode version: 2.78
    From: "The quick brown fox runs over the lazy dog🐺" <wolfy@example.com>

Attachment: encode-error (application/octet-stream, 275b)

Fri Oct 30 08:33:07 2015 rjbs [...] cpan.org - Correspondence added

Dan: Could we please see a fix for this in the next Encode? Right now, I have workarounds in a number of places both at work and in CPAN, and it would be nice to be able to tell people "just use Encode." Right now, I can't. If all that's needed at this point is a patch, I'm sure I could get you one. I'm not sure, though, that you agree that the current MIME-Header behavior is wrong. -- rjbs

Mon Nov 02 00:42:14 2015 DANKOGAI [...] cpan.org - Correspondence added

Please send me a patch, preferably with a test.

Dan the Maintainer Thereof

On Fri Oct 30 08:33:07 2015, RJBS wrote:

> Dan:
>
> Could we please see a fix for this in the next Encode? Right now, I
> have workarounds in a number of places both at work and in CPAN, and
> it would be nice to be able to tell people "just use Encode." Right
> now, I can't.
>
> If all that's needed at this point is a patch, I'm sure I could get
> you one. I'm not sure, though, that you agree that the current MIME-
> Header behavior is wrong.

Fri Jan 22 01:27:25 2016 DANKOGAI [...] cpan.org - Correspondence added

cf. https://rt.cpan.org/Ticket/Display.html?id=88717

Dan the Maintainer Thereof

On Mon Nov 02 00:42:14 2015, DANKOGAI wrote:

> Please send me a patch, preferably with a test.
>
> Dan the Maintainer Thereof
>
> On Fri Oct 30 08:33:07 2015, RJBS wrote:
> > Dan:
> >
> > Could we please see a fix for this in the next Encode? Right now, I
> > have workarounds in a number of places both at work and in CPAN, and
> > it would be nice to be able to tell people "just use Encode." Right
> > now, I can't.
> >
> > If all that's needed at this point is a patch, I'm sure I could get
> > you one. I'm not sure, though, that you agree that the current MIME-
> > Header behavior is wrong.

Fri Jan 22 01:27:27 2016 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Fri Jan 22 10:36:18 2016 JMEHNLE [...] cpan.org - Correspondence added

Sweet, thanks for fixing this, Dan! What version will this fix first be released with? I have some dependencies to update.

Fri Jan 22 11:17:49 2016 DANKOGAI [...] cpan.org - Correspondence added

Already released as 2.79.

https://metacpan.org/release/Encode

Dan the Maintainer Thereof

On Fri Jan 22 10:36:18 2016, JMEHNLE wrote:

> Sweet, thanks for fixing this, Dan!
>
> What version will this fix first be released with? I have some
> dependencies to update.