
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 40027
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: cpan [...] robm.fastmail.fm
Cc: JMEHNLE [...] cpan.org
AdminCc:

Bug Information
Severity: Normal
Broken in: 2.26
Fixed in: (no value)



Subject: decode of MIME-Header removes too much whitespace
When doing:

    decode('MIME-Header', "a: b\r\n c")

The result is:

    "a: bc"

The folding whitespace is lost, which is incorrect. See http://www.faqs.org/rfcs/rfc2822.html section 2.2.3, which says:

    Unfolding is accomplished by simply removing any CRLF
    that is immediately followed by WSP. Each header field should be
    treated in its unfolded form for further syntactic and semantic
    evaluation.

The culprit is this line:

    # multi-line header to single line
    $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;

I believe it should be:

    $str =~ s/(:?\r|\n|\r\n)(?=[ \t])//gos;
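The difference between the two substitutions can be checked without calling Encode at all. The following is a minimal sketch that applies each pattern, copied as quoted in the report, to the sample string; the variable names are illustrative only and are not taken from Encode's source.

```perl
use strict;
use warnings;

# Sample folded header from the report: CRLF followed by one space.
my $folded = "a: b\r\n c";

# Substitution quoted as the current code: removes the line break
# and the whitespace character that follows it.
(my $current = $folded) =~ s/(:?\r|\n|\r\n)[ \t]//gos;

# Proposed substitution: the lookahead requires whitespace after the
# line break but leaves that whitespace in place.
(my $proposed = $folded) =~ s/(:?\r|\n|\r\n)(?=[ \t])//gos;

print "current:  [$current]\n";     # prints: current:  [a: bc]
print "proposed: [$proposed]\n";    # prints: proposed: [a: b c]
```

Only the lookahead form keeps the space that RFC 2822 section 2.2.3 leaves in the unfolded field.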
On Tue Oct 14 02:36:21 2008, ROBM wrote:
> When doing:
>
> decode('MIME-Header', "a: b\r\n c")
>
> The result is:
>
> "a: bc"
>
> The folding whitespace is lost which is incorrect. See
> http://www.faqs.org/rfcs/rfc2822.html section 2.2.3 which says:
>
> Unfolding is accomplished by simply removing any CRLF
> that is immediately followed by WSP. Each header field should be
> treated in its unfolded form for further syntactic and semantic
> evaluation.
But the previous paragraph also states that:

    Note: Though structured field bodies are defined in such a way that
    folding can take place between many of the lexical tokens (and even
    within some of the lexical tokens), folding SHOULD be limited to
    placing the CRLF at higher-level syntactic breaks. For instance, if
    a field body is defined as comma-separated values, it is recommended
    that folding occur after the comma separating the structured items
    in preference to other places where the field could be folded, even
    if it is allowed elsewhere.

So the folding can occur anywhere, and the following folded line must start with a whitespace. Meaning the current implementation is okay:

    % perl -MEncode -le 'print decode("MIME-Header", "a: b\r\n c")'
    a: bc
    % perl -MEncode -le 'print decode("MIME-Header", "a: b\r\n  c")'
    a: b c

> The culprit is this line:
>
> # multi-line header to single line
> $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;
>
> I believe it should be:
>
> $str =~ s/(:?\r|\n|\r\n)(?=[ \t])//gos;

No. The 1st whitespace must be removed because that is the only sign that the line was folded.

Dan the Encode Maintainer
> > The folding whitespace is lost which is incorrect. See
> > http://www.faqs.org/rfcs/rfc2822.html section 2.2.3 which says:
> >
> > Unfolding is accomplished by simply removing any CRLF
> > that is immediately followed by WSP. Each header field should be
> > treated in its unfolded form for further syntactic and semantic
> > evaluation.
>
> But the previous paragraph also states that:
> ...
> So the folding can occur anywhere and the following folded line must
> start with a whitespace.
> Meaning the current implementation is okay

No it isn't. I don't see what's ambiguous about this statement:

> > Unfolding is accomplished by simply removing any CRLF
> > that is immediately followed by WSP.

It says nothing about removing *any* other whitespace. It says to ONLY remove the CRLF.

> > # multi-line header to single line
> > $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;

This removes the CRLF + the space or tab after it. That is clearly WRONG.

> > $str =~ s/(:?\r|\n|\r\n)(?=[ \t])//gos;

That does exactly what the RFC says. Remove the CRLF when it's immediately followed by whitespace.

Rob
On Wed Jan 21 19:40:31 2009, ROBM wrote:
> > > The folding whitespace is lost which is incorrect. See
> > > http://www.faqs.org/rfcs/rfc2822.html section 2.2.3 which says:
> > >
> > > Unfolding is accomplished by simply removing any CRLF
> > > that is immediately followed by WSP. Each header field should be
> > > treated in its unfolded form for further syntactic and semantic
> > > evaluation.
> >
> > But the previous paragraph also states that:
> > ...
> > So the folding can occur anywhere and the following folded line must
> > start with a whitespace.
> > Meaning the current implementation is okay
>
> No it isn't. I don't see what's ambiguous about this statement:
>
> > > Unfolding is accomplished by simply removing any CRLF
> > > that is immediately followed by WSP.
>
> It says nothing about removing *any* other whitespace. It says to ONLY
> remove the CRLF.
>
> > > # multi-line header to single line
> > > $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;
>
> This removes the CRLF + the space or tab after it. That is clearly WRONG.
>
> > > $str =~ s/(:?\r|\n|\r\n)(?=[ \t])//gos;
>
> That does exactly what the RFC says. Remove the CRLF when it's
> immediately followed by whitespace.
>
> Rob
Neither of us was right. The correct syntax is:

--- lib/Encode/MIME/Header.pm 2009/03/25 07:55:57 2.10
+++ lib/Encode/MIME/Header.pm 2009/07/13 01:22:57
@@ -44,7 +44,7 @@
 $str =~ s/\?=\s+=\?/\?==\?/gos;
 # multi-line header to single line
- $str =~ s/(?:\r\n|[\r\n])[ \t]+//gos;
+ $str =~ s/(?:\r\n|[\r\n])[ \t]//gos;
 1 while ( $str =~ s/(=\?[-0-9A-Za-z_]+\?[Qq]\?)(.*?)\?=\1(.*?\?=)/$1$2$3/ )

You still have to remove the first white space which is mandated to be inserted. Otherwise t/mime-header.t fails.

The patch will appear in Encode 2.35.

Dan the Encode Maintainer.
> You still have to remove the first white space which is mandated to be
> inserted.

This is still *WRONG*. READ THE RFC!

http://www.faqs.org/rfcs/rfc2822.html

> Unfolding is accomplished by simply removing any CRLF
> that is immediately followed by WSP

The bug here is that you're folding lines incorrectly, because you should only be folding at FWS. Again, READ THE RFC!

http://www.faqs.org/rfcs/rfc2822.html

> The general rule is
> that wherever this standard allows for folding white space (not
> simply WSP characters), a CRLF may be inserted before any WSP.

You said:

> Otherwise t/mime-header.t fails.

This is because your encoding + folding routines are wrong.

    perl -e 'use Encode; print encode("MIME-Q", "Subject: \x{f3}");'

Outputs:

    Subject:=?UTF-8?Q?=20=C3=B3?=

You're encoding the space character into the encoded value. You should be doing:

    Subject: =?UTF-8?Q?=C3=B3?=

So that for a long line, you can fold that into:

    Subject:
     =?UTF-8?Q?=C3=B3?=

And have it unfold correctly.

So the real problem is that the encoding is swallowing the FWS into the encoded-words, which then means your folding can't fold properly because there is no FWS left to fold on.

Ick, just checking the code, it's inserting arbitrary "\n " lines itself, rather than actually folding at FWS points. That's just completely bogus.

Basically it seems to me that the encode() function is broken, for two reasons:

1. It's absorbing all whitespace into the encoded-word, thus leaving no points for FWS to fold on
2. It's arbitrarily folding at any point and inserting "\n " at the fold point, rather than trying to find FWS and folding there.

Referring back to RFC2822 again.

> The general rule is
> that wherever this standard allows for folding white space (not
> simply WSP characters), a CRLF may be inserted before any WSP.

So you can't just arbitrarily insert a "\n " in your text; you have to insert a "\n" only before a FWS point so the unfolding works correctly.

Rob
On Mon Jul 13 03:08:22 2009, ROBM wrote:
> > You still have to remove the first white space which is mandated to be
> > inserted.
>
> This is still *WRONG*. READ THE RFC!
>
> http://www.faqs.org/rfcs/rfc2822.html

I did. AND RFC2047

http://www.ietf.org/rfc/rfc2047.txt

> > Unfolding is accomplished by simply removing any CRLF
> > that is immediately followed by WSP
>
> The bug here is that you're folding lines incorrectly, because you
> should only be folding at FWS. Again, READ THE RFC!

THAT IS NOT THE ONLY CASE WHEN MIME ENCODING IS CONCERNED.

RFC2047

> 8. Examples
>
> The following are examples of message headers containing 'encoded-
> word's:
>
> From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
> To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
> CC: =?ISO-8859-1?Q?Andr=E9?= Pirard <PIRARD@vm1.ulg.ac.be>
> Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
>     =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

See the last line? It decodes to:

    Subject: If you can read this you understand the example.

And

    =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=

decodes to

    'If you can read this yo'

And

    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

decodes to

    'u understand the example.'

As you see, FWS is STRIPPED.

Remember: MIME-(B/Q) encoding is primarily for non-ASCII characters, and you cannot take whitespaces for granted as word delimiters.

Read the RFC? Read the RFCs!

Dan the Encode Maintainer.
> See The last line? It doedes to:
>
> Subject: If you can read this you understand the example.
>
> And
>
> =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
>
> Decodes to
>
> 'If you can read this yo'
>
> And
>
> =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
>
> Decodes to
>
> 'u understand the example.'
>
> As you see FWS is STRIPPED.

This has *nothing* to do with the folding. This has to do with a special case as documented in RFC2047.

http://www.ietf.org/rfc/rfc2047.txt

    6.2. Display of 'encoded-word's
    ...
    When displaying a particular header field that contains multiple
    'encoded-word's, any 'linear-white-space' that separates a pair of
    adjacent 'encoded-word's is ignored. (This is to allow the use of
    multiple 'encoded-word's to represent long strings of unencoded
    text, without having to separate 'encoded-word's where spaces occur
    in the unencoded text.)
    ...

Again, unfolding of headers is just done by:

> Unfolding is accomplished by simply removing any CRLF
> that is immediately followed by WSP

Rob
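The two separate rules discussed above can be sketched on the RFC 2047 section 8 example quoted earlier in the thread. This is a hedged illustration, not Encode's implementation: it uses only plain regexes and the core MIME::Base64 module, the variable names are made up for the example, and it ignores the charset label because both payloads happen to be plain ASCII.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Folded Subject field body from the RFC 2047 section 8 example.
my $body = "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\r\n"
         . " =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=";

# Step 1 (RFC 2822 unfolding): remove only the CRLF that precedes WSP.
(my $unfolded = $body) =~ s/\r\n(?=[ \t])//g;

# Step 2 (RFC 2047 section 6.2): drop whitespace that separates two
# adjacent encoded-words.
(my $joined = $unfolded) =~ s/(\?=)[ \t]+(?==\?)/$1/g;

# Step 3: decode each B-encoded word. The charset is ignored here
# because both payloads in this example are plain ASCII.
(my $decoded = $joined) =~ s/=\?[^?]+\?[Bb]\?([^?]*)\?=/decode_base64($1)/ge;

print "$decoded\n";
```

After step 1 the space between the two encoded-words is still present; it disappears only in step 2, which is the RFC 2047 rule rather than the unfolding rule.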
Rob is correct. The RFC says only to remove the CRLF, *not* the WSP that is required to follow. Removal of white-space is performed only when joining two adjacent MIME-encoded words after decoding them. I.e.,

foo
bar

unfolds to «foo bar», whereas

=?US-ASCII?Q?foo?=
=?US-ASCII?Q?bar?=

unfolds to «=?US-ASCII?Q?foo?= =?US-ASCII?Q?bar?=», which then decodes to «foobar» per RFC 2047, section 6.2 (start of page 10):

http://tools.ietf.org/html/rfc2047#section-6.2
http://tools.ietf.org/html/rfc2047#page-10

Again, the white-space removal at work here is entirely due to RFC 2047, section 6.2, 3rd paragraph, and NOT due to RFC 2822.
Stupid RT ate my spaces. I meant to say (using | as indentation markers):

| foo
| bar

unfolds to «foo bar», whereas

| =?US-ASCII?Q?foo?=
| =?US-ASCII?Q?bar?=

unfolds to «=?US-ASCII?Q?foo?= =?US-ASCII?Q?bar?=», which then decodes to «foobar» [...]
RT still eating my spaces. F* you, RT. Anyway, you know what I meant.
On Wed Apr 21 17:43:58 2010, JMEHNLE wrote:
> RT still eating my spaces. F* you, RT. Anyway, you know what I meant.

I suspect you want to toggle one of the display preferences on https://rt.cpan.org/Prefs/Other.html, since I see correct spacing in the email generated by RT and in the web UI.

If you believe that there is a bug in rt.cpan.org, the correct place to report issues is the rt-cpan-admin address linked above.

Thanks
-kevin
To the best of my knowledge, Rob is correct on all counts.

    decode('MIME-Header', "a: b\r\n c")

should be "a: b c".

The example you (Dan) provided with regard to =?...?= strings was not on point, for the reasons Rob gave: its behavior was not because the whitespace should be entirely stripped, but because whitespace between encoded-word tokens is ignored.

--
rjbs
Why is Encode dealing with folding at all? Folding is a separate operation from encoding. Encode should not do folding or unfolding. It should simply encode and decode.
For non-encoded headers:

    The general rule is that wherever this standard allows for folding
    white space (not simply WSP characters), a CRLF may be inserted
    before any WSP.

For encoded headers:

    An 'encoded-word' may not be more than 75 characters long, including
    'charset', 'encoding', 'encoded-text', and delimiters. If it is
    desirable to encode more text than will fit in an 'encoded-word' of
    75 characters, multiple 'encoded-word's (separated by CRLF SPACE)
    may be used.

So the SPACE is to be deleted only in the encoded-word case.
From: Mark.Martinec [...] ijs.si
Any chance of getting this fixed? Unfolding is still broken as of version 2.72 - for reasons clearly pointed out in the discussion above. When decoding, the proper RFC 5322 unfolding must be done first (without losing space or tab), and only then the whitespace between encoded words can be removed. Other whitespace must not be lost, even at folding points.
Still busted as of 2.78. Specifically, this From header properly preserves the space between the word "dog" and the emoji character following it when decoded by the GMail web interface and Mail.app on Mac OS 10.11.1:

    From: "The quick brown fox runs over the lazy dog =?UTF-8?Q?=F0=9F=90=BA?=" <wolfy@example.com>

When decoded using Encode::MIME::Header, it eats that space incorrectly. Sample program attached that produces this output:

    Encode version: 2.78
    From: "The quick brown fox runs over the lazy dog🐺" <wolfy@example.com>
Attachment: encode-error (application/octet-stream, 275b; message body not shown because it is not plain text)
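The attached program is not reproduced in the ticket. As a stand-in, the sketch below decodes the single Q-encoded word in that From header by hand, replacing only the encoded-word itself so the space before it survives. The helper `q_to_bytes` is hypothetical and written for this illustration; it is not part of Encode, and the sketch handles only the UTF-8 Q-encoding case shown in the report.

```perl
use strict;
use warnings;

# From header quoted in the report above.
my $header = 'From: "The quick brown fox runs over the lazy dog '
           . '=?UTF-8?Q?=F0=9F=90=BA?=" <wolfy@example.com>';

# Hypothetical helper: turn one Q-encoded payload into raw bytes.
sub q_to_bytes {
    my ($text) = @_;
    $text =~ tr/_/ /;                              # "_" stands for a space
    $text =~ s/=([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # =XX is one hex-encoded byte
    return $text;
}

# Replace only the encoded-word; surrounding whitespace is untouched.
(my $decoded = $header) =~ s/=\?UTF-8\?[Qq]\?([^?]*)\?=/q_to_bytes($1)/ge;

# Interpret the resulting bytes as UTF-8 characters.
utf8::decode($decoded);

binmode(STDOUT, ':encoding(UTF-8)');
print "$decoded\n";
```

With this approach the space between "dog" and the emoji is kept, which matches the behaviour the reporter describes for other mail clients.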

Dan:

Could we please see a fix for this in the next Encode? Right now, I have workarounds in a number of places both at work and in CPAN, and it would be nice to be able to tell people "just use Encode." Right now, I can't.

If all that's needed at this point is a patch, I'm sure I could get you one. I'm not sure, though, that you agree that the current MIME-Header behavior is wrong.

--
rjbs
Please send me a patch, preferably with a test.

Dan the Maintainer Thereof

On Fri Oct 30 08:33:07 2015, RJBS wrote:
> Dan:
>
> Could we please see a fix for this in the next Encode? Right now, I
> have workarounds in a number of places both at work and in CPAN, and
> it would be nice to be able to tell people "just use Encode." Right
> now, I can't.
>
> If all that's needed at this point is a patch, I'm sure I could get
> you one. I'm not sure, though, that you agree that the current
> MIME-Header behavior is wrong.
cf. https://rt.cpan.org/Ticket/Display.html?id=88717

Dan the Maintainer Thereof

On Mon Nov 02 00:42:14 2015, DANKOGAI wrote:
> Please send me a patch, preferably with a test.
>
> Dan the Maintainer Thereof
>
> On Fri Oct 30 08:33:07 2015, RJBS wrote:
> > Dan:
> >
> > Could we please see a fix for this in the next Encode? Right now, I
> > have workarounds in a number of places both at work and in CPAN, and
> > it would be nice to be able to tell people "just use Encode." Right
> > now, I can't.
> >
> > If all that's needed at this point is a patch, I'm sure I could get
> > you one. I'm not sure, though, that you agree that the current
> > MIME-Header behavior is wrong.
Sweet, thanks for fixing this, Dan! What version will this fix first be released with? I have some dependencies to update.
Already released as 2.79.

https://metacpan.org/release/Encode

Dan the Maintainer Thereof

On Fri Jan 22 10:36:18 2016, JMEHNLE wrote:
> Sweet, thanks for fixing this, Dan!
>
> What version will this fix first be released with? I have some
> dependencies to update.