Bug #42902 for Encode: Split header lines are joined incorrectly

Thu Jan 29 14:21:43 2009 MBETHKE [...] cpan.org - Ticket created

Subject: Split header lines are joined incorrectly

Encode::MIME::Header uses the following regex to join split lines:

    $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;

This is b0rked in two ways: first, "(:?" is not non-capturing parentheses but capturing parentheses with an optional colon before the CR. This probably turns up very seldom though :)

The more severe bug that bit me today is that the replacement part is empty, so the whitespace after the line break is deleted along with it and the words on either side run together. RFC2047 is an extension of RFC822, since revised as RFC 2822, so I suppose section 2.2.3 of RFC 2822 is authoritative here:

    The process of moving from this folded multiple-line representation
    of a header field to its single line representation is called
    "unfolding". Unfolding is accomplished by simply removing any CRLF
    that is immediately followed by WSP.

So to be strictly conforming, the expression should be

    $str =~ s/(?:\r|\n|\r\n)(?=[ \t])//gs;

(I think it's fair not to include the more exotic WS characters in \s)

However, CRLF followed by a TAB or multiple spaces is often used for lines that were originally split on a single space, so the following would probably come closer to what most people would expect (this is how mutt does it BTW, I haven't checked any other MUAs though):

    $str =~ s/(?:\r|\n|\r\n)[ \t]+/ /gs;
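To make the difference between the three substitutions concrete, here is a minimal sketch that applies each of them to the same folded header. The sample header strings and variable names are made up for illustration; only the three regexes come from the report above.

```perl
use strict;
use warnings;

# Hypothetical sample: a header folded with CRLF followed by a TAB.
my $folded = "Subject: first part\r\n\tsecond part";

# Original pattern: the line break AND the whitespace after it are
# deleted, so the two words run together.
(my $original = $folded) =~ s/(:?\r|\n|\r\n)[ \t]//gs;

# Strict unfolding: the lookahead keeps the whitespace and removes
# only the line break.
(my $strict = $folded) =~ s/(?:\r|\n|\r\n)(?=[ \t])//gs;

# mutt-style unfolding: line break plus any run of whitespace
# collapses to a single space.
(my $lenient = $folded) =~ s/(?:\r|\n|\r\n)[ \t]+/ /gs;

# The "(:?" typo only shows when a colon directly precedes a bare CR:
# the colon becomes part of the match and is deleted as well.
my $bare_cr = "X-Note:\r folded";
(my $colon_eaten = $bare_cr) =~ s/(:?\r|\n|\r\n)[ \t]//gs;

print "original:    $original\n";     # Subject: first partsecond part
print "strict:      $strict\n";       # TAB preserved between the parts
print "lenient:     $lenient\n";      # Subject: first part second part
print "colon eaten: $colon_eaten\n";  # X-Notefolded
```

The strict variant preserves whatever whitespace the sender used for folding, while the lenient one normalizes it, which is the trade-off discussed above.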

Sun Feb 01 05:33:20 2009 DANKOGAI [...] cpan.org - Correspondence added

Thanks, fixed in 2.28. The line is now replaced with

    $str =~ s/(?:\r|\n|\r\n)[ \t]+/ /gos;

Dan the Maintainer Thereof

On Thu Jan 29 14:21:43 2009, mbethke wrote:

> Encode::MIME::Header uses the following regex to join split lines:
>
>     $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;
>
> This is b0rked in two ways: first, "(:?" is not non-capturing
> parentheses but capturing parentheses with an optional colon before the
> CR. This probably turns up very seldom though :)
>
> The more severe bug that bit me today is that the replacement part is
> empty, so the whitespace after the line break is deleted along with it
> and the words on either side run together. RFC2047 is an extension of
> RFC822, since revised as RFC 2822, so I suppose section 2.2.3 of
> RFC 2822 is authoritative here:
>
>     The process of moving from this folded multiple-line representation
>     of a header field to its single line representation is called
>     "unfolding". Unfolding is accomplished by simply removing any CRLF
>     that is immediately followed by WSP.
>
> So to be strictly conforming, the expression should be
>     $str =~ s/(?:\r|\n|\r\n)(?=[ \t])//gs;
> (I think it's fair not to include the more exotic WS characters in \s)
>
> However, CRLF followed by a TAB or multiple spaces is often used for
> lines that were originally split on a single space, so the following
> would probably come closer to what most people would expect (this is how
> mutt does it BTW, I haven't checked any other MUAs though):
>     $str =~ s/(?:\r|\n|\r\n)[ \t]+/ /gs;

Sun Feb 01 05:33:21 2009 The RT System itself - Status changed from 'new' to 'open'

Sun Feb 01 05:33:21 2009 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Sun Feb 01 05:33:22 2009 The RT System itself - Status changed from 'resolved' to 'open'

Sun Feb 01 05:33:22 2009 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Sat May 09 10:59:56 2009 NWELLNHOF [...] cpan.org - Correspondence added

There shouldn't be a space inserted between two lines. See RFC 2047, section 6.2:

    When displaying a particular header field that contains multiple
    'encoded-word's, any 'linear-white-space' that separates a pair of
    adjacent 'encoded-word's is ignored. (This is to allow the use of
    multiple 'encoded-word's to represent long strings of unencoded
    text, without having to separate 'encoded-word's where spaces occur
    in the unencoded text.)

Sat May 09 10:59:57 2009 The RT System itself - Status changed from 'resolved' to 'open'

Sat May 09 11:10:47 2009 NWELLNHOF [...] cpan.org - Correspondence added

Note that all whitespace should only be removed between two encoded-words, not if there is other content involved. See bug #40027, which is also valid.
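One way to picture this rule is a substitution that drops whitespace only when an encoded-word sits on both sides of it. The sketch below is not how Encode::MIME::Header implements it; the `$encoded_word` pattern is a simplified assumption (RFC 2047's grammar is stricter), and `join_adjacent_encoded_words` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Simplified encoded-word pattern (assumption: neither the charset nor
# the encoded text contains '?' or whitespace).
my $encoded_word = qr/=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=/;

# Hypothetical helper: remove whitespace, including folds, only where
# it separates two adjacent encoded-words.
sub join_adjacent_encoded_words {
    my ($header) = @_;
    # The lookahead leaves the second encoded-word unconsumed, so it
    # can pair up with a third one in the same pass.
    $header =~ s/($encoded_word)[ \t\r\n]+(?=$encoded_word)/$1/g;
    return $header;
}

# Hypothetical sample: two adjacent encoded-words, then plain text
# folded onto another line.
my $header = "Subject: =?UTF-8?Q?foo?=\r\n =?UTF-8?Q?bar?= plain\r\n text";
my $joined = join_adjacent_encoded_words($header);

# The fold between the encoded-words is gone; the fold inside the
# plain text is left for ordinary unfolding to handle.
print $joined, "\n";
```

Whitespace next to ordinary text is untouched here, so a regular unfolding step would still be needed for the remaining folds.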

Sun Jul 12 21:38:21 2009 DANKOGAI [...] cpan.org - Correspondence added

On Sat May 09 11:10:47 2009, NWELLNHOF wrote:

> Note that all whitespace should only be removed between two
> encoded-words, not if there is other content involved. See bug #40027,
> which is also valid.

See #40027

Dan the Encode Maintainer

Sun Jul 12 21:38:23 2009 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Apr 21 17:51:51 2010 JMEHNLE [...] cpan.org - Cc JMEHNLE added