
This queue is for tickets about the Encode CPAN distribution.

Report information
The Basics
Id: 42902
Status: resolved
Priority: 0/
Queue: Encode

People
Owner: Nobody in particular
Requestors: MBETHKE [...] cpan.org
Cc: JMEHNLE [...] cpan.org
AdminCc:

Bug Information
Severity: Normal
Broken in:
  • 2.05
  • 2.06
Fixed in: (no value)



Subject: Split header lines are joined incorrectly
Encode::MIME::Header uses the following regex to join split lines:

    $str =~ s/(:?\r|\n|\r\n)[ \t]//gos;

This is b0rked in two ways: first, "(:?" is not non-capturing parentheses but capturing parentheses with an optional colon before the CR. This probably turns up very seldom though :)

The more severe bug that bit me today is that the replacement part is empty. RFC 2047 is an extension of RFC 822, so I suppose section 2.2.3 of RFC 2822 (which obsoletes RFC 822) is authoritative here:

    The process of moving from this folded multiple-line representation
    of a header field to its single line representation is called
    "unfolding". Unfolding is accomplished by simply removing any CRLF
    that is immediately followed by WSP.

So to be strictly conforming, the expression should be

    $str =~ s/(?:\r|\n|\r\n)(?=[ \t])//gs;

(I think it's fair not to include the more exotic WS characters in \s)

However, CRLF followed by a TAB or multiple spaces is often used for lines that were originally split on a single space, so the following would probably come closer to what most people would expect (this is how mutt does it BTW, I haven't checked any other MUAs though):

    $str =~ s/(?:\r|\n|\r\n)[ \t]+/ /gs;
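To make the difference between the three substitutions concrete, here is a minimal sketch that runs each of them on the same folded header. The helper names (unfold_original, unfold_strict, unfold_lenient) and the sample headers are made up for this illustration and are not part of Encode::MIME::Header; only the regexes are taken from the report.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.

# The pattern as reported: "(:?" captures and allows an optional colon,
# and the empty replacement also drops the whitespace after the break.
sub unfold_original {
    my ($s) = @_;
    $s =~ s/(:?\r|\n|\r\n)[ \t]//gs;
    return $s;
}

# Strict unfolding: drop the line break, keep the whitespace that follows.
sub unfold_strict {
    my ($s) = @_;
    $s =~ s/(?:\r|\n|\r\n)(?=[ \t])//gs;
    return $s;
}

# Lenient unfolding: replace the break plus whitespace run with one space.
sub unfold_lenient {
    my ($s) = @_;
    $s =~ s/(?:\r|\n|\r\n)[ \t]+/ /gs;
    return $s;
}

# A header folded with CRLF followed by a TAB.
my $folded = "Subject: first part\r\n\tsecond part";

print unfold_original($folded), "\n";   # Subject: first partsecond part
print unfold_strict($folded),   "\n";   # keeps the TAB between the parts
print unfold_lenient($folded),  "\n";   # Subject: first part second part

# The stray optional colon only bites when a bare CR directly follows a colon.
print unfold_original("X-Note:\r value"), "\n";   # X-Notevalue
```

The first call shows the reported symptom: the two words on either side of the fold run together because the whitespace is removed along with the line break.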
Thanks, fixed in 2.28. The line is now replaced with

    $str =~ s/(?:\r|\n|\r\n)[ \t]+/ /gos;

Dan the Maintainer Thereof

On Thu Jan 29 14:21:43 2009, mbethke wrote:
> Encode::MIME::Header uses the following regex to join split lines:
> [...]
There shouldn't be a space inserted between two lines. See RFC 2047, section 6.2:

    When displaying a particular header field that contains multiple
    'encoded-word's, any 'linear-white-space' that separates a pair of
    adjacent 'encoded-word's is ignored. (This is to allow the use of
    multiple 'encoded-word's to represent long strings of unencoded
    text, without having to separate 'encoded-word's where spaces occur
    in the unencoded text.)
Note that all whitespace should only be removed between two encoded-words, not if there is other content involved. See bug #40027, which is also valid.
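The rule from the two comments above (drop whitespace only when it sits between two adjacent encoded-words, leave it alone otherwise) could be sketched roughly as follows. This is not how Encode::MIME::Header implements it; the helper name join_encoded_words and the simplified encoded-word pattern are assumptions made for this example, and "linear-white-space" is approximated as any run of space, TAB, CR and LF.

```perl
use strict;
use warnings;

# Simplified encoded-word pattern (assumption): =?charset?B-or-Q?text?=
# The full RFC 2047 grammar is stricter about the allowed characters.
my $ew = qr/=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=/;

# Hypothetical helper: remove whitespace only where it separates two
# adjacent encoded-words; whitespace next to plain text is left untouched.
sub join_encoded_words {
    my ($s) = @_;
    # The lookahead does not consume the second encoded-word, so a chain of
    # three or more adjacent encoded-words is joined in a single pass.
    $s =~ s/($ew)[ \t\r\n]+(?=$ew)/$1/g;
    return $s;
}

# Whitespace between two encoded-words is dropped.
print join_encoded_words("=?UTF-8?Q?a?=\r\n =?UTF-8?Q?b?="), "\n";
# =?UTF-8?Q?a?==?UTF-8?Q?b?=

# Whitespace between an encoded-word and plain text is kept.
print join_encoded_words("=?UTF-8?Q?a?= plain"), "\n";
# =?UTF-8?Q?a?= plain
```

Ordinary unfolding of the remaining line breaks would still have to happen in a separate step; this sketch covers only the encoded-word case.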
On Sat May 09 11:10:47 2009, NWELLNHOF wrote:
> Note that all whitespace should only be removed between two
> encoded-words, not if there is other content involved. See bug #40027,
> which is also valid.
See #40027

Dan the Encode Maintainer