Bug #67569 for Encode: incorrect unfolding and other decoding bugs in Encode::MIME::RFC2047

Mon Apr 18 19:26:08 2011 florz [...] florz.de - Ticket created

Subject:	incorrect unfolding and other decoding bugs in Encode::MIME::RFC2047
Date:	Tue, 19 Apr 2011 01:25:55 +0200
To:	bug-Encode [...] rt.cpan.org
From:	Florian Zumbiehl <florz [...] florz.de>

Hi,

I started digging because of the incorrect unfolding in Encode::MIME::RFC2047, which I now noticed is already reported as bug #40027. Essentially people have already explained it correctly: unfolding only eats the CRLF, nothing else (just as the RFC quite clearly states). RFC2047 decoding then additionally eats whitespace between encoded words in *text. _Between_ encoded words only.

As I couldn't figure out how to submit additional information for a bug without opening an account, please feel free to merge things as appropriate.

This is very relevant practically as the traditional way for breaking long Subject headers, for example, was to insert CRLFs at the beginning of whitespace sequences (well, and still is where no RFC2047 encoding is necessary), which you corrupt with the current code.

While digging, I found a bunch more bugs and put together a fix which you find below that should bring the code a lot closer to the RFC.

This code indeed is for *text only - there is no way to decode other headers that contain encoded words without first taking apart the respective headers and then decoding words separately anyhow.

Also, here is a list of test cases with their respective correct decoding:

"foo =?us-ascii?q?bar?=" => "foo bar"
"foo\r\n =?us-ascii?q?bar?=" => "foo bar"
"=?us-ascii?q?foo?= bar" => "foo bar"
"=?us-ascii?q?foo?=\r\n bar" => "foo bar"
"foo bar" => "foo bar"
"foo\r\n bar" => "foo bar"
"=?us-ascii?q?foo?= =?us-ascii?q?bar?=" => "foobar"
"=?us-ascii?q?foo?=\r\n =?us-ascii?q?bar?=" => "foobar"
"foo=?us-ascii?q?bar?=" => "foo=?us-ascii?q?bar?="
"=?us-ascii?q?foo?==?us-ascii?q?bar?=" => "foo=?us-ascii?q?bar?="
"=?us-ascii?q?foo bar?=" => "=?us-ascii?q?foo bar?="
"=?us-ascii?q?foo\r\n bar?=" => "=?us-ascii?q?foo bar?="
"foo =?us-ascii?q?=20?==?us-ascii?q?bar?=" => "foo =?us-ascii?q?bar?="

Please note that the code is untested as a whole, I just tested pieces separately.
diff --git a/cpan/Encode/lib/Encode/MIME/Header.pm b/cpan/Encode/lib/Encode/MIME/Header.pm
index 9728dc3..44c7024 100644
--- a/cpan/Encode/lib/Encode/MIME/Header.pm
+++ b/cpan/Encode/lib/Encode/MIME/Header.pm
@@ -40,23 +40,25 @@ sub decode($$;$) {
     use utf8;
     my ( $obj, $str, $chk ) = @_;
 
-    # zap spaces between encoded words
-    $str =~ s/\?=\s+=\?/\?==\?/gos;
-
     # multi-line header to single line
-    $str =~ s/(?:\r\n|[\r\n])[ \t]//gos;
-
-    1 while ( $str =~
-        s/(=\?[-0-9A-Za-z_]+\?[Qq]\?)(.*?)\?=\1(.*?\?=)/$1$2$3/ )
+    $str =~ s/(?:\r\n|[\r\n])(?=[ \t])//gos;
+
+    1 while ( $str =~ s/
+            (?:\A|(?<=[ \t]))
+            (=\?[-0-9A-Za-z_]+\?[Qq]\?)([\x21-\x3e\x40-\x7e]+)\?=
+            [ \t]+
+            \1([\x21-\x3e\x40-\x7e]+\?=)
+        /$1$2$3/x )
       ;    # Concat consecutive QP encoded mime headers
            # Fixes breaking inside multi-byte characters
 
     $str =~ s{
+        (?:\A|\G[ \t]+|(?<=[ \t]))
         =\?              # begin encoded word
         ([-0-9A-Za-z_]+) # charset (encoding)
         (?:\*[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*)? # language (RFC 2231)
         \?([QqBb])\?     # delimiter
-        (.*?)            # Base64-encodede contents
+        ([\x21-\x3e\x40-\x7e]+)
         \?=              # end encoded word
     }{
         if (uc($2) eq 'B'){

Florian
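To make the unfolding difference concrete, here is a minimal sketch that runs the two unfolding substitutions from the diff side by side. The sub names `unfold_old` and `unfold_new` are made up for this illustration and are not part of Encode; only the two regexes are taken from the patch.

```perl
use strict;
use warnings;

# Old behaviour: removes the line break AND the first whitespace character after it.
sub unfold_old {
    my ($str) = @_;
    $str =~ s/(?:\r\n|[\r\n])[ \t]//gos;
    return $str;
}

# Patched behaviour: removes only the line break; the lookahead leaves the
# following whitespace in place.
sub unfold_new {
    my ($str) = @_;
    $str =~ s/(?:\r\n|[\r\n])(?=[ \t])//gos;
    return $str;
}

my $folded = "foo\r\n bar";
printf "old: '%s'\n", unfold_old($folded);
printf "new: '%s'\n", unfold_new($folded);
```

With the old substitution the two words run together, which is the corruption of traditionally folded Subject headers described above.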

Sat May 21 19:07:36 2011 DANKOGAI [...] cpan.org - Status changed from 'new' to 'open'

Sat May 21 19:10:27 2011 DANKOGAI [...] cpan.org - Correspondence added

I tried your patch but unfortunately it breaks existing tests.

Dan the Maintainer Thereof

Tue Sep 17 17:17:45 2013 wiml [...] hhhh.org - Correspondence added

On Sat May 21 19:10:27 2011, DANKOGAI wrote:

> I tried your patch but unfortunately it breaks existing tests.

The two tests it breaks may not actually be correct; they aren't RFC2047-conformant examples, at least. As Florian Zumbiehl says, decoding To/From headers has to happen after tokenization if you want to get the right results; unless the caller has already done that, Encode::MIME::RFC2047 can probably only correctly decode the *text headers such as Subject.

Wed Sep 18 02:39:15 2013 florz [...] florz.de - Correspondence added

CC:	Wim Lewis via RT <bug-Encode [...] rt.cpan.org>
Subject:	Re: [rt.cpan.org #67569] incorrect unfolding and other decoding bugs in Encode::MIME::RFC2047
Date:	Wed, 18 Sep 2013 08:39:23 +0200
To:	wiml [...] hhhh.org
From:	Florian Zumbiehl <florz [...] florz.de>

Hi,

> <URL: https://rt.cpan.org/Ticket/Display.html?id=67569 >
>
> On Sat May 21 19:10:27 2011, DANKOGAI wrote:
> > I tried your patch but unfortunately it breaks existing tests.
>
> The two tests it breaks may not actually be correct; they aren't RFC2047-conformant examples, at least. As Florian Zumbiehl says, decoding To/From headers has to happen after tokenization if you want to get the right results; unless the caller has already done that, Encode::MIME::RFC2047 can probably only correctly decode the *text headers such as Subject.

Can you point me to the tests that are failing with the patch? I asked the maintainer for more specific information about two years ago by email but never got a reply, and unfortunately have forgotten most of the details by now ...

Regards, Florian

Wed Sep 18 20:25:42 2013 wiml [...] hhhh.org - Correspondence added

On Wed Sep 18 02:39:15 2013, florz@florz.de wrote:

> can you point me to the tests that are failing with the patch?

The tests are the ones decoding $bheader and $qheader in t/mime-header.t. The relevant bits of text are

From:=?UTF-8?B?IOWwj+mjvCDlvL4g?=<dankogai@dan.co.jp>
To: dankogai@dan.co.jp (=?UTF-8?B?5bCP6aO8?==Kogai,=?UTF-8?B?IOW8vg==?==
 Dan)

In both cases, the patched MIME::RFC2047 decoder doesn't translate the encoded-words which are run together with adjacent tokens, and I think the patched behavior is more correct. The first line would be decodable by a program which tokenized the header and passed only the 2047-encodable phrases to Encode, but the To: header shouldn't be decodable by an RFC2047 6.1(2) compliant decoder.
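As a rough illustration of why these two lines yield nothing to decode under the stricter reading, the sketch below splits a header on whitespace and keeps only the tokens that are, in their entirety, an encoded-word. The helper name `encoded_word_tokens` is hypothetical, and the pattern is a simplified form of the one in the patch (no RFC 2231 language suffix, no 75-character limit).

```perl
use strict;
use warnings;

# An encoded-word must span the whole whitespace-delimited token.
my $ENCODED_WORD = qr/\A=\?[-0-9A-Za-z_]+\?[QqBb]\?[\x21-\x3e\x40-\x7e]+\?=\z/;

# Hypothetical helper: return the tokens of $str that qualify as encoded-words.
sub encoded_word_tokens {
    my ($str) = @_;
    return grep { $_ =~ $ENCODED_WORD } split /[ \t]+/, $str;
}

for my $line (
    'From:=?UTF-8?B?IOWwj+mjvCDlvL4g?=<dankogai@dan.co.jp>',
    'To: dankogai@dan.co.jp (=?UTF-8?B?5bCP6aO8?==Kogai,=?UTF-8?B?IOW8vg==?== Dan)',
    'Subject: =?UTF-8?B?5bCP6aO8?= test',
) {
    my @found = encoded_word_tokens($line);
    printf "%d encoded-word(s) in: %s\n", scalar @found, $line;
}
```

Only the Subject line contains a token that stands alone between whitespace; in the From and To lines the encoded-words are glued to neighbouring characters.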

Thu Sep 19 07:45:54 2013 florz [...] florz.de - Correspondence added

Subject:	Re: [rt.cpan.org #67569] incorrect unfolding and other decoding bugs in Encode::MIME::RFC2047
Date:	Thu, 19 Sep 2013 13:46:00 +0200
To:	Wim Lewis via RT <bug-Encode [...] rt.cpan.org>
From:	Florian Zumbiehl <florz [...] florz.de>

Hi,

> The tests are the ones decoding $bheader and $qheader in t/mime-header.t. The relevant bits of text are
>
> From:=?UTF-8?B?IOWwj+mjvCDlvL4g?=<dankogai@dan.co.jp>
> To: dankogai@dan.co.jp (=?UTF-8?B?5bCP6aO8?==Kogai,=?UTF-8?B?IOW8vg==?==
>  Dan)
>
> In both cases, the patched MIME::RFC2047 decoder doesn't translate the encoded-words which are run together with adjacent tokens, and I think the patched behavior is more correct. The first line would be decodable by a program which tokenized the header and passed only the 2047-encodable phrases to Encode, but the To: header shouldn't be decodable by an RFC2047 6.1(2) compliant decoder.

Well, formally, it (the comment, that is) certainly should be decodable, and it should decode to "=?UTF-8?B?5bCP6aO8?==Kogai,=?UTF-8?B?IOW8vg==?== Dan" ;-)

But, yeah, I agree, the test really doesn't make much sense. If $bheader is supposed to be an RFC2047 encoded string, then the decoding in $dheader is wrong, the only thing a correct decoder should change during decoding in that case is (part of) the Subject: field (strictly, only the last three atoms, the first one is not an encoded-word due to its length), and unfold some of the line breaks. Rejecting the input presumably would also be OK as the line breaks in parts don't actually follow the rules for RFC2047 encoded strings.

If, on the other hand, this is supposed to be a full set of RFC822 message headers, then putting that through an RFC2047 decoder makes no sense at all, you might just as well try an HTML parser. This is an RFC2047 parser, not an RFC822 parser.

I just noticed, though, that these test cases that I submitted probably are wrong as well:

"=?us-ascii?q?foo?==?us-ascii?q?bar?=" => "foo=?us-ascii?q?bar?="
"foo =?us-ascii?q?=20?==?us-ascii?q?bar?=" => "foo =?us-ascii?q?bar?="

The correct decodings probably rather should look like this:

"=?us-ascii?q?foo?==?us-ascii?q?bar?=" => "=?us-ascii?q?foo?==?us-ascii?q?bar?="
"foo =?us-ascii?q?=20?==?us-ascii?q?bar?=" => "foo =?us-ascii?q?=20?==?us-ascii?q?bar?="

If someone finally manages to merge this bugfix, I might be willing to fix the parser to handle those cases correctly as well, but for now the fix as it is should still be much better than the current state of affairs.

Regards, Florian
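For reference, the rules discussed in this ticket can be sketched as a small token-based decoder for *text. This is not the code in Encode and not the submitted patch: `decode_text` and `decode_word` are hypothetical names, the approach (split on whitespace, decode only whole-token encoded-words of at most 75 characters, drop whitespace only between two adjacent encoded-words) is an assumption about what a strict decoder would do, and it follows the corrected expectations given just above for encoded-words that are run together.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use MIME::Base64 qw(decode_base64);

# Hypothetical helper: decode one token if it is, as a whole, an encoded-word.
# Returns undef (empty list) for anything else.
sub decode_word {
    my ($tok) = @_;
    return if length($tok) > 75;    # length limit for a single encoded-word
    return unless $tok =~ m{\A=\?([-0-9A-Za-z_]+)      # charset
                            (?:\*[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*)?  # language
                            \?([QqBb])\?               # encoding
                            ([\x21-\x3e\x40-\x7e]+)    # text, no "?" or space
                            \?=\z}x;
    my ( $charset, $enc, $text ) = ( $1, uc $2, $3 );
    my $codec = find_encoding($charset) or return;
    my $bytes;
    if ( $enc eq 'B' ) {
        $bytes = decode_base64($text);
    }
    else {
        ( $bytes = $text ) =~ tr/_/ /;                    # "_" stands for a space
        $bytes =~ s/=([0-9A-Fa-f]{2})/chr hex $1/ge;      # =XX hex escapes
    }
    return $codec->decode($bytes);
}

# Hypothetical helper: decode an unstructured (*text) header value.
sub decode_text {
    my ($str) = @_;
    $str =~ s/(?:\r\n|[\r\n])(?=[ \t])//g;    # unfold: drop the line break only
    my ( $out, $pending_ws, $prev_encoded ) = ( '', '', 0 );
    for my $tok ( split /([ \t]+)/, $str ) {
        next if $tok eq '';
        if ( $tok =~ /\A[ \t]+\z/ ) { $pending_ws .= $tok; next }
        my $decoded    = decode_word($tok);
        my $is_encoded = defined $decoded;
        # whitespace is dropped only between two adjacent encoded-words
        $out .= $pending_ws unless $prev_encoded && $is_encoded;
        $out .= $is_encoded ? $decoded : $tok;
        ( $pending_ws, $prev_encoded ) = ( '', $is_encoded );
    }
    return $out . $pending_ws;
}

print decode_text("=?us-ascii?q?foo?=\r\n =?us-ascii?q?bar?="), "\n";
```

Working on whole tokens avoids the lookbehind bookkeeping of a single regex pass, at the cost of treating every whitespace-delimited token independently.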

Thu Oct 29 10:34:02 2015 VKHERA [...] cpan.org - Correspondence added

At least the broken decoding is consistent with the encoding in eating white spaces on line wraps. Given the header

$h = "From: The quick brown fox runs over the lazy dog \N{WOLF FACE} <wolfy\@example.com>";

Encode produces the incorrect output:

From:=?UTF-8?Q?=20The=20quick=20brown=20fox=20?= =?UTF-8?Q?runs=20over=20the=20lazy=20do?= =?UTF-8?Q?g=20=F0=9F=90=BA=20?=< wolfy@example.com>

There are two things wrong here:

First, when a properly implemented decode is run against that, there is a space between the "<" and the "wolfy". The broken decoder in Encode::MIME::Header eats that space, so it is self-consistent, yet wrong.

Secondly, RFC 2047 forbids encoding the address part. Encode does that as well even though the documents state it will not encode parts that are not supposed to be. Given this header:

my $h = "From: The quick brown fox runs over the lazy dog \N{WOLF FACE} <wolfy\N{WOLF FACE}\@example.com>";

The output looks like this:

From:=?UTF-8?Q?=20The=20quick=20brown=20fox=20?= =?UTF-8?Q?runs=20over=20the=20lazy=20do?= =?UTF-8?Q?g=20=F0=9F=90=BA=20?=< =?UTF-8?Q?wolfy=F0=9F=90=BA=40example=2Ecom?=>

Now, I'm not sure what you're supposed to do with that utf8 character in the address part, but 2047 says don't mess with it. Sending it raw works with at least some mail servers and clients.

Thu Oct 29 16:28:29 2015 florz [...] florz.de - Correspondence added

Subject:	Re: [rt.cpan.org #67569] incorrect unfolding and other decoding bugs in Encode::MIME::RFC2047
Date:	Thu, 29 Oct 2015 21:28:11 +0100
To:	Vivek Khera via RT <bug-Encode [...] rt.cpan.org>
From:	Florian Zumbiehl <florz [...] florz.de>

Hi,

> Now, I'm not sure what you're supposed to do with that utf8 character in the address part, but 2047 says don't mess with it. Sending it raw works with at least some mail servers and clients.

You essentially have it all backwards. RFC2047, as far as address fields are concerned, is for encoding the display name, not for "encoding an address field" - feeding an address field into an RFC2047 encoder is a type error. This supposed RFC2047 encoder still is horribly broken, as this:

> my $h = "From: The quick brown fox runs over the lazy dog \N{WOLF FACE} <wolfy\N{WOLF FACE}\@example.com>";

Would have to be encoded correctly into something like this:

"=?UTF-8?Q?From=3A?= The quick brown fox runs over the lazy dog =?UTF-8?Q?=F0=9F=90=BA_=3Cwolfy=F0=9F=90=BA=40example=2Ecom=3E?="

And then you could append an address and prepend "From:" in order to use this rather weird display name in the source address of an email.

Regards, Florian
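The usual way to build such a header, then, is to encode only the display name and attach the address unencoded. The following is a minimal sketch of that idea with a hand-rolled Q encoder; `q_encode_phrase` and `from_header` are hypothetical helpers, not Encode functions, and the sketch makes simplifying assumptions: it always encodes the whole name (even pure ASCII), separates encoded-words with a single space instead of folding lines, and does no validation of the address.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: Q-encode a display name as one or more UTF-8
# encoded-words, never splitting inside a multi-byte character.
sub q_encode_phrase {
    my ($name) = @_;
    my @words;
    my $cur = '';
    for my $ch ( split //, $name ) {
        my $piece = join '', map {
            my $byte = $_;
              $byte eq ' '                        ? '_'
            : $byte =~ /\A[A-Za-z0-9!*+\/-]\z/    ? $byte
            :                                       sprintf( '=%02X', ord $byte )
        } split //, encode( 'UTF-8', $ch );
        # "=?UTF-8?Q?" plus "?=" take 12 of the 75 characters allowed per word
        if ( length($cur) + length($piece) > 63 ) {
            push @words, $cur;
            $cur = '';
        }
        $cur .= $piece;
    }
    push @words, $cur if length $cur;
    return join ' ', map { "=?UTF-8?Q?$_?=" } @words;
}

# Hypothetical helper: the address is appended as-is, never encoded.
sub from_header {
    my ( $name, $addr ) = @_;
    return 'From: ' . q_encode_phrase($name) . " <$addr>";
}

print from_header( "dog \x{1F43A}", 'wolfy@example.com' ), "\n";
```

Because a decoder drops the whitespace between adjacent encoded-words, a long name split across several words is rejoined without extra spaces.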

Fri Jan 22 01:28:14 2016 DANKOGAI [...] cpan.org - Correspondence added

cf. https://rt.cpan.org/Ticket/Display.html?id=88717

Fri Jan 22 01:28:20 2016 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Fri Jan 22 12:48:47 2016 florz [...] florz.de - Correspondence added

Subject:	Re: [rt.cpan.org #67569] Resolved: incorrect unfolding and other decoding bugs in Encode::MIME::RFC2047
Date:	Fri, 22 Jan 2016 18:48:32 +0100
To:	Dan Kogai via RT <bug-Encode [...] rt.cpan.org>
From:	Florian Zumbiehl <florz [...] florz.de>

> <URL: https://rt.cpan.org/Ticket/Display.html?id=67569 >
>
> According to our records, your request has been resolved. If you have any
> further questions or concerns, please respond to this message.

That's obviously bullshit.

Tue Mar 29 14:37:20 2016 pali [...] cpan.org - Cc PALI added

Tue Mar 29 14:37:30 2016 pali [...] cpan.org - Correspondence added

It should be fixed in 2.83.