Subject: incorrect unfolding and other decoding bugs in Encode::MIME::RFC2047
Date: Tue, 19 Apr 2011 01:25:55 +0200
To: bug-Encode [...] rt.cpan.org
From: Florian Zumbiehl <florz [...] florz.de>

Hi,

I started digging because of the incorrect unfolding in
Encode::MIME::RFC2047, which I have now noticed is already reported as
bug #40027. Essentially, people there have already explained it correctly:
unfolding only eats the CRLF, nothing else (just as the RFC quite
clearly states). RFC 2047 decoding then additionally eats whitespace
between encoded words in *text. _Between_ encoded words only.

As I couldn't figure out how to submit additional information for
a bug without opening an account, please feel free to merge things
as appropriate.

This matters in practice: the traditional way of breaking long Subject
headers, for example, was to insert a CRLF at the beginning of a
whitespace sequence (and it still is where no RFC 2047 encoding is
necessary). The current code corrupts such headers, because it deletes a
whitespace character along with the CRLF.
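
To make the corruption concrete, here is a minimal Perl sketch that
compares the two unfolding substitutions. Only the two regexes are taken
from the diff further down; the sub names unfold_old and unfold_new and
the sample Subject are made up for illustration.

```perl
use strict;
use warnings;

# Unfolding as in the code before the patch: the line break AND the
# first whitespace character after it are removed.
sub unfold_old {
    my ($str) = @_;
    $str =~ s/(?:\r\n|[\r\n])[ \t]//g;
    return $str;
}

# Unfolding as in the patch: only the line break is removed; the
# lookahead leaves the whitespace after it in place.
sub unfold_new {
    my ($str) = @_;
    $str =~ s/(?:\r\n|[\r\n])(?=[ \t])//g;
    return $str;
}

# Hypothetical Subject value, folded the traditional way in front of
# a whitespace character.
my $subject = "a rather long subject that had to be\r\n folded somewhere";

print unfold_old($subject), "\n";    # a rather long subject that had to befolded somewhere
print unfold_new($subject), "\n";    # a rather long subject that had to be folded somewhere
```

The old substitution glues "be" and "folded" together, while the
lookahead variant keeps the space that separated the two words.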

While digging, I found a bunch more bugs and put together a fix,
which you will find below, that should bring the code a lot closer to
the RFC.

This code is indeed for *text only: there is no way to decode
other headers that contain encoded words without first taking the
respective headers apart and then decoding the words separately anyway.

Also, here is a list of test cases with their respective correct
decodings:

"foo =?us-ascii?q?bar?=" => "foo bar"
"foo\r\n =?us-ascii?q?bar?=" => "foo bar"
"=?us-ascii?q?foo?= bar" => "foo bar"
"=?us-ascii?q?foo?=\r\n bar" => "foo bar"
"foo bar" => "foo bar"
"foo\r\n bar" => "foo bar"
"=?us-ascii?q?foo?= =?us-ascii?q?bar?=" => "foobar"
"=?us-ascii?q?foo?=\r\n =?us-ascii?q?bar?=" => "foobar"
"foo=?us-ascii?q?bar?=" => "foo=?us-ascii?q?bar?="
"=?us-ascii?q?foo?==?us-ascii?q?bar?=" => "foo=?us-ascii?q?bar?="
"=?us-ascii?q?foo bar?=" => "=?us-ascii?q?foo bar?="
"=?us-ascii?q?foo\r\n bar?=" => "=?us-ascii?q?foo bar?="
"foo =?us-ascii?q?=20?==?us-ascii?q?bar?=" => "foo  =?us-ascii?q?bar?="

Please note that the patch below is untested as a whole; I have only
tested pieces of it separately.
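
The decoding rules behind the test cases can also be written out as a
small stand-alone Perl sketch. This is not the module's code and not the
patch: decode_text_sketch and decode_q are hypothetical names, and the
sketch assumes Q-encoded us-ascii only so that it needs no modules. It
recognises an encoded word only at the start of the string or directly
after whitespace, and it drops whitespace only when that whitespace sits
between two complete encoded words.

```perl
use strict;
use warnings;

# Decode the Q encoding: "_" stands for a space, "=XX" is a hex escape.
sub decode_q {
    my ($text) = @_;
    $text =~ tr/_/ /;
    $text =~ s/=([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $text;
}

# Hypothetical reference decoder for *text (Q-encoded us-ascii only).
sub decode_text_sketch {
    my ($str) = @_;

    # Unfold: remove only the line break, keep the whitespace after it.
    $str =~ s/\r\n(?=[ \t])//g;

    # encoded-text may contain neither whitespace nor "?"
    my $ew = qr/=\?(?i:us-ascii)\?[Qq]\?([\x21-\x3e\x40-\x7e]+)\?=/;

    my $out        = '';
    my $after_word = 0;     # previous token was exactly one encoded word
    my $ws         = '';    # whitespace not yet emitted or dropped

    # Split into alternating non-whitespace and whitespace tokens.
    for my $tok ( split /([ \t]+)/, $str ) {
        next if $tok eq '';
        if ( $tok =~ /\A[ \t]/ ) { $ws = $tok; next; }

        if ( $tok =~ /\A$ew(.*)\z/s ) {
            my ( $text, $rest ) = ( $1, $2 );
            $out .= $ws unless $after_word;    # drop only between encoded words
            $out .= decode_q($text) . $rest;
            $after_word = ( $rest eq '' );
        }
        else {
            $out .= $ws . $tok;                # plain text, kept verbatim
            $after_word = 0;
        }
        $ws = '';
    }
    return $out . $ws;                         # keep trailing whitespace
}

my @cases = (
    [ "foo\r\n =?us-ascii?q?bar?="                 => "foo bar" ],
    [ "=?us-ascii?q?foo?=\r\n =?us-ascii?q?bar?="  => "foobar" ],
    [ "foo=?us-ascii?q?bar?="                      => "foo=?us-ascii?q?bar?=" ],
    [ "=?us-ascii?q?foo\r\n bar?="                 => "=?us-ascii?q?foo bar?=" ],
);
for my $case (@cases) {
    my ( $in, $want ) = @$case;
    my $got = decode_text_sketch($in);
    print $got eq $want ? "ok\n" : "not ok: got '$got'\n";
}
```

Splitting on whitespace first keeps the "only at a word boundary" rule
explicit, at the cost of not handling any charset or encoding beyond the
one used in the test cases.
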
diff --git a/cpan/Encode/lib/Encode/MIME/Header.pm b/cpan/Encode/lib/Encode/MIME/Header.pm
index 9728dc3..44c7024 100644
--- a/cpan/Encode/lib/Encode/MIME/Header.pm
+++ b/cpan/Encode/lib/Encode/MIME/Header.pm
@@ -40,23 +40,25 @@ sub decode($$;$) {
use utf8;
my ( $obj, $str, $chk ) = @_;
- # zap spaces between encoded words
- $str =~ s/\?=\s+=\?/\?==\?/gos;
-
# multi-line header to single line
- $str =~ s/(?:\r\n|[\r\n])[ \t]//gos;
-
- 1 while ( $str =~
- s/(=\?[-0-9A-Za-z_]+\?[Qq]\?)(.*?)\?=\1(.*?\?=)/$1$2$3/ )
+ $str =~ s/(?:\r\n|[\r\n])(?=[ \t])//gos;
+
+ 1 while ( $str =~ s/
+ (?:\A|(?<=[ \t]))
+ (=\?[-0-9A-Za-z_]+\?[Qq]\?)([\x21-\x3e\x40-\x7e]+)\?=
+ [ \t]+
+ \1([\x21-\x3e\x40-\x7e]+\?=)
+ /$1$2$3/x )
; # Concat consecutive QP encoded mime headers
# Fixes breaking inside multi-byte characters
$str =~ s{
+ (?:\A|\G[ \t]+|(?<=[ \t]))
=\? # begin encoded word
([-0-9A-Za-z_]+) # charset (encoding)
(?:\*[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*)? # language (RFC 2231)
\?([QqBb])\? # delimiter
- (.*?) # Base64-encodede contents
+ ([\x21-\x3e\x40-\x7e]+)
\?= # end encoded word
}{
if (uc($2) eq 'B'){
Florian