Bug #13027 for MIME-tools: MIME::Words::encode_mimewords split one character into separated MIME blocks

Tue May 31 02:05:19 2005 Guest - Ticket created

Subject:

MIME::Words::encode_mimewords split one character into separated MIME blocks

MIME::Words::encode_mimewords can split a single character across separate MIME blocks, because it splits the string every 18 bytes, and CJKT codecs have multibyte characters. This causes trouble (missing characters, unreadable subjects, etc.) in MUAs that decode the encoded strings.

The attached patch takes effect when the string is in a multibyte charset (neither us-ascii nor iso-8859-*). It works as follows:

1. Decode the string into UTF-8.
2. Separate it into chunks of up to 18 characters (not bytes) each.
3. If all characters in a chunk are in \x00-\xff (single-byte characters), output them as is.
4. Otherwise, encode the chunk back into the original charset and MIME-encode it.
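The difference between byte-based and character-based chunking can be seen in a minimal sketch. It does not call MIME::Words. It only applies the original 18-byte regex to a UTF-8 byte string and compares that with cutting every 18 characters after decoding. The sample text and the helper decodes_cleanly are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample text (an assumption for illustration): one ASCII letter followed by
# twenty copies of U+65E5, each of which takes 3 bytes in UTF-8.
my $text  = "A" . ("\x{65E5}" x 20);
my $bytes = encode('UTF-8', $text);

# Hypothetical helper: returns 1 if the octets are complete, valid UTF-8.
sub decodes_cleanly {
    my ($octets) = @_;
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Byte-oriented split: the word regex from the unpatched code, applied to bytes.
my @byte_chunks = $bytes =~ /([a-zA-Z0-9\x7F-\xFF]{1,18})/g;

# Character-oriented split: decode first, cut every 18 characters, re-encode.
my @char_chunks = map { encode('UTF-8', $_) }
                  (decode('UTF-8', $bytes) =~ /(.{1,18})/gs);

my $bad_byte_chunks = grep { !decodes_cleanly($_) } @byte_chunks;
my $bad_char_chunks = grep { !decodes_cleanly($_) } @char_chunks;

printf "byte split: %d of %d chunks are broken\n",
    $bad_byte_chunks, scalar @byte_chunks;
printf "char split: %d of %d chunks are broken\n",
    $bad_char_chunks, scalar @char_chunks;
```

Because of the leading ASCII letter, every 18-byte boundary in this sample falls inside a 3-byte character, so none of the byte-based chunks can be decoded on its own, while the character-based chunks all can.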

--- lib/MIME/Words.pm 2003-06-07 08:41:55.000000000 +0900
+++ my/lib/MIME/Words.pm 2005-05-31 13:24:04.689096392 +0900
@@ -307,16 +306,32 @@
     my $charset = $params{Charset} || 'ISO-8859-1';
     my $encoding = lc($params{Encoding} || 'q');
-    ### Encode any "words" with unsafe characters.
-    ### We limit such words to 18 characters, to guarantee that the
-    ### worst-case encoding give us no more than 54 + ~10 < 75 characters
-    my $word;
-    $rawstr =~ s{([a-zA-Z0-9\x7F-\xFF]{1,18})}{ ### get next "word"
-        $word = $1;
-        (($word !~ /[$NONPRINT]/o)
-         ? $word ### no unsafe chars
-         : encode_mimeword($word, $encoding, $charset)); ### has unsafe chars
-    }xeg;
+    if ($charset =~ /^iso-8859-\d+$/i || lc($charset) == 'us-ascii') {
+        ### Encode any "words" with unsafe characters.
+        ### We limit such words to 18 characters, to guarantee that the
+        ### worst-case encoding give us no more than 54 + ~10 < 75 characters
+        my $word;
+        $rawstr =~ s{([a-zA-Z0-9\x7F-\xFF]{1,18})}{ ### get next "word"
+            $word = $1;
+            (($word !~ /[$NONPRINT]/o)
+             ? $word ### no unsafe chars
+             : encode_mimeword($word, $encoding, $charset)); ### has unsafe chars
+        }xeg;
+    } else {
+        ### Encode "words" which contains multibyte characters.
+        use Encode;
+        my $unistr = Encode::decode($charset, $rawstr);
+        my $word;
+        $unistr =~ s{(.{1,18})}{ ### get next "word"
+            $word = $1;
+            (($word =~ /^[a-zA-Z0-9\x7F-\xFF]+$/o) ### is printable?
+             ? Encode::encode('iso-8859-1', $word) ### decode single-byte chars
+             : encode_mimeword(Encode::encode($charset, $word),
+                               $encoding, $charset)); ### has unsafe or multibyte char
+        }xeg;
+        $rawstr = $unistr;
+    }
+
     $rawstr;
 }

Tue May 31 06:24:15 2005 Guest - Correspondence added

From:

Kazuo Moriwaka

[guest - Tue May 31 02:05:19 2005]: I fixed the patch. The old one had some bugs.

--- Words.pm.orig 2005-01-14 04:23:15.000000000 +0900
+++ Words.pm 2005-05-31 19:18:01.994028896 +0900
@@ -306,16 +306,32 @@
     my $charset = $params{Charset} || 'ISO-8859-1';
     my $encoding = lc($params{Encoding} || 'q');
-    ### Encode any "words" with unsafe characters.
-    ### We limit such words to 18 characters, to guarantee that the
-    ### worst-case encoding give us no more than 54 + ~10 < 75 characters
-    my $word;
-    $rawstr =~ s{([a-zA-Z0-9\x7F-\xFF]{1,18})}{ ### get next "word"
-        $word = $1;
-        (($word !~ /[$NONPRINT]/o)
-         ? $word ### no unsafe chars
-         : encode_mimeword($word, $encoding, $charset)); ### has unsafe chars
-    }xeg;
+    if ($charset =~ /^iso-8859-\d+$/i || lc($charset) eq 'us-ascii') {
+        ### Encode any "words" with unsafe characters.
+        ### We limit such words to 18 characters, to guarantee that the
+        ### worst-case encoding give us no more than 54 + ~10 < 75 characters
+        my $word;
+        $rawstr =~ s{([a-zA-Z0-9\x7F-\xFF]{1,18})}{ ### get next "word"
+            $word = $1;
+            (($word !~ /[$NONPRINT]/o)
+             ? $word ### no unsafe chars
+             : encode_mimeword($word, $encoding, $charset)); ### has unsafe chars
+        }xeg;
+    } else {
+        ### Encode "words" which contains multibyte characters.
+        use Encode;
+        my $unistr = Encode::decode($charset, $rawstr);
+        my $word;
+        $unistr =~ s{([\x00-\xFF]{1,18}|.{1,18})}{ ### get next "word"
+            $word = $1;
+            (($word =~ /^[\x00-\xFF]+$/o) ### is in 1 byte?
+             ? Encode::encode('iso-8859-1', $word) ### decode single-byte chars
+             : encode_mimeword(Encode::encode($charset, $word),
+                               $encoding, $charset)); ### has unsafe or multibyte char
+        }xeg;
+        $rawstr = Encode::encode("iso-8859-1", $unistr);
+    }
+
     $rawstr;
 }
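One change visible between the first patch and this one is the charset test, which went from `lc($charset) == 'us-ascii'` to `lc($charset) eq 'us-ascii'`. The ticket does not say which bugs were meant, so the following is only a sketch of why that particular change matters: `==` compares numerically, and charset names such as these all evaluate to 0 as numbers. The two helper subs are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the test in the first patch (numeric ==).
sub matches_with_numeric_eq {
    my ($charset) = @_;
    no warnings 'numeric';    # silence "isn't numeric" warnings for the demo
    return ($charset =~ /^iso-8859-\d+$/i || lc($charset) == 'us-ascii') ? 1 : 0;
}

# Hypothetical helper mirroring the test in the revised patch (string eq).
sub matches_with_string_eq {
    my ($charset) = @_;
    return ($charset =~ /^iso-8859-\d+$/i || lc($charset) eq 'us-ascii') ? 1 : 0;
}

for my $charset (qw(ISO-8859-1 US-ASCII UTF-8 EUC-JP)) {
    printf "%-10s  ==: %d  eq: %d\n",
        $charset,
        matches_with_numeric_eq($charset),
        matches_with_string_eq($charset);
}
```

With the numeric comparison, multibyte charsets such as UTF-8 and EUC-JP also pass the test, so the new multibyte branch would never be reached for them.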

Fri Jun 17 01:43:41 2005 Guest - Correspondence added

From:

Kazuo Moriwaka

[moriwaka - Tue May 31 06:24:15 2005]: I fixed it again.

--- Words.pm.orig 2005-01-14 04:23:15.000000000 +0900
+++ Words.pm 2005-06-03 19:07:55.129896800 +0900
@@ -118,6 +118,7 @@
 sub _encode_Q {
     my $str = shift;
     $str =~ s{([_\?\=$NONPRINT])}{sprintf("=%02X", ord($1))}eog;
+    $str =~ s/ /_/og;
     $str;
 }
@@ -306,16 +307,32 @@
     my $charset = $params{Charset} || 'ISO-8859-1';
     my $encoding = lc($params{Encoding} || 'q');
-    ### Encode any "words" with unsafe characters.
-    ### We limit such words to 18 characters, to guarantee that the
-    ### worst-case encoding give us no more than 54 + ~10 < 75 characters
-    my $word;
-    $rawstr =~ s{([a-zA-Z0-9\x7F-\xFF]{1,18})}{ ### get next "word"
-        $word = $1;
-        (($word !~ /[$NONPRINT]/o)
-         ? $word ### no unsafe chars
-         : encode_mimeword($word, $encoding, $charset)); ### has unsafe chars
-    }xeg;
+    if ($charset =~ /^iso-8859-\d+$/i || lc($charset) eq 'us-ascii') {
+        ### Encode any "words" with unsafe characters.
+        ### We limit such words to 18 characters, to guarantee that the
+        ### worst-case encoding give us no more than 54 + ~10 < 75 characters
+        my $word;
+        $rawstr =~ s{([a-zA-Z0-9\x7F-\xFF]{1,18})}{ ### get next "word"
+            $word = $1;
+            (($word !~ /[$NONPRINT]/o)
+             ? $word ### no unsafe chars
+             : encode_mimeword($word, $encoding, $charset)); ### has unsafe chars
+        }xeg;
+    } else {
+        ### Encode "words" which contains multibyte characters.
+        use Encode;
+        my $unistr = Encode::decode($charset, $rawstr);
+        my $word;
+        $unistr =~ s{([\x00-\xFF]{1,18}|[^\x00-\xFF]{1,18})}{ ### get next "word"
+            $word = $1;
+            (($word =~ /^[\x00-\xFF]+$/o) ### is in 1 byte?
+             ? Encode::encode('iso-8859-1', $word) ### decode single-byte chars
+             : encode_mimeword(Encode::encode($charset, $word),
+                               $encoding, $charset)); ### has unsafe or multibyte char
+        }xeg;
+        $rawstr = Encode::encode("iso-8859-1", $unistr);
+    }
+
     $rawstr;
 }

Mon Jun 11 02:45:17 2012 yuvallb [...] gmail.com - Correspondence added

From:

yuvallb [...] gmail.com

This is how I fixed it:

*** Words.original 2012-06-10 17:01:49.000000000 +0300
--- Words.pm 2012-06-10 17:02:16.000000000 +0300
***************
*** 301,307 ****
      ### worst-case encoding give us no more than 54 + ~10 < 75 characters
      my $word;
      local $1;
!     $rawstr =~ s{([ a-zA-Z0-9\x7F-\xFF]{1,18})}{ ### get next "word"
          $word = $1;
          (($word !~ /(?:[$NONPRINT])|(?:^\s+$)/o)
           ? $word ### no unsafe chars
--- 301,307 ----
      ### worst-case encoding give us no more than 54 + ~10 < 75 characters
      my $word;
      local $1;
!     $rawstr =~ s{([ a-zA-Z0-9\x7F-\xFF]{1,18})(?![\x80-\xBF])}{ ### get next "word"
          $word = $1;
          (($word !~ /(?:[$NONPRINT])|(?:^\s+$)/o)
           ? $word ### no unsafe chars
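The effect of the added `(?![\x80-\xBF])` lookahead can be checked in isolation. The sketch below assumes the input is UTF-8, where bytes 0x80-0xBF are continuation bytes, so a chunk may not end just before one. It applies the regex with and without the lookahead to the same byte string; the sample text and the helper decodes_cleanly are assumptions made for illustration, and MIME::Words itself is not called.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample text (an assumption for illustration): one ASCII letter followed by
# twenty copies of U+65E5, each of which takes 3 bytes in UTF-8.
my $bytes = encode('UTF-8', "A" . ("\x{65E5}" x 20));

# Hypothetical helper: returns 1 if the octets are complete, valid UTF-8.
sub decodes_cleanly {
    my ($octets) = @_;
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Word regex before the change: chunks of up to 18 bytes.
my @plain = $bytes =~ /([ a-zA-Z0-9\x7F-\xFF]{1,18})/g;

# Word regex after the change: a chunk may not be followed by a
# UTF-8 continuation byte, so the match backtracks to a character boundary.
my @guarded = $bytes =~ /([ a-zA-Z0-9\x7F-\xFF]{1,18})(?![\x80-\xBF])/g;

my $bad_plain   = grep { !decodes_cleanly($_) } @plain;
my $bad_guarded = grep { !decodes_cleanly($_) } @guarded;

printf "without lookahead: %d of %d chunks are broken\n",
    $bad_plain, scalar @plain;
printf "with lookahead:    %d of %d chunks are broken\n",
    $bad_guarded, scalar @guarded;
```

The lookahead relies on the UTF-8 byte layout. Other multibyte charsets such as EUC-JP use trail bytes outside 0x80-0xBF, so this check alone should not be expected to keep their characters intact.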

Mon Jun 11 02:45:18 2012 The RT System itself - Status changed from 'new' to 'open'