Bug #41661 for Mail-Box: UNKNOWN-encoded subjects are mis-study()-ed

Fri Dec 12 12:58:40 2008 fschlich [...] cis.fu-berlin.de - Ticket created

Subject:

UNKNOWN-encoded subjects are mis-study()-ed

If the subject is of =?UNKNOWN?Q? encoding, its contents are doubled when calling $msg->study('subject'). On the attached files:

$ ./showsubj.pl example.mbox
original subj is 'one =?UNKNOWN?Q?two?= three'
 studied subj is 'one one =?UNKNOWN?Q?two_?= threethree'
 decoded subj is 'one one =?UNKNOWN?Q?two_?= threethree'

whereas I'd expect the studied/decoded subject to be identical to the original subject (as printed by $msg->subject), or preferably be converted to 'one two three' (that is, not decoded, but encoding markers removed).

Subject:

example.mbox

application/octet-stream 2.3k

Subject:

showsubj.pl

#!/usr/bin/perl -w
use strict;
use diagnostics;
use Mail::Box::Manager;

my $mgr = Mail::Box::Manager->new;
my $new_file = $ARGV[0];
my $new_folder = $mgr->open(folder => $new_file, access => 'rw',
        extract => 'ALWAYS', cache_body => 'DELAY', cache_head => 'DELAY')
    or die "cannot open folder $new_file: $!\n";

my $msg = $new_folder->message(0);
print "original subj is '", $msg->subject, "'\n";

# but I want pure UTF-8, no ?iso-88..? etc left
my $subject = $msg->study('subject');
if ($subject) {   # msg might not have a Subject: line at all....
    print " studied subj is '$subject'\n";
    $subject = $subject->decodedBody();
    print " decoded subj is '$subject'\n";
} else {
    print "SUBJECT is FALSE (no Subject: line?)\n";
}

Fri Dec 12 15:42:16 2008 Mark [...] Overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #41661] UNKNOWN-encoded subjects are mis-study()-ed
Date:	Fri, 12 Dec 2008 21:41:59 +0100
To:	Florian via RT <bug-Mail-Box [...] rt.cpan.org>
From:	Mark Overmeer <mark [...] overmeer.net>

* Florian via RT (bug-Mail-Box@rt.cpan.org) [081212 17:58]:

> Fri Dec 12 12:58:40 2008: Request 41661 was acted upon.
> Transaction: Ticket created by fsfs
> Queue: Mail-Box
> Subject: UNKNOWN-encoded subjects are mis-study()-ed
> Broken in: 2.086
>
> if the subject is of =?UNKNOWN?Q? encoding, its contents are doubled
> when calling $msg->study('subject'). On the attached files:
>
> $ ./showsubj.pl example.mbox
> original subj is 'one =?UNKNOWN?Q?two?= three'
>  studied subj is 'one one =?UNKNOWN?Q?two_?= threethree'
>  decoded subj is 'one one =?UNKNOWN?Q?two_?= threethree'

I am very pleased with your continued report of the problems, and the very helpful examples you provide!

The fix for this is in Mail/Message/Field/Full.pm method decode()

    $encoded =~ s/(\=\?([^?\s]*)\?([^?\s]*)\?([^?\s]*)\?\=\s*)/
        _decoder($2,$3,$4,$1)/gse;

The fourth parameter of _decoder was the whole line, not just the part to be decoded.
-- 
MarkOv

------------------------------------------------------------------------
       Mark Overmeer MSc                                MARKOV Solutions
       Mark@Overmeer.net                          solutions@overmeer.net
http://Mark.Overmeer.net                   http://solutions.overmeer.net
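Editor's sketch of the capture-group mix-up described above. It is not the Mail::Box source: fallback_decoder is a hypothetical stand-in that never decodes and only hands back its fourth argument, which is the situation that arises for an unknown charset. The sample input carries the '_' seen in the reported output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for _decoder(): it never decodes, it only
# returns the fallback text passed in as the fourth argument.
sub fallback_decoder {
    my ($charset, $encoding, $encoded, $fallback) = @_;
    return $fallback;
}

# Same pattern as quoted in the ticket: $1 is the whole encoded word
# (including trailing blanks), $2..$4 are charset, encoding, payload.
my $word = qr/(\=\?([^?\s]*)\?([^?\s]*)\?([^?\s]*)\?\=\s*)/;

# Buggy variant: the fallback is the complete header line.
sub decode_buggy {
    my ($line) = @_;
    (my $out = $line) =~ s/$word/fallback_decoder($2, $3, $4, $line)/gse;
    return $out;
}

# Fixed variant: the fallback is only the matched encoded word ($1).
sub decode_fixed {
    my ($line) = @_;
    (my $out = $line) =~ s/$word/fallback_decoder($2, $3, $4, $1)/gse;
    return $out;
}

my $subject = 'one =?UNKNOWN?Q?two_?= three';
print "buggy: '", decode_buggy($subject), "'\n";
print "fixed: '", decode_fixed($subject), "'\n";
```

With the whole line as fallback, the encoded word is replaced by a second copy of the line, which reproduces the doubled subject from the report. With $1 the word is simply left in place.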

Fri Dec 12 15:42:17 2008 The RT System itself - Status changed from 'new' to 'open'

Wed Dec 17 06:18:34 2008 fschlich [...] cis.fu-berlin.de - Correspondence added

Subject:	Re: [rt.cpan.org #41661] UNKNOWN-encoded subjects are mis-study()-ed
Date:	Wed, 17 Dec 2008 12:18:09 +0100
To:	Mark Overmeer via RT <bug-Mail-Box [...] rt.cpan.org>
From:	Florian Schlichting <fschlich [...] CIS.FU-Berlin.DE>

Hi, thanks for your quick reply;

> > $ ./showsubj.pl example.mbox
> > original subj is 'one =?UNKNOWN?Q?two?= three'
> >  studied subj is 'one one =?UNKNOWN?Q?two_?= threethree'
> >  decoded subj is 'one one =?UNKNOWN?Q?two_?= threethree'

> The fix for this is in Mail/Message/Field/Full.pm method decode()
> $encoded =~ s/(\=\?([^?\s]*)\?([^?\s]*)\?([^?\s]*)\?\=\s*)/
> _decoder($2,$3,$4,$1)/gse;
>
> The fourth parameter of _decoder was the whole line, not just the
> part to be decoded.

a lot better; but this results in:

$ ./showsubj.pl example.mbox
original subj is 'one =?UNKNOWN?Q?two?= three'
 studied subj is 'one =?UNKNOWN?Q?two_?= three'
 decoded subj is 'one =?UNKNOWN?Q?two_?= three'

note the extra _ after "two" -- I think that's usually removed by Encode, but of course not in the case where $whole is returned.

Also, in the case where no valid encoding can be found and thus no real decoding can happen, I'd prefer the encoding markers to be removed instead of kept. (The result is not perfect, but in the case of Western languages where most of the letters are ASCII, it becomes almost readable.) That is, _decoder could return $encoded instead of $whole, but with the _ turned into a blank:

$ ./showsubj.pl example.mbox
original subj is 'one =?UNKNOWN?Q?two?= three'
 studied subj is 'one two three'
 decoded subj is 'one two three'

Do you think that's possible? I don't really understand the intricacies of MIME encodings, otherwise I'd suggest to simply remove the fourth parameter to _decoder and define 'my $whole = $encoded; $whole =~ s/_/ /gs;'

Florian
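Editor's sketch of the behaviour requested above: drop the encoding markers and show '_' as a blank, without decoding anything. strip_markers and payload are hypothetical names, not Mail::Box functions, and for brevity the sketch strips every encoded word; a real implementation would presumably do this only when the charset is unknown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the raw payload of an encoded word with
# '_' turned into a blank. Sequences such as =E4 are left untouched.
sub payload {
    my ($text) = @_;
    $text =~ tr/_/ /;
    return $text;
}

# Hypothetical helper: replace each =?charset?encoding?payload?= word
# (plus trailing blanks) by its cleaned-up payload.
sub strip_markers {
    my ($line) = @_;
    $line =~ s/\=\?[^?\s]*\?[^?\s]*\?([^?\s]*)\?\=\s*/payload($1)/gse;
    return $line;
}

print "'", strip_markers('one =?UNKNOWN?Q?two_?= three'), "'\n";
```

The input is the intermediate form that already contains the extra '_', so the blank between 'two' and 'three' survives once the trailing whitespace of the encoded word is consumed.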

Wed Dec 17 06:29:45 2008 Mark [...] Overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #41661] UNKNOWN-encoded subjects are mis-study()-ed
Date:	Wed, 17 Dec 2008 12:29:28 +0100
To:	Florian via RT <bug-Mail-Box [...] rt.cpan.org>
From:	Mark Overmeer <mark [...] overmeer.net>

* Florian via RT (bug-Mail-Box@rt.cpan.org) [081217 11:18]:

> Queue: Mail-Box
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=41661 >
>
> $ ./showsubj.pl example.mbox
> original subj is 'one =?UNKNOWN?Q?two?= three'
>  studied subj is 'one =?UNKNOWN?Q?two_?= three'
>  decoded subj is 'one =?UNKNOWN?Q?two_?= three'

The "_" fix is on a different line. Already fixed in my prepared release, added just above the changed line: change it into a blank.

> Also, in the case where no valid encoding can be found and thus no real
> decoding can happen, I'd prefer the encoding markers to be removed
> instead of kept. (The result is not perfect, but in the case of Western
> languages where most of the letters are ASCII, it becomes almost
> readable.)

No... it is a security hazard: I could decode in something binary. By far most western charsets are already defined, so never UNKNOWN. Some weird codesets will not map nicely on ASCII.

I have submitted a bug report to Encode, to add all missing IANA charsets, but no answer yet. There is quite a lot missing.
-- 
Regards,
MarkOv

Thu Dec 18 08:04:27 2008 fschlich [...] cis.fu-berlin.de - Correspondence added

Subject:	Re: [rt.cpan.org #41661] UNKNOWN-encoded subjects are mis-study()-ed
Date:	Thu, 18 Dec 2008 14:04:11 +0100
To:	Mark Overmeer via RT <bug-Mail-Box [...] rt.cpan.org>
From:	Florian Schlichting <fschlich [...] CIS.FU-Berlin.DE>

On Wed, Dec 17, 2008 at 06:29:45AM -0500, Mark Overmeer via RT wrote:

> <URL: http://rt.cpan.org/Ticket/Display.html?id=41661 >
>
> The "_" fix is on a different line. Already fixed in my prepared release,
> added just above the changed line: change it into a blank.

good; I'll test it one last time with that release

> > Also, in the case where no valid encoding can be found and thus no real
> > decoding can happen, I'd prefer the encoding markers to be removed
> > instead of kept. (The result is not perfect, but in the case of Western
> > languages where most of the letters are ASCII, it becomes almost
> > readable.)
>
> No... it is a security hazard: I could decode in something binary.

I don't understand: when removing the strings =?UNKNOWN?Q? and ?= (and perhaps a few _), the remainder wouldn't become something binary but remains ASCII? I meant that "n=E4chster" is more readable than "=?UNKNOWN?Q?n=E4chster?="

> By far most western charsets are already defined, so never UNKNOWN.
> Some weird codesets will not map nicely on ASCII.

I was thinking that it's too late for entirely non-ASCII scripts anyway, but a little bit might still be gained for the almost-ASCII ones... But I'm not going to insist on this point, as it's rather minor, and you could as well argue that a package such as yours shouldn't

thanks for your continuing support and work on this package

Florian

Thu Dec 18 08:15:30 2008 solutions [...] overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #41661] UNKNOWN-encoded subjects are mis-study()-ed
Date:	Thu, 18 Dec 2008 14:15:07 +0100
To:	Florian via RT <bug-Mail-Box [...] rt.cpan.org>
From:	Mark Overmeer <solutions [...] overmeer.net>

* Florian via RT (bug-Mail-Box@rt.cpan.org) [081218 13:04]:

> Queue: Mail-Box
> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=41661 >
>
> I don't understand: when removing the strings =?UNKNOWN?Q? and ?= (and
> perhaps a few _), the remainder wouldn't become something binary but
> remains ASCII? I meant that "n=E4chster" is more readable than
> "=?UNKNOWN?Q?n=E4chster?="

I feel weak today: you convinced me ;-) Sorry that it took so long. Too much on my mind.
-- 
Regards,
MarkOv

Tue Feb 03 06:44:08 2009 MARKOV [...] cpan.org - Correspondence added

fixed in 2.087

Tue Feb 03 06:44:08 2009 MARKOV [...] cpan.org - Status changed from 'open' to 'resolved'

Mon Jun 08 03:53:58 2009 fschlich [...] cis.fu-berlin.de - Broken in 2.090 added

Mon Jun 08 03:53:59 2009 fschlich [...] cis.fu-berlin.de - Broken in 2.086 deleted

Mon Jun 08 03:53:59 2009 fschlich [...] cis.fu-berlin.de - Correspondence added

Sorry for leaving this untouched for such a long time, but there's one minor thing remaining, I just noticed:

When running the originally attached script/mbox with current Mail::Box 2.090 I get:

$ perl showsubj.pl example.mbox
original subj is 'one =?UNKNOWN?Q?two?= three'
 studied subj is 'one two_three'
 decoded subj is 'one two_three'
                        ^^^

What I think shouldn't be there is the underline ('_') character between 'two' and 'three', where the original had a space. This is only an issue for 'UNKNOWN' encodings, and not fixed by using up-to-date Encode 2.33

thanks

Mon Jun 08 03:53:59 2009 The RT System itself - Status changed from 'resolved' to 'open'

Mon Jun 08 05:23:01 2009 Mark [...] Overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #41661] UNKNOWN-encoded subjects are mis-study()-ed
Date:	Mon, 8 Jun 2009 11:22:42 +0200
To:	Florian via RT <bug-Mail-Box [...] rt.cpan.org>
From:	Mark Overmeer <mark [...] overmeer.net>

* Florian via RT (bug-Mail-Box@rt.cpan.org) [090608 07:54]:

> Queue: Mail-Box
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=41661 >
>
> $ perl showsubj.pl example.mbox
> original subj is 'one =?UNKNOWN?Q?two?= three'
>  studied subj is 'one two_three'
>  decoded subj is 'one two_three'
>                         ^^^
>
> What I think shouldn't be there is the underline ('_') character between
> 'two' and 'three', where the original had a space. This is only an issue
> for 'UNKNOWN' encodings, and not fixed by using up-to-date Encode 2.33

Should we declare that a bug in Encode?

The decoding (the understanding whether it is a known charset) is left to Encode. Apparently, when it does understand the charset, then it decodes and replaces the '_'. When it does not understand the charset, it forgets to translate the '_'. Am I right?
-- 
Regards,
MarkOv
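Editor's note on the question above, as a small sketch rather than a statement about Mail::Box internals: Encode::find_encoding() only resolves a charset label to an encoding object and returns undef when it cannot. The '_' for blank convention belongs to the Q encoding of RFC 2047, so it is independent of whether the charset is known. charset_known is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: report whether Encode can resolve a charset label.
sub charset_known {
    my ($charset) = @_;
    return defined(Encode::find_encoding($charset)) ? 1 : 0;
}

for my $charset (qw/iso-8859-1 utf-8 UNKNOWN/) {
    printf "%-10s %s\n", $charset,
        charset_known($charset) ? 'known to Encode' : 'not known to Encode';
}
```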

Mon Jun 08 07:52:25 2009 fschlich [...] cis.fu-berlin.de - Correspondence added

CC:	fschlich [...] cis.fu-berlin.de
Subject:	Re: [rt.cpan.org #41661] UNKNOWN-encoded subjects are mis-study()-ed
Date:	Mon, 8 Jun 2009 13:52:01 +0200
To:	Mark Overmeer via RT <bug-Mail-Box [...] rt.cpan.org>
From:	Florian Schlichting <fschlich [...] CIS.FU-Berlin.DE>

On Mon, Jun 08, 2009 at 05:23:02AM -0400, Mark Overmeer via RT wrote:

> <URL: http://rt.cpan.org/Ticket/Display.html?id=41661 >
>
> * Florian via RT (bug-Mail-Box@rt.cpan.org) [090608 07:54]:
> > Queue: Mail-Box
> > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=41661 >
> >
> > $ perl showsubj.pl example.mbox
> > original subj is 'one =?UNKNOWN?Q?two?= three'
> >  studied subj is 'one two_three'
> >  decoded subj is 'one two_three'
> >                         ^^^
> >
> > What I think shouldn't be there is the underline ('_') character between
> > 'two' and 'three', where the original had a space. This is only an issue
> > for 'UNKNOWN' encodings, and not fixed by using up-to-date Encode 2.33
>
> Should we declare that a bug in Encode?
> The decoding (the understanding whether it is a known charset) is left
> to Encode. Apparently, when it does understand the charset, then it
> decodes and replaces the '_'. When it does not understand the charset,
> it forgets to translate the '_'. Am I right?

I was under the impression that in cases where Encode does not understand the charset, it wouldn't touch the field at all, and that it's a feature of Mail::Box that you were so kind to add after my pestering earlier in this ticket to simply strip the decoding marks in such cases. The bug would then be that decode() in Mail::Message::Field::Full adds a '_' as a kind of workaround for a problem in Encode (or perhaps a workaround for a problem in the regex in decode()?), which doesn't get removed if _decoder just returns the original $encoded.

perhaps _decoder should, if Encode::find_encoding($charset) returns false, not just return $encoded, but first do a s/_/ /g on $encoded, for example as the first step in the following if-block? Like this:

--- Mail-Box-2.090/lib/Mail/Message/Field/Full.pm 2009-06-02 11:56:45.000000000 +0200
+++ perl/share/perl/5.8.8/Mail/Message/Field/Full.pm 2009-06-08 13:49:11.265069004 +0200
@@ -270,11 +270,14 @@
 sub _decoder($$$)
 {   my ($charset, $encoding, $encoded) = @_;
     $charset   =~ s/\*[^*]+$//;   # language component not used
-    my $to_utf8 = Encode::find_encoding($charset || 'us-ascii')
-        or return $encoded;
+    my $to_utf8 = Encode::find_encoding($charset || 'us-ascii');
 
     my $decoded;
-    if($encoding !~ /\S/)
+    if(!$to_utf8)
+    {   $encoded =~ s/_/ /g;
+        return $encoded;
+    }
+    elsif($encoding !~ /\S/)
     {   $decoded = $encoded;
     }
     elsif(lc($encoding) eq 'q')

and then there's also the last (else) block of that if cascade...

(NB: in decode(), it looks like you want to say 'if($is_text)' instead of 'if(defined...', for why else would you define $is_text in the previous line?)

@@ -298,7 +301,7 @@
 sub decode($@)
 {   my ($self, $encoded, %args) = @_;
     my $is_text = defined $args{is_text} ? $args{is_text} : 1;
-    if(defined $args{is_text} ? $args{is_text} : 1)
+    if($is_text)
     {  # in text, blanks between encoding must be removed, but otherwise kept :(
        # little trick to get this done: add an explicit blank.
        $encoded =~ s/\?\=\s(?!\s*\=\?|$)/_?= /gs;

best,
Florian
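Editor's sketch of how the proposed change plays out end to end. decode_word and decode_subject are hypothetical, simplified stand-ins for _decoder() and decode(); only the two substitutions and the find_encoding() check are taken from the ticket. The Q and B branches, built on the core modules MIME::QuotedPrint and MIME::Base64, are an assumption about a reasonable implementation and not the Mail::Box code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();
use MIME::QuotedPrint ();
use MIME::Base64 ();

# Hypothetical stand-in for _decoder() with the proposed change applied.
sub decode_word {
    my ($charset, $encoding, $encoded) = @_;
    $charset =~ s/\*[^*]+$//;    # language component not used

    my $to_utf8 = Encode::find_encoding($charset || 'us-ascii');
    if (!$to_utf8) {
        # Unknown charset: keep the payload, but show '_' as a blank.
        $encoded =~ s/_/ /g;
        return $encoded;
    }

    my $bytes;
    if (lc($encoding) eq 'q') {
        (my $qp = $encoded) =~ s/_/ /g;    # Q encoding writes blanks as '_'
        $bytes = MIME::QuotedPrint::decode_qp($qp);
    }
    elsif (lc($encoding) eq 'b') {
        $bytes = MIME::Base64::decode_base64($encoded);
    }
    else {
        $bytes = $encoded;                 # unrecognised encoding: pass through
    }
    return $to_utf8->decode($bytes);
}

# Simplified version of the two substitutions quoted from decode().
sub decode_subject {
    my ($line) = @_;
    # Trick from the ticket: mark a blank that follows an encoded word
    # and precedes plain text, so that it is not swallowed below.
    $line =~ s/\?\=\s(?!\s*\=\?|$)/_?= /gs;
    $line =~ s/\=\?([^?\s]*)\?([^?\s]*)\?([^?\s]*)\?\=\s*/decode_word($1, $2, $3)/gse;
    return $line;
}

print "'", decode_subject('one =?UNKNOWN?Q?two?= three'), "'\n";
```

Without the early return that rewrites '_', the unknown-charset case would hand back 'two_' unchanged, which is the 'one two_three' output reported for 2.090.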

Mon Jun 08 08:00:58 2009 Mark [...] Overmeer.net - Correspondence added

Subject:	Re: [rt.cpan.org #41661] UNKNOWN-encoded subjects are mis-study()-ed
Date:	Mon, 8 Jun 2009 14:00:34 +0200
To:	Florian via RT <bug-Mail-Box [...] rt.cpan.org>
From:	Mark Overmeer <mark [...] overmeer.net>

* Florian via RT (bug-Mail-Box@rt.cpan.org) [090608 11:52]:

> perhaps _decoder should, if Encode::find_encoding($charset) returns
> false, not just return $encoded, but first do a s/_/ /g on $encoded, for
> example as the first step in the following if-block? Like this:

Yes, simple solution. Both changes included.
-- 
Thanks,
MarkOv

Sun Sep 06 18:03:44 2009 MARKOV [...] cpan.org - Correspondence added

just released 2.091 fixing this

Sun Sep 06 18:03:45 2009 MARKOV [...] cpan.org - Status changed from 'open' to 'resolved'