Bug #57945 for CGI: Charset is only added to text/* types

Fri May 28 06:07:40 2010 http://www.gerv.net/ - Ticket created

Subject:

Charset is only added to text/* types

When you set a charset in CGI.pm, e.g. using $self->charset(), it is only appended to the end of the Content-Type header if the type is a text/* type.

Test program attached. Output:

    Status: 200
    Content-Type: application/json

Content types which are not text/* but which benefit from a charset include:

    application/xml
    application/json
    image/svg+xml

Failure to add a charset means that clients may interpret the data using the HTTP standard fallback charset of ISO-8859-1. Given that much data today is in UTF-8, this could lead to data loss.

Related bug: https://bugzilla.mozilla.org/show_bug.cgi?id=568503

perl -v: This is perl, v5.10.1 (*) built for i486-linux-gnu-thread-multi
uname -a: Linux kitten 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux

Gerv

Subject:

test.pl

#!/usr/bin/perl -w
use CGI;

my $cgi = CGI->new;
$cgi->charset('UTF-8');
print $cgi->header(-type => 'application/json', -status => '200');

Fri May 28 06:09:15 2010 http://www.gerv.net/ - Correspondence added

    print $cgi->header(-type => 'application/json', -status => '200', -charset => 'UTF-8');

The above form does work.

Gerv

Fri May 28 06:09:16 2010 The RT System itself - Status changed from 'new' to 'open'

Fri May 28 08:37:50 2010 mark [...] summersault.com - Correspondence added

Subject:	Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date:	Fri, 28 May 2010 08:37:40 -0400
To:	bug-CGI.pm [...] rt.cpan.org
From:	Mark Stosberg <mark [...] summersault.com>

> When you set a charset in CGI.pm, e.g. using $self->charset(), it is
> only appended to the end of the Content-Type header if the type is a
> text/* type.
>
> Test program attached. Output:
>
> Status: 200
> Content-Type: application/json
>
> Content types which are not text/* but which benefit from a charset include:
> application/xml
> application/json
> image/svg+xml
>
> Failure to add a charset means that clients may interpret the data using
> the HTTP standard fallback charset of ISO-8859-1. Given that much data
> today is in UTF-8, this could lead to data loss.

Thanks for the report, Gerv.

Could you review the related RFCs and tell us what they say about when charsets should or should not be added to Content-Type headers?

Mark

Tue Jun 08 08:20:21 2010 http://www.gerv.net/ - Correspondence added

On Fri May 28 08:37:50 2010, mark@summersault.com wrote:

> Thanks for the report, Gerv.
>
> Could you review the related RFCs and tell us what they say about
> when charsets should or should not be added to Content-Type
> headers?

Hmm. Sorry for the delay. I don't get email on changes (having used OpenID) and I can't see a "profile" or similar link in RT where I can add an email address...

Here are some bits of relevant text:

RFC 2616 (HTTP), section 3.4.1:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html

"3.4.1 Missing Charset

Some HTTP/1.0 software has interpreted a Content-Type header without charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient.

Unfortunately, some older HTTP/1.0 clients did not deal properly with an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See section 3.7.1."

Section 3.7.1 states, in part:

"The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value."

RFC 4627 (JSON) says:
http://www.ietf.org/rfc/rfc4627.txt

"JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."

It then talks about sniffing the stream to be able to tell if other encodings have, in fact, been used. Ick.

So one could argue:

"Types other than text types are supposed to have a canonical representation defined when they are registered. For JSON, the canonical representation is "Unicode", and the defined method of determining the character set is by sniffing the first four bytes of the stream."

Or one could argue:

"My word, that's horrible. Add a charset parameter, for the love of Pete!"

As for XML, the situation is more clear. RFC 3023 says:
http://www.rfc-editor.org/rfc/rfc3023.txt

"3.2 Application/xml Registration

MIME media type name: application
MIME subtype name: xml
Mandatory parameters: none
Optional parameters: charset

Although listed as an optional parameter, the use of the charset parameter is STRONGLY RECOMMENDED, ..."

So that's at least one non-text type where a charset should be added. In general, I suspect most XML MIME types (and there are _a_lot_) have this attitude to charsets. People need to be able to set this parameter.

Do you need more from me here?

Gerv
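
For readers unfamiliar with the "sniffing the first four bytes" approach mentioned in the message above, here is a minimal Perl sketch of the zero-byte pattern check described in RFC 4627 section 3. The function name `sniff_json_encoding` is made up for illustration, and falling back to UTF-8 for inputs shorter than four octets is an assumption of this sketch, not something the RFC spells out.

```perl
use strict;
use warnings;

# Guess the Unicode encoding of a JSON text from the zero-byte pattern of
# its first four octets (RFC 4627, section 3). This works because the first
# two characters of an RFC 4627 JSON text are always ASCII.
# sniff_json_encoding is a hypothetical name, not part of any library.
sub sniff_json_encoding {
    my ($octets) = @_;

    # Assumption of this sketch: input too short to sniff is treated as UTF-8.
    return 'UTF-8' if length($octets) < 4;

    # Map each of the first four octets to '0' (NUL) or 'x' (non-NUL).
    my $pattern = join '', map { $_ == 0 ? '0' : 'x' } unpack('C4', $octets);

    my %encoding_for = (
        '000x' => 'UTF-32BE',
        '0x0x' => 'UTF-16BE',
        'x000' => 'UTF-32LE',
        'x0x0' => 'UTF-16LE',
    );

    # Any other pattern (in practice 'xxxx') is taken to be UTF-8.
    return $encoding_for{$pattern} // 'UTF-8';
}

print sniff_json_encoding('{"a":1}'), "\n";
print sniff_json_encoding("{\0\"\0a\0\"\0"), "\n";
```

An explicit charset parameter on the Content-Type header would spare the recipient this kind of guesswork, which is the point the message is making.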

Mon Jun 14 16:42:15 2010 mark [...] summersault.com - Correspondence added

Subject:	Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date:	Mon, 14 Jun 2010 16:42:05 -0400
To:	bug-CGI.pm [...] rt.cpan.org
From:	Mark Stosberg <mark [...] summersault.com>

> Do you need more from me here?

That's been very helpful, thanks.

Yanick, feel free to jump in here with comments or action.

Mark

Wed Jun 16 10:06:12 2010 yanick+cpan [...] babyl.dyndns.org - Correspondence added

On Mon Jun 14 16:42:15 2010, mark@summersault.com wrote:

> Yanick, feel free to jump in here with comments or action.

Aye, aye, sir. :-)

I'll look more closely at it when I have a few minutes to myself, but I'm thinking that the fix could be to change, at CGI.pm ~ line 1561,

    if (defined $charset) {
        $self->charset($charset);
    } else {
        $charset = $self->charset if $type =~ /^text\//;
    }

to

    if (defined $charset) {
        $self->charset($charset);
    } else {
        $charset = $self->charset;
    }

In other words, adopt the charset if it has been configured, no matter what the type is. The logic being that if somebody doesn't want the charset to be specified, he or she would not set it to any value in the first place.

Anyway, I'll try to come up with a patch and some tests for it soonishly.
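
To make the difference between the two snippets above concrete, here is a small standalone Perl sketch that models only the charset decision. It is not the real CGI.pm code: `content_type` and its `$text_only` flag are hypothetical, the configured charset is passed in as a plain argument instead of being read from an object, and treating an empty charset as "send no parameter" is an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical model of the charset decision discussed above; NOT CGI.pm itself.
#   $type       - media type passed as -type
#   $explicit   - the -charset argument, or undef if none was given
#   $configured - value previously set through charset(), or undef
#   $text_only  - true for the existing behaviour, false for the proposed one
sub content_type {
    my ($type, $explicit, $configured, $text_only) = @_;

    my $charset = $explicit;
    if (!defined $charset) {
        # Existing: fall back to the configured charset only for text/* types.
        # Proposed: fall back to it for every type.
        $charset = $configured if !$text_only || $type =~ m{^text/};
    }

    # Assumption of this sketch: an undefined or empty charset adds nothing.
    return $type unless defined $charset && $charset ne '';
    return "$type; charset=$charset";
}

# Existing behaviour: the configured charset is dropped for application/json.
print content_type('application/json', undef, 'UTF-8', 1), "\n";

# Proposed behaviour: the configured charset is used whatever the type.
print content_type('application/json', undef, 'UTF-8', 0), "\n";
```

In both variants an explicit -charset argument takes precedence, which is why the form with -charset => 'UTF-8' reported earlier in the ticket already works.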

Wed Jun 16 10:12:53 2010 mark [...] summersault.com - Correspondence added

Subject:	Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date:	Wed, 16 Jun 2010 10:12:43 -0400
To:	bug-CGI.pm [...] rt.cpan.org, lstein [...] cshl.org
From:	Mark Stosberg <mark [...] summersault.com>

> I'm thinking that the fix could be to change, at CGI.pm ~ line 1561,
>
>     if (defined $charset) {
>         $self->charset($charset);
>     } else {
>         $charset = $self->charset if $type =~ /^text\//;
>     }
>
> to
>
>     if (defined $charset) {
>         $self->charset($charset);
>     } else {
>         $charset = $self->charset;
>     }
>
> In other words, adopt the charset if it has been configured, no matter
> what the type is. The logic being that if somebody doesn't want the
> charset to be specified, he or she would not set it to any value in the
> first place.
>
> Anyway, I'll try to come up with a patch and some tests for it soonishly.

Thanks. That seems reasonable to me.

Lincoln, would you like to review this change request and weigh in?

Mark

Thu Jun 17 20:29:56 2010 yanick [...] babyl.dyndns.org - Correspondence added

Subject:	Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date:	Thu, 17 Jun 2010 20:30:07 -0400
To:	bug-CGI.pm [...] rt.cpan.org
From:	Yanick Champoux <yanick [...] babyl.dyndns.org>

Changes committed to http://github.com/yanick/CGI.pm/tree/charset

The only thing I'm afraid of is apps that might freak upon receiving charsets for some mimetypes. Reading the HTTP specs at http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html, we have (as was quoted by Gerv)

=head1 SPECS

Some HTTP/1.0 software has interpreted a Content-Type header without charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient.

Unfortunately, some older HTTP/1.0 clients did not deal properly with an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See section 3.7.1.

=cut

The first paragraph seems to advocate the explicit mention of the charset (yay!), but the second one warns that some clients can indeed not like it. Now, new code can easily circumvent that problem by calling

    header( -charset => '' );

but old code will suddenly behave differently than it did before. Does anyone think that this could be an issue for someone out there?

`/anick

Fri Jun 18 09:27:37 2010 mark [...] summersault.com - Correspondence added

Subject:	Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date:	Fri, 18 Jun 2010 09:27:27 -0400
To:	bug-CGI.pm [...] rt.cpan.org
From:	Mark Stosberg <mark [...] summersault.com>

> but old code will suddenly behave differently than it did before. Does
> anyone think that this could be an issue for someone out there?

Thanks for the review, Yanick.

To answer your question, I'll say: "Yes."

Generally, I like the philosophy of leaving CGI.pm alone unless there is clearly a bug. Any change potentially affects thousands of users. For things that are enhancements or maybe-bugs, I think it's reasonable to require that at least a couple of independent parties feel strongly that the behavior should be changed.

In this case, it seems like there is a reasonable workaround: an explicit charset can be set if you don't like the default.

Mark

Thu Jun 24 20:14:56 2010 yanick [...] babyl.dyndns.org - Correspondence added

Subject:	Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date:	Thu, 24 Jun 2010 20:15:37 -0400
To:	bug-CGI.pm [...] rt.cpan.org
From:	Yanick Champoux <yanick [...] babyl.dyndns.org>

On June 18, 2010 09:27:38 am mark@summersault.com via RT wrote:

> > but old code will suddenly behave differently than it did before. Does
> > anyone think that this could be an issue for someone out there?
>
> To answer your question, I'll say: "Yes."
>
> Generally, I like the philosophy of leaving CGI.pm alone unless there
> is clearly a bug. Any change potentially affects thousands of users.
> For things that are enhancements or maybe-bugs, I think it's reasonable
> that at least a couple independent parties feel strongly that the
> behavior should be changed.

Sounds very reasonable, very prudent. I wholeheartedly agree.

Joy,
`/anick

Fri Jun 25 12:25:35 2010 MARKSTOS [...] cpan.org - Severity Normal changed to Unimportant

Fri Jun 25 12:28:06 2010 MARKSTOS [...] cpan.org - Correspondence added

Gerv,

So the summary for now is that we don't plan to modify CGI.pm immediately for this. If more of a consensus arises that this should be changed, and doing so would be more spec-compliant or real-world-compliant, we'll consider revisiting it.

We'll leave the ticket open so that others might find it, and you can also promote the change to others to gather support for it if you'd like.

Thanks again for the feedback on CGI.pm.

Mark

Tue Jul 20 00:07:33 2010 http://www.gerv.net/ - Correspondence added

Mark,

I think you have slightly misunderstood this bug. This bug is not about sending a charset when one is not set, it's about the fact that I want to send a charset and I can't!

If Perl code (like mine) explicitly calls:

    $self->charset("UTF-8")

then I want CGI.pm to do what I've explicitly said and send the charset=UTF-8 parameter on the content type.

Is that really a change that will break backwards compatibility? The only old code which will behave differently is code which sets a charset but didn't notice it wasn't actually being sent, and is actually talking to a very (very very) old HTTP/1.0 client which breaks when one _is_ sent. In other words, code which actually finally gets what it asks for, but turns out not to want it.

If no charset is specified, keep not sending one. I'm fine with that.

Can someone ping me with future updates? Having logged in with an OpenID, I don't get mail. gerv@gerv.net.

Thanks :-)

Gerv

Sat Nov 20 19:25:44 2010 MARKSTOS [...] cpan.org - Correspondence added

RT-Send-CC:

gerv [...] gerv.net

On Tue Jul 20 00:07:33 2010, http://www.gerv.net/ wrote:

> Mark,
>
> I think you have slightly misunderstood this bug. This bug is not about
> sending a charset when one is not set, it's about the fact that I want
> to send a charset and I can't!

I see. In that case, Yanick's proposed fix is the same thing that I would recommend.

I've added some more tests for charset() now and merged and pushed Yanick's work. It should be in our next release.

Mark

Sat Nov 20 19:25:46 2010 MARKSTOS [...] cpan.org - Status changed from 'open' to 'patched'

Sun Jan 23 20:45:35 2011 MARKSTOS [...] cpan.org - Correspondence added

Subject:

patch released for CGI.pm

Thanks for the bug report. A patch for it appeared in 3.51, if not sooner. Resolving. Mark

Sun Jan 23 20:45:36 2011 The RT System itself - Status changed from 'patched' to 'open'

Sun Jan 23 20:45:37 2011 MARKSTOS [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Aug 15 22:38:38 2012 MARKSTOS [...] cpan.org - Reference by ticket #67100 added

Fri May 23 14:29:35 2014 The RT System itself - Queue changed from CGI.pm to CGI
