This queue is for tickets about the CGI CPAN distribution.

Report information
The Basics
Id: 57945
Status: resolved
Priority: 0/
Queue: CGI

People
Owner: Nobody in particular
Requestors: gerv-cpan [...] gerv.net
Cc:
AdminCc:

Bug Information
Severity: Unimportant
Broken in: (no value)
Fixed in: (no value)



Subject: Charset is only added to text/* types
When you set a charset in CGI.pm, e.g. using $self->charset(), it is only appended to the end of the Content-Type header if the type is a text/* type.

Test program attached. Output:

Status: 200
Content-Type: application/json

Content Types which are not text/* but which benefit from a charset include:
application/xml
application/json
image/svg+xml

Failure to add a charset means that clients may interpret the data using the HTTP standard fallback charset of ISO-8859-1. Given that much data today is in UTF-8, this could lead to data loss.

Related bug: https://bugzilla.mozilla.org/show_bug.cgi?id=568503

perl -v:
This is perl, v5.10.1 (*) built for i486-linux-gnu-thread-multi

uname -a:
Linux kitten 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux

Gerv
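As an illustration of the data-loss risk described in the report (a minimal sketch, not part of the original ticket), the following uses the core Encode module to show what a client sees when it decodes UTF-8 bytes with the ISO-8859-1 fallback. The sample string is an arbitrary assumption.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sample text with one non-ASCII character (e with acute accent, U+00E9).
my $text = "caf\x{e9}";

# What the server puts on the wire: the UTF-8 encoding of the text.
my $bytes = encode('UTF-8', $text);

# A client that is told "charset=UTF-8" recovers the original string.
my $labelled = decode('UTF-8', $bytes);

# A client that falls back to ISO-8859-1 treats every byte as one character,
# so the two-byte sequence for U+00E9 turns into two unrelated characters.
my $fallback = decode('ISO-8859-1', $bytes);

printf "bytes sent:            %d\n", length $bytes;
printf "decoded as UTF-8:      %d characters\n", length $labelled;
printf "decoded as ISO-8859-1: %d characters\n", length $fallback;
```

The fallback decoding does not fail; it silently produces a different string, which is why the mislabelling is easy to miss.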
Subject: test.pl
#!/usr/bin/perl -w

use CGI;

my $cgi = new CGI;
$cgi->charset('UTF-8');
print $cgi->header(-type => 'application/json', -status => '200');
print $cgi->header(-type => 'application/json', -status => '200', -charset => 'UTF-8');

The above form does work.

Gerv
Subject: Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date: Fri, 28 May 2010 08:37:40 -0400
To: bug-CGI.pm [...] rt.cpan.org
From: Mark Stosberg <mark [...] summersault.com>
> When you set a charset in CGI.pm, e.g. using $self->charset(), it is
> only appended to the end of the Content-Type header if the type is a
> text/* type.
>
> Test program attached. Output:
>
> Status: 200
> Content-Type: application/json
>
> Content Types which are not text/* but which benefit from a charset include:
> application/xml
> application/json
> image/svg+xml
>
> Failure to add a charset means that clients may interpret the data using
> the HTTP standard fallback charset of ISO-8859-1. Given that much data
> today is in UTF-8, this could lead to data loss.
Thanks for the report, Gerv.

Could you review the related RFCs for us, to see what they have to say about when charsets should or should not be added to Content-Type headers?

Mark
On Fri May 28 08:37:50 2010, mark@summersault.com wrote:
> Thanks for the report, Gerv.
>
> Could you review the related RFCs for us, to see what they have to
> say about when charsets should or should not be added to Content-Type
> headers?
Hmm. Sorry for the delay. I don't get email on changes (having used OpenID) and I can't see a "profile" or similar link in RT where I can add an email address...

Here are some bits of relevant text:

RFC 2616 (HTTP), section 3.4.1:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html

"3.4.1 Missing Charset

Some HTTP/1.0 software has interpreted a Content-Type header without charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient.

Unfortunately, some older HTTP/1.0 clients did not deal properly with an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See section 3.7.1."

Section 3.7.1 states, in part:

"The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value."

RFC 4627 (JSON) says:
http://www.ietf.org/rfc/rfc4627.txt

"JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."

It then talks about sniffing the stream to be able to tell if other encodings have, in fact, been used. Ick.

So one could argue: "Types other than text types are supposed to have a canonical representation defined when they are registered. For JSON, the canonical representation is "Unicode", and the defined method of determining the character set is by sniffing the first four bytes of the stream."

Or one could argue: "My word, that's horrible. Add a charset parameter, for the love of Pete!"

As for XML, the situation is more clear. RFC 3023 says:
http://www.rfc-editor.org/rfc/rfc3023.txt

"3.2 Application/xml Registration

MIME media type name: application
MIME subtype name: xml
Mandatory parameters: none
Optional parameters: charset

Although listed as an optional parameter, the use of the charset parameter is STRONGLY RECOMMENDED, ..."

So that's at least one non-text type where a charset should be added. In general, I suspect most XML MIME types (and there are _a_lot_) have this attitude to charsets. People need to be able to set this parameter.

Do you need more from me here?

Gerv
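For readers unfamiliar with the "sniffing" mentioned above, here is a rough sketch of the approach in RFC 4627, section 3: because the first two characters of a JSON text are always ASCII, the pattern of NUL bytes in the first four octets identifies the Unicode encoding. The helper name sniff_json_encoding and the choice to treat inputs shorter than four bytes as UTF-8 are assumptions made for this illustration; nothing like this exists in CGI.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the encoding of a JSON byte string from the
# NUL-byte pattern of its first four octets (table from RFC 4627, section 3).
sub sniff_json_encoding {
    my ($bytes) = @_;

    # Assumption for this sketch: inputs too short to sniff are called UTF-8.
    return 'UTF-8' if length($bytes) < 4;

    # Reduce each of the first four octets to "0" (NUL) or "x" (non-NUL).
    my $pattern = join '', map { $_ == 0 ? '0' : 'x' } unpack 'C4', $bytes;

    my %encoding_for = (
        '000x' => 'UTF-32BE',
        '0x0x' => 'UTF-16BE',
        'x000' => 'UTF-32LE',
        'x0x0' => 'UTF-16LE',
    );

    # Any other pattern (normally "xxxx") means UTF-8.
    return $encoding_for{$pattern} // 'UTF-8';
}

print sniff_json_encoding('{"a":1}'), "\n";
```

An explicit charset parameter on the Content-Type header makes this kind of guessing unnecessary for clients that honour it.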
Subject: Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date: Mon, 14 Jun 2010 16:42:05 -0400
To: bug-CGI.pm [...] rt.cpan.org
From: Mark Stosberg <mark [...] summersault.com>
> Do you need more from me here?
That's been very helpful, thanks.

Yanick, feel free to jump in here with comments or action.

Mark
On Mon Jun 14 16:42:15 2010, mark@summersault.com wrote:
> Yanick, feel free to jump in here with comments or action.
Aye, aye, sir. :-)

I'll look more closely at it when I have a few minutes to myself, but I'm thinking that the fix could be to change, at CGI.pm ~ line 1561,

    if (defined $charset) {
        $self->charset($charset);
    } else {
        $charset = $self->charset if $type =~ /^text\//;
    }

to

    if (defined $charset) {
        $self->charset($charset);
    } else {
        $charset = $self->charset;
    }

In other words, adopt the charset if it has been configured, no matter what the type is. The logic being that if somebody doesn't want the charset to be specified, he or she would not set it to any value in the first place.

Anyway, I'll try to come up with a patch and some tests for it soonishly.
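The difference between the two variants can be sketched with a small stand-alone model (an illustration only, not the actual CGI.pm code). The function content_type and its parameters are hypothetical names; the rule that an empty charset suppresses the parameter is an assumption of this model, taken from the workaround mentioned later in the thread.

```perl
use strict;
use warnings;

# Hypothetical model of how the Content-Type value is assembled.
#   $type       - media type passed to header()
#   $explicit   - the -charset argument to header(), or undef if omitted
#   $configured - the value set earlier via $cgi->charset(...)
#   $text_only  - true models the original code, false the proposed change
sub content_type {
    my ($type, $explicit, $configured, $text_only) = @_;

    my $charset = $explicit;
    if (!defined $charset) {
        # Original: fall back to the configured charset only for text/*.
        # Proposed: fall back to it for every media type.
        $charset = $configured if !$text_only || $type =~ m{^text/};
    }

    # Assumption of this model: an empty charset means "send none".
    $type .= "; charset=$charset" if defined $charset && $charset ne '';
    return $type;
}

print content_type('application/json', undef, 'UTF-8', 1), "\n";  # original
print content_type('application/json', undef, 'UTF-8', 0), "\n";  # proposed
```

In this model, text/* types and calls that pass an explicit charset behave identically under both variants; only non-text types with a configured charset change.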
Subject: Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date: Wed, 16 Jun 2010 10:12:43 -0400
To: bug-CGI.pm [...] rt.cpan.org, lstein [...] cshl.org
From: Mark Stosberg <mark [...] summersault.com>
> I'm thinking that the fix could be to change, at CGI.pm ~ line 1561,
>
>     if (defined $charset) {
>         $self->charset($charset);
>     } else {
>         $charset = $self->charset if $type =~ /^text\//;
>     }
>
> to
>
>     if (defined $charset) {
>         $self->charset($charset);
>     } else {
>         $charset = $self->charset;
>     }
>
> In other words, adopt the charset if it has been configured, no matter
> what the type is. The logic being that if somebody doesn't want the
> charset to be specified, he or she would not set it to any value in the
> first place.
>
> Anyway, I'll try to come up with a patch and some tests for it soonishly.
Thanks. That seems reasonable to me.

Lincoln, would you like to review this change request and weigh in?

Mark
Subject: Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date: Thu, 17 Jun 2010 20:30:07 -0400
To: bug-CGI.pm [...] rt.cpan.org
From: Yanick Champoux <yanick [...] babyl.dyndns.org>
Changes committed to http://github.com/yanick/CGI.pm/tree/charset

The only thing I'm afraid of is apps that might freak upon receiving charsets for some mimetypes. Reading the HTTP specs at http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html, we have (as was quoted by Gerv)

=head1 SPECS

Some HTTP/1.0 software has interpreted a Content-Type header without charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient.

Unfortunately, some older HTTP/1.0 clients did not deal properly with an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document. See section 3.7.1.

=cut

The first paragraph seems to advocate the explicit mention of the charset (yay!), but the second one warns that some clients can indeed not like it. Now, new code can easily circumvent that problem by calling

    header( -charset => '' );

but old code will suddenly behave differently than it did before. Does anyone think that this could be an issue for someone out there?

`/anick
Subject: Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date: Fri, 18 Jun 2010 09:27:27 -0400
To: bug-CGI.pm [...] rt.cpan.org
From: Mark Stosberg <mark [...] summersault.com>
> but old code will suddenly behave differently than it did before. Does
> anyone think that this could be an issue for someone out there?
Thanks for the review, Yanick.

To answer your question, I'll say: "Yes."

Generally, I like the philosophy of leaving CGI.pm alone unless there is clearly a bug. Any change potentially affects thousands of users. For things that are enhancements or maybe-bugs, I think it's reasonable to require that at least a couple of independent parties feel strongly that the behavior should be changed.

In this case, it seems like there is a reasonable workaround: an explicit charset can be set if you don't like the default.

Mark
Subject: Re: [rt.cpan.org #57945] Charset is only added to text/* types
Date: Thu, 24 Jun 2010 20:15:37 -0400
To: bug-CGI.pm [...] rt.cpan.org
From: Yanick Champoux <yanick [...] babyl.dyndns.org>
On June 18, 2010 09:27:38 am mark@summersault.com via RT wrote:
> > but old code will suddenly behave differently than it did before. Does
> > anyone think that this could be an issue for someone out there?
>
> To answer your question, I'll say: "Yes."
>
> Generally, I like the philosophy of leaving CGI.pm alone unless there
> is clearly a bug. Any change potentially affects thousands of users.
> For things that are enhancements or maybe-bugs, I think it's reasonable
> that at least a couple independent parties feel strongly that the
> behavior should be changed.
Sounds very reasonable, very prudent. I wholeheartedly agree.

Joy,
`/anick
Gerv,

So the summary for now is that we don't plan to modify CGI.pm immediately for this. If more of a consensus arises that this should be changed, and doing so would be more spec-compliant or real-world-compliant, we'll consider revisiting it.

We'll leave the ticket open so that others might find it, and you can also promote the change to others to gather support for it if you'd like.

Thanks again for the feedback on CGI.pm.

Mark
Mark,

I think you have slightly misunderstood this bug. This bug is not about sending a charset when one is not set, it's about the fact that I want to send a charset and I can't!

If Perl code (like mine) explicitly calls:

    $self->charset("UTF-8")

then I want CGI.pm to do what I've explicitly said and send the charset=UTF-8 parameter on the content type.

Is that really a change that will break backwards compatibility? The only old code which will behave differently is code which sets a charset but didn't notice it wasn't actually being sent, and is actually talking to a very (very very) old HTTP/1.0 server which breaks when one _is_ sent. In other words, code which actually finally gets what it asks for, but turns out not to want it.

If no charset is specified, keep not sending one. I'm fine with that.

Can someone ping me with future updates? Having logged in with an OpenID, I don't get mail. gerv@gerv.net. Thanks :-)

Gerv
RT-Send-CC: gerv [...] gerv.net
On Tue Jul 20 00:07:33 2010, http://www.gerv.net/ wrote:
> Mark,
>
> I think you have slightly misunderstood this bug. This bug is not about
> sending a charset when one is not set, it's about the fact that I want
> to send a charset and I can't!
I see. In that case, Yanick's proposed fix is the same thing that I would recommend.

I've added some more tests for charset() now and merged and pushed Yanick's work. It should be in our next release.

Mark
Subject: patch released for CGI.pm
Thanks for the bug report. A patch for it appeared in 3.51, if not sooner. Resolving.

Mark