On Fri May 28 08:37:50 2010, mark@summersault.com wrote:
> Thanks for the report, Gerv.
>
> Could you review the related RFCs for us and summarize what they have
> to say about when charsets should or should not be added to
> content-type headers?
Hmm. Sorry for the delay. I don't get email on changes (having used
OpenID) and I can't see a "profile" or similar link in RT where I can
add an email address...
Here are some bits of relevant text:
RFC 2616 (HTTP), section 3.4.1:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html
"3.4.1 Missing Charset
Some HTTP/1.0 software has interpreted a Content-Type header without
charset parameter incorrectly to mean "recipient should guess." Senders
wishing to defeat this behavior MAY include a charset parameter even
when the charset is ISO-8859-1 and SHOULD do so when it is known that it
will not confuse the recipient.
Unfortunately, some older HTTP/1.0 clients did not deal properly with an
explicit charset parameter. HTTP/1.1 recipients MUST respect the charset
label provided by the sender; and those user agents that have a
provision to "guess" a charset MUST use the charset from the
content-type field if they support that charset, rather than the
recipient's preference, when initially displaying a document. See
section 3.7.1."
Section 3.7.1 states, in part:
"The "charset" parameter is used with some media types to define the
character set (section 3.4) of the data. When no explicit charset
parameter is provided by the sender, media subtypes of the "text" type
are defined to have a default charset value of "ISO-8859-1" when
received via HTTP. Data in character sets other than "ISO-8859-1" or its
subsets MUST be labeled with an appropriate charset value."
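To make the 3.7.1 rule concrete, here is a minimal Python sketch of how a recipient might work out the charset in effect. The function name effective_charset is made up for this example, and returning None for non-text types is my assumption: RFC 2616 only defines a default for "text" subtypes.

```python
from email.message import Message
from email.utils import collapse_rfc2231_value


def effective_charset(content_type):
    """Return the charset implied by a Content-Type value, or None.

    Hypothetical helper: an explicit charset parameter wins; otherwise
    text/* falls back to ISO-8859-1 (RFC 2616, section 3.7.1).
    """
    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_param("charset")
    if charset is not None:
        # get_param may return an RFC 2231 tuple; collapse it to a string.
        return collapse_rfc2231_value(charset).lower()

    if msg.get_content_maintype() == "text":
        return "iso-8859-1"

    # No default is defined here for non-text types (assumption: report None).
    return None
```

With this sketch, "text/html" yields "iso-8859-1", while "application/json" with no parameter yields None, which is exactly the gap under discussion.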
RFC 4627 (JSON) says:
http://www.ietf.org/rfc/rfc4627.txt
"JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
It then talks about sniffing the stream to be able to tell if other
encodings have, in fact, been used. Ick.
So one could argue:
"Types other than text types are supposed to have a canonical
representation defined when they are registered. For JSON, the canonical
representation is "Unicode", and the defined method of determining the
character set is by sniffing the first four bytes of the stream."
Or one could argue:
"My word, that's horrible. Add a charset parameter, for the love of Pete!"
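For reference, the sniffing that RFC 4627 describes (section 3) looks at the pattern of zero octets in the first four bytes, relying on the first two characters of a JSON text always being ASCII. A minimal Python sketch follows; the function name sniff_json_encoding is made up, and treating inputs shorter than four bytes as UTF-8 is my assumption (a two-character text like "[]" can only be that short in UTF-8).

```python
import json

# Pattern of zero octets in the first four bytes -> encoding (RFC 4627, section 3).
# True marks a 0x00 octet, False marks a non-zero octet.
_NULL_PATTERNS = {
    (True, True, True, False): "utf-32-be",   # 00 00 00 xx
    (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
    (False, True, True, True): "utf-32-le",   # xx 00 00 00
    (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    (False, False, False, False): "utf-8",    # xx xx xx xx
}


def sniff_json_encoding(data):
    """Guess the Unicode encoding of a JSON byte stream (hypothetical helper)."""
    head = data[:4]
    if len(head) < 4:
        # Assumption: anything this short can only be UTF-8.
        return "utf-8"
    pattern = tuple(byte == 0 for byte in head)
    try:
        return _NULL_PATTERNS[pattern]
    except KeyError:
        raise ValueError("first four octets match no RFC 4627 pattern")


def load_json_bytes(data):
    """Decode with the sniffed encoding, then parse."""
    return json.loads(data.decode(sniff_json_encoding(data)))
```

It works, but every consumer has to carry something like this around, which is rather the point of the second argument.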
As for XML, the situation is clearer. RFC 3023 says:
http://www.rfc-editor.org/rfc/rfc3023.txt
"3.2 Application/xml Registration
MIME media type name: application
MIME subtype name: xml
Mandatory parameters: none
Optional parameters: charset
Although listed as an optional parameter, the use of the charset
parameter is STRONGLY RECOMMENDED, ..."
So that's at least one non-text type where a charset should be added.
In general, I suspect most XML MIME types (and there are _a_lot_) have
this attitude to charsets. People need to be able to set this parameter.
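To illustrate the behaviour I am asking for, here is a rough Python sketch, not the module's actual API: content_type_header is a hypothetical helper that attaches a charset whenever the caller supplies one, whether or not the type is text/*.

```python
def content_type_header(media_type, charset=None):
    """Build a Content-Type header line (hypothetical helper).

    The charset parameter is added for any media type when the caller
    asks for it, and omitted otherwise.
    """
    if charset:
        return f"Content-Type: {media_type}; charset={charset}"
    return f"Content-Type: {media_type}"
```

So content_type_header("application/xml", "utf-8") would produce "Content-Type: application/xml; charset=utf-8", as RFC 3023 recommends.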
Do you need more from me here?
Gerv