Description
For a long time we've had a fallback value in response.encoding of ISO-8859-1, because RFC 2616 told us to. RFC 2616 is now obsolete, replaced by RFCs 7230, 7231, 7232, 7233, 7234, and 7235. The authoritative RFC on this issue is RFC 7231, which has this to say:
The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says.
The media type definitions for text/* are most recently affected by RFC 6657, which has this to say:
In accordance with option (a) above, registrations for "text/*" media types that can transport charset information inside the corresponding payloads (such as "text/html" and "text/xml") SHOULD NOT specify the use of a "charset" parameter, nor any default value, in order to avoid conflicting interpretations should the "charset" parameter value and the value specified in the payload disagree.
I checked the registration for text/html here. Unsurprisingly, it provides no default values. It does allow a charset parameter which overrides anything in the content itself.
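For reference, the fallback being discussed can be approximated in a few lines. This is a stdlib-only sketch of the header-based logic, not the library's actual code; the function name legacy_encoding is made up for illustration.

```python
from email.message import Message


def legacy_encoding(content_type):
    """Approximate the RFC 2616-era behaviour (hypothetical helper).

    An explicit charset parameter wins; otherwise any text/* type
    falls back to ISO-8859-1, and everything else gets no encoding.
    """
    msg = Message()
    msg["Content-Type"] = content_type

    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header carries no charset.
    charset = msg.get_content_charset()
    if charset:
        return charset

    # This is the fallback that RFC 7231 no longer mandates.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


if __name__ == "__main__":
    for header in ("text/html; charset=UTF-8", "text/html", "application/json"):
        print(header, "->", legacy_encoding(header))
```

The middle case is the one at issue: a text/html response with no charset parameter is reported as ISO-8859-1 even though neither RFC 7231 nor the text/html registration supplies that default.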
I propose the following changes:
- Remove the ISO-8859-1 fallback, as it's no longer valid (being only enforced by RFC 2616). We should definitely do this.
- Consider writing a module that has the appropriate fallback encodings for other text/* content and use them where appropriate. This isn't vital, just a "might be nice".
- Begin checking HTML content for meta tags again, in order to appropriately fall back. This is controversial, and we'll want @kennethreitz to consider it carefully.
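A rough sketch of how the three proposals could fit together, using only the standard library. Everything here is an assumption for discussion: the name proposed_encoding, the fallbacks argument (standing in for the per-media-type module), the META_CHARSET regex, and the 1024-byte scan window are all hypothetical, and a real meta-tag check would need to be more careful than a regex.

```python
import re
from email.message import Message

# Hypothetical pattern: matches <meta charset="..."> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def proposed_encoding(content_type, body, fallbacks=None):
    """Pick an encoding without the blanket ISO-8859-1 default.

    content_type: raw Content-Type header value
    body:         response body as bytes
    fallbacks:    optional {media type: encoding} table supplied by the
                  caller (stand-in for the proposed fallback module)
    """
    msg = Message()
    msg["Content-Type"] = content_type

    # 1. An explicit charset parameter overrides anything in the content.
    charset = msg.get_content_charset()
    if charset:
        return charset

    media_type = msg.get_content_type()

    # 2. For HTML, look for a meta tag near the start of the body.
    if media_type == "text/html":
        match = META_CHARSET.search(body[:1024])
        if match:
            return match.group(1).decode("ascii").lower()

    # 3. Per-media-type fallback if one is known; otherwise None,
    #    which leaves the decision to the caller.
    return (fallbacks or {}).get(media_type)
```

With this shape, a text/html response with no charset and no meta tag yields None rather than ISO-8859-1, so the caller can decide whether to detect the encoding or refuse to guess.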