requests: utils.get_encoding_from_headers returns ISO-8859-1 incorrectly

When I call get_encoding_from_headers on the response headers for this URL:

http://thelastpsychiatrist.com/2012/02/my_fiancee_is_pushing_me_away.html

The returned encoding is ISO-8859-1:

(Pdb) get_encoding_from_headers(self.response.headers)
'ISO-8859-1'

Even though the headers don’t declare that character set:

(Pdb) self.response.headers
{'date': 'Sun, 11 Mar 2012 21:10:40 GMT', 'transfer-encoding': 'chunked', 'content-type': 'text/html', 'server': 'Apache/2.2.22'}

It looks like this was an intentional choice in the source, but it is problematic for me: if I knew that the encoding was guessed rather than declared, I’d check the HTML meta tag myself, which would then properly parse as UTF-8.
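
The meta-tag check mentioned above could be sketched roughly as follows. This is a minimal illustration using only the standard library; `charset_from_meta` and `META_CHARSET_RE` are hypothetical names, not part of requests, and a regex scan is only an approximation of real HTML parsing.

```python
import re

# Hypothetical helper, not part of requests: look for a charset declared in
# a <meta> tag, covering both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="...; charset=..."> form.
META_CHARSET_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def charset_from_meta(body, scan_limit=4096):
    """Return the charset named in an HTML meta tag, or None if none is found.

    Only the first `scan_limit` bytes are searched (an arbitrary choice for
    this sketch), since the declaration normally sits near the top of <head>.
    """
    match = META_CHARSET_RE.search(body[:scan_limit])
    return match.group(1).decode("ascii") if match else None


html = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head></html>'
encoding = charset_from_meta(html)
```

A caller would only want to reach for this when the header did not declare a charset, which is exactly the case that cannot currently be detected from the return value.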

I think the better solution is to either return None explicitly, or provide a default kwarg that callers could set to an encoding of their choice.
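
To make the proposal concrete, here is a rough sketch of what such a signature might look like. It is a standalone stand-in written against the standard library, not the requests implementation; the name `encoding_from_headers` and the `default` parameter are assumptions for illustration, and it expects a plain dict with a lowercase `content-type` key like the one shown above.

```python
from email.message import Message


def encoding_from_headers(headers, default=None):
    """Return the charset declared in Content-Type, or `default` if absent.

    Hypothetical stand-in for the proposed behaviour: nothing is guessed
    unless the caller explicitly passes a `default`.
    """
    content_type = headers.get("content-type")
    if not content_type:
        return default
    # Let the stdlib parse the header parameters for us.
    msg = Message()
    msg["content-type"] = content_type
    # get_content_charset() lowercases the value and returns failobj
    # when no charset parameter is present.
    return msg.get_content_charset(failobj=default)


headers = {"content-type": "text/html", "server": "Apache/2.2.22"}

declared = encoding_from_headers(headers)  # nothing declared, nothing guessed
legacy = encoding_from_headers(headers, default="ISO-8859-1")  # opt-in fallback
```

With this shape, callers who want the old behaviour pass the default themselves, while everyone else can tell a declared charset apart from a missing one.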

I can patch this if it sounds like a good solution.

About this issue

State: closed
Created 12 years ago
Comments: 16 (16 by maintainers)

Most upvoted comments

For future reference to anyone who stumbles upon this, the spec is:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1

The “charset” parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the “text” type are defined to have a default charset value of “ISO-8859-1” when received via HTTP. Data in character sets other than “ISO-8859-1” or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

umbrae on Mar 31, 2012
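
For illustration, the defaulting rule quoted from the spec can be sketched so that the caller also learns whether the charset was declared or merely assumed. This is a hedged example built on the standard library; `rfc2616_charset` is a hypothetical name and the tuple return shape is an assumption, not something requests provides.

```python
from email.message import Message


def rfc2616_charset(content_type):
    """Return (charset, declared) for a Content-Type header value.

    `declared` is True when the sender supplied a charset parameter and
    False when the value comes from the RFC 2616 default for text/* types.
    """
    msg = Message()
    msg["content-type"] = content_type
    charset = msg.get_content_charset()
    if charset is not None:
        return charset, True
    # RFC 2616 section 3.7.1: text/* without a charset defaults to ISO-8859-1.
    if msg.get_content_maintype() == "text":
        return "iso-8859-1", False
    return None, False


charset, declared = rfc2616_charset("text/html")
```

When `declared` comes back False, a caller can decide to inspect the document body instead of trusting the default.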