requests: HeaderParsingError: Failed to parse headers

INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1
DEBUG:requests.packages.urllib3.connectionpool:"POST /kkblog/ HTTP/1.1" 201 None
WARNING:requests.packages.urllib3.connectionpool:Failed to parse headers (url=http://127.0.0.1:5984/kkblog/): [MissingHeaderBodySeparatorDefect()], unparsed data: '³é\x97\xad\r\nETag: "1-967a00dff5e02add41819138abb3284d"\r\nDate: Fri, 15 Apr 2016 14:45:18 GMT\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Length: 69\r\nCache-Control: must-revalidate\r\n\r\n'
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 390, in _make_request
    assert_header_parsing(httplib_response.msg)
  File "/usr/lib/python3.5/site-packages/requests/packages/urllib3/util/response.py", line 59, in assert_header_parsing
    raise HeaderParsingError(defects=defects, unparsed_data=unparsed_data)
requests.packages.urllib3.exceptions.HeaderParsingError: [MissingHeaderBodySeparatorDefect()], unparsed data: '³é\x97\xad\r\nETag: "1-967a00dff5e02add41819138abb3284d"\r\nDate: Fri, 15 Apr 2016 14:45:18 GMT\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Length: 69\r\nCache-Control: must-revalidate\r\n\r\n'
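
For reference, the request boils down to something like this (a minimal sketch; host, port and database name are taken from the log above):

import requests

# Creating a document whose _id contains multi-byte characters makes CouchDB
# echo those bytes back in the Location response header, which is where the
# header parsing falls over (the POST itself still returns 201).
r = requests.post('http://127.0.0.1:5984/kkblog/', json={'_id': '关闭'})
print(r.status_code)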

Here is the same request with curl:

curl -v -X POST 127.0.0.1:5984/kkblog/ -H "Content-Type: application/json" -d '{"_id": "关闭"}'
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 5984 (#0)
> POST /kkblog/ HTTP/1.1
> Host: 127.0.0.1:5984
> User-Agent: curl/7.47.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 17
> 
* upload completely sent off: 17 out of 17 bytes
< HTTP/1.1 201 Created
< Server: CouchDB/1.6.1 (Erlang OTP/18)
< Location: http://127.0.0.1:5984/kkblog/关闭
< ETag: "3-bc27b6930ca514527d8954c7c43e6a09"
< Date: Fri, 15 Apr 2016 15:13:14 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 69
< Cache-Control: must-revalidate
< 
{"ok":true,"id":"关闭","rev":"3-bc27b6930ca514527d8954c7c43e6a09"}
* Connection #0 to host 127.0.0.1 left intact

The problem is the Location: http://127.0.0.1:5984/kkblog/关闭 header in the response. I tried other Chinese characters, but they did not cause the exception.

>>> '关闭'.encode('utf-8')
b'\xe5\x85\xb3\xe9\x97\xad'
>>> 
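
A plausible explanation for why this particular string breaks while most other Chinese characters don't: http.client decodes the raw header bytes as iso-8859-1 and hands the text to the email parser, which splits it with str.splitlines(). splitlines() treats U+0085 (NEL) as a line break, and the UTF-8 encoding of 关闭 happens to contain the byte 0x85:

>>> raw = 'Location: http://127.0.0.1:5984/kkblog/关闭'.encode('utf-8')
>>> raw.decode('iso-8859-1').splitlines()
['Location: http://127.0.0.1:5984/kkblog/å', '³é\x97\xad']

The leftover '³é\x97\xad' is exactly the "unparsed data" shown in the traceback above, so only characters whose UTF-8 bytes include 0x85 (or another code point that splitlines() treats as a line boundary) should trigger the exception.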

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 23 (10 by maintainers)

Most upvoted comments

@fake-name I don’t recall being puritanical about anything. Here is, word for word, what I said (literally quoting myself from this thread):

The header parsing is done by httplib, in the Python standard library; that is the part that failed to parse. The failure to parse is understandable though: servers should not be shoving arbitrary bytes into headers.

Using UTF-8 for your headers is extremely unwise, as discussed by RFC 7230:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII].

In this instance it’s not really possible for us to resolve the problem. The server should instead be sending urlencoded URLs, or RFC 2047-encoded header fields. Either way, httplib is getting confused here, and we can’t really step in and stop it.

Note the key part of this comment: “Either way, httplib is getting confused here, and we can’t really step in and stop it.”

This is what I mean when I say “it’s not really possible for us to resolve the problem”. The issue here is in a helper library that sits in the Python standard library. Changing the header parsing logic of that standard library module, while possible, is something that needs to be done as part of the standard CPython development process. Requests already carries more subclasses and monkeypatches to httplib than we’re happy with, and we’re strongly disinclined to carry more.
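
To make the compliant alternatives mentioned above concrete (a percent-encoded URL versus an RFC 2047 encoded-word), here is roughly what the server could have sent instead; a quick sketch using only the standard library:

from urllib.parse import quote
from email.header import Header

# A percent-encoded path segment, which keeps the Location header in US-ASCII:
print(quote('关闭'))                     # %E5%85%B3%E9%97%AD

# An RFC 2047 encoded-word, the other ASCII-safe way to carry non-ASCII text
# in a header field:
print(Header('关闭', 'utf-8').encode())  # =?utf-8?b?5YWz6Zet?=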

So here are the options for resolving this issue:

  1. File a bug with the CPython development team, get the bug fixed and into Python 3.6 or 3.7, upgrade to that Python version.
  2. Work out what the minimal possible invasive change is to httplib in order to allow parsing UTF-8 headers, propose that patch to us to see what we think of it.
  3. Write a patch that removes Requests’ requirement to use httplib at all so that the next time we have a problem that boils down to “httplib is stupid”, we don’t have to have this argument again.

Now, you are welcome to pursue any of those options, but I’ve been to this rodeo a few times so I’m pursuing (3), which is the only one that actually makes this problem go away for good. Unfortunately, it turns out that replacing our low-level HTTP stack that we have spent 7 years integrating with takes quite a lot of work, and I can’t just vomit out the code to fix this on demand.

To sum up: I didn’t say I didn’t think this was a problem or a bug, I said it was a problem that the Requests team couldn’t fix, at least not on a timescale that was going to be helpful to this user. If you disagree, by all means, provide a patch to prove me wrong.


And let me make something clear. For the last 9 months or so I have been the most active Requests maintainer by a long margin. Requests is not all I do with my time. I maintain 15 other libraries and actively contribute to more. I have quite a lot of stuff I am supposed to be doing. So I have to prioritise my bug fixing.

Trust me when I say that a bug where the effort required to fix it is extremely high, and where the flaw comes from a server emitting non-RFC-compliant output, is not a bug that screams out “must be fixed this second”. Any time a bug is predicated on the notion that our peer isn’t spec compliant, that bug drops several places down my priority list. Postel was wrong.

Browsers are incentivised to support misbehaving servers because they are in a competitive environment, and users only blame them when things go wrong. If Chrome doesn’t support ${CRAPPY_WEBSITE_X} then Chrome users will just go to a browser that does when they need access.

That’s all well and good, but the reason Requests doesn’t do this is that we have two regular developers. That’s it. There are only so many things two developers can do in a day. Neither of us works on just Requests. Compare this to Chrome, which has tens of full-time developers and hundreds of part-time ones. If you want Requests to work on every site where Chrome does, then I have bad news for you, my friend, because it’s just never going to happen.

I say all of this to say: please don’t berate the Requests team because we didn’t think your particular pet bug was important. We prioritise bugs and close ones we don’t think we’ll fix any time soon. If you would like to see this bug fixed, a much better option is to write the patch yourself. Shouting at me does not make me look fondly on your request for assistance.

I apologize. I assumed you were holding an opinion that you were not, and proceeded to be a complete ass.

In any event, I don’t disagree that this is an issue with the core library, but, well, in my experience complaining about encoding issues in the core library is non-productive (I have an issue with the built-in ftplib where it decodes some messages as iso-8859-1 even when in utf-8 mode, which I was only able to solve by monkey-patching the stdlib).

Anyways, assuming you’re OK with monkey-patching, here’s a simple snippet that patches http.client to make it much, MUCH more robust to arbitrary header encodings:



# Prefer the faster cchardet if it is installed; otherwise fall back to plain chardet.
try:
    import cchardet as chardet
except ImportError:
    import chardet

import http.client
import email.parser

def parse_headers(fp, _class=http.client.HTTPMessage):
    """Parses only RFC2822 headers from a file pointer.

    email Parser wants to see strings rather than bytes.
    But a TextIOWrapper around self.rfile would buffer too many bytes
    from the stream, bytes which we later need to read as bytes.
    So we read the correct bytes here, as bytes, for email Parser
    to parse.

    Note: Monkey-patched version to try to more intelligently determine
    header encoding

    """
    headers = []
    while True:
        line = fp.readline(http.client._MAXLINE + 1)
        if len(line) > http.client._MAXLINE:
            raise http.client.LineTooLong("header line")
        headers.append(line)
        if len(headers) > http.client._MAXHEADERS:
            raise http.client.HTTPException("got more than %d headers" % http.client._MAXHEADERS)
        if line in (b'\r\n', b'\n', b''):
            break

    hstring = b''.join(headers)
    inferred = chardet.detect(hstring)
    if inferred and inferred['confidence'] > 0.8:
        print("Parsing headers!", hstring)
        hstring = hstring.decode(inferred['encoding'])
    else:
        hstring = hstring.decode('iso-8859-1')

    return email.parser.Parser(_class=_class).parsestr(hstring)

http.client.parse_headers = parse_headers

Note: this does require cchardet or chardet. I’m open to better ways to determine the encoding. It simply overrides the http.client.parse_headers() function in the stdlib, which is kind of squicky.

Splatting the above into a file, and then just importing it at the beginning of the requests/__init__.py file seems to solve the problem.
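
For what it’s worth, you don’t even have to touch requests itself: http.client looks parse_headers up as a module global at call time, so just importing the patch before making any requests should be enough. Something like this (assuming the snippet above is saved as header_patch.py; the module name is just illustrative):

import header_patch  # the monkey-patch above, saved locally (hypothetical name)
import requests

r = requests.post('http://127.0.0.1:5984/kkblog/', json={'_id': '关闭'})
print(r.status_code)  # per the above, the "Failed to parse headers" warning should no longer appear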

And again, whether it’s compliant is irrelevant. There are servers out there that act like this. Both my browser and cURL are completely happy talking to them, yet requests explodes.


Hell, I’m interacting with Cloudflare, and it’s serving UTF-8 headers. So basically Unicode header support is massively, MASSIVELY deployed and available, standards be damned.

If you’re dead set on being puritanical about RFC support, the only people who are harmed are people who want to use the requests library.

In this instance it’s not really possible for us to resolve the problem. The server should instead be sending urlencoded URLs, or RFC 2047-encoded header fields. Either way, httplib is getting confused here, and we can’t really step in and stop it.

So… what about the thousands and thousands of potential servers that I can’t just SSH into and fix?

Basically, “just follow the RFC” is a complete non-answer, because I don’t control the world (I’m taking minion applications, though!).
The fact is, servers out there serve UTF-8 headers. This is not something I can fix, because they’re not my servers. My web browser handles this situation just fine, so it’s clearly possible to make it work.

As it is, requests fails on these servers. This is fixable, because I control the code on my local machine.