httpcore: Request errors on overloaded systems may poison the connection pool
It is somewhat related to #502, as we first observed the issue below in httpcore 0.14.5. Upgrading to 0.14.7 seemingly fixed it, but we found that there is still an underlying race condition that can poison the pool, making it unable to serve requests ever again. The issue came up using `httpx==0.22.0` and `httpcore==0.14.7`; it is not present with `httpx==0.16.1` and `httpcore==0.12.3`.
Here is a script that brute-forces the client and reliably triggers the issue on my machine. (The script is written for the sync version, but we are seeing this issue in production with the async version as well.)
Repro script
PARALLELISM and REQUESTS may need adjustments to reproduce the issue.
```python
import multiprocessing.pool
import socket

import httpx

PARALLELISM = 20
REQUESTS = 200

# A listening socket that never accepts: connection attempts will hang.
unavailable_listener = socket.socket()
unavailable_listener.bind(("localhost", 0))
unavailable_listener.listen()
unavailable_url = "http://{}:{}".format(*unavailable_listener.getsockname())

client = httpx.Client(
    limits=httpx.Limits(max_connections=1, max_keepalive_connections=0)
)
pool = multiprocessing.pool.ThreadPool(PARALLELISM)

# client.get(...) works well at this point.

# Raises httpcore.PoolTimeout as expected, because the pool has been exhausted.
pool.map(lambda _: client.get(unavailable_url), range(REQUESTS))

# client.get(...) will still raise PoolTimeout, even though there are
# supposed to be no outstanding connections.
```
After running the above code, the pool is no longer operational, containing a stuck connection:
```python
>>> client._transport._pool._pool
[<HTTPConnection [CONNECTING]>]
>>> client._transport._pool._pool[0]._connect_failed
False
>>> client._transport._pool._pool[0]._connection
>>>
>>> client._transport._pool._pool[0].is_available()
False
```
My hunch is that `RequestStatus.set_connection()` can succeed from `_attempt_to_acquire_connection`, while `status.wait_for_connection()` can still time out on an overloaded system, leaving the half-created connection behind. Subsequent request attempts won't be able to use this connection instance because it is not available, and it won't be cleaned up because it is neither closed nor failed.
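A minimal sketch of the suspected ordering, using a plain `threading.Event` in place of httpcore's internal request-status machinery (the names and mechanics here are simplified stand-ins, not the real API):

```python
import threading

connection_assigned = threading.Event()
acquired = {}  # records what happened to the half-created connection

def waiting_request():
    # The request waits for a connection, but on an overloaded system the
    # timeout elapses before the pool's assignment is observed.
    if not connection_assigned.wait(timeout=0.01):
        acquired["result"] = "PoolTimeout"  # request gives up and is removed

def pool_acquisition():
    # The acquisition path still "succeeds" afterwards: it assigns the
    # half-created connection to a request that is no longer waiting.
    acquired["connection"] = "<HTTPConnection [CONNECTING]>"
    connection_assigned.set()

waiter = threading.Thread(target=waiting_request)
waiter.start()
waiter.join()       # the waiter times out first...
pool_acquisition()  # ...then the assignment lands anyway

# The connection is neither closed nor failed, so nothing ever cleans it up.
assert acquired == {
    "result": "PoolTimeout",
    "connection": "<HTTPConnection [CONNECTING]>",
}
```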
The interesting thing about this situation is that a freshly created connection can be put into the pool without any guarantee that it will be used by the current request.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 12
- Comments: 18 (3 by maintainers)
Commits related to this issue
- Fix pool poisoning on overloaded systems (#550) — committed to nihilSup/httpcore by deleted user a year ago
- Tests for pool poisoning (#550) — committed to nihilSup/httpcore by deleted user a year ago
- Fixes imports (#550) — committed to nihilSup/httpcore by deleted user a year ago
- Fix imports and black (#550) — committed to nihilSup/httpcore by deleted user a year ago
- Fix linters (#550) — committed to nihilSup/httpcore by deleted user a year ago
I experimented with versions 0.16.3, 0.17.1 and 0.18.0. With some long-responding server on localhost:5522, the following code fails with PoolTimeout. You can get and run such a long-responding server from this comment: https://github.com/encode/httpx/issues/1171#issuecomment-1850923234

I expect that the following code will not raise PoolTimeout, especially since I cancelled all the tasks and waited an additional 5 seconds for the cancellation to complete.
My case is probably related to the issue discussed here.
Facing this or very similar behaviour under some load in a production microservice after upgrading httpx (to `httpx==0.23.0`) and Python to 3.10. After some time and several ConnectTimeout/ReadTimeout errors, absolutely all requests finish with PoolTimeout. Downgrading to `httpx==0.18.1` solved the issue.

Could reproduce the issue with pure httpcore (adapted @vlaci's script, `httpcore==0.16.3`).
Some observations

Every task tries to acquire a connection (`AsyncConnectionPool._attempt_to_acquire_connection` in `AsyncConnectionPool.handle_async_request`). The first one (let's call it t1) succeeds and passes through `status.wait_for_connection` in `AsyncConnectionPool.handle_async_request`, while the other tasks hang there. Then t1 can create a real connection in `connection.handle_async_request(request)`, but fails with ReadTimeout. This leads to a call to `self.response_closed(status)`: the real connection is closed, the connection wrapper is removed from the pool, and the request status is removed from the separate register as well. After this, a new request status creates another connection wrapper with an empty connection.

Under normal conditions, acquiring a new connection wrapper triggers `status.wait_for_connection` in some task, a new real connection can then be established, and so on. But under load the asyncio event loop is not fast enough to switch coroutines, and a PoolTimeout happens. So no one can establish a real connection (all tasks have finished with PoolTimeout and been removed), and a connection wrapper without a real connection is left inside the pool. Now two things can happen.

First, the pool receives a new request to some different URL. It tries to handle it, but hangs until PoolTimeout on `status.wait_for_connection`, because the pool can't create a new connection wrapper, and the old one can neither process the request nor be removed (the connection wrapper has no real connection, so it is not idle or expired; it is "CONNECTING").

Second, the pool receives a new request to the same URL as the "CONNECTING" connection wrapper inside the pool. In this case the request can't be processed either: the wrapper could in principle handle the request, but it is not available (`_connect_failed` is False, `_connection` is None, and the HTTP/2 flag is False as well).
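Why the wrapper is stuck can be seen from the predicates the pool consults. The sketch below uses a simplified stand-in for the wrapper's state (assumed names mirroring the `_connect_failed`/`_connection` attributes shown earlier, not the real class):

```python
class ConnectionWrapper:
    """Simplified stand-in for the pool's HTTPConnection wrapper."""

    def __init__(self):
        self._connection = None       # real connection was never established
        self._connect_failed = False  # ...and the connect attempt never failed

    def is_available(self):
        # Reuse path: without an underlying connection, nothing to reuse.
        return self._connection is not None and not self._connect_failed

    def is_closed(self):
        # Cleanup path: only failed connections count as closed here.
        return self._connect_failed

    def is_idle(self):
        # Expiry path: a connection that never existed cannot be idle.
        return self._connection is not None


zombie = ConnectionWrapper()  # the "CONNECTING" wrapper left in the pool

# New requests cannot reuse it...
assert not zombie.is_available()
# ...and the pool's cleanup sweep cannot evict it either:
assert not zombie.is_closed() and not zombie.is_idle()
```

Because every predicate is False, both code paths skip the wrapper: it occupies the single `max_connections=1` slot forever.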
I see two approaches:
I tried the second approach and it worked well: no pool timeouts, and the "control" requests are successful:
Can confirm that I’ve also reproduced the issue.
I adapted @nihilSup's example slightly, using `trio`...
I was able to confirm that the ReadError/PoolTimeout combination caused the error, and that it resulted in a state where we're handling the PoolTimeout, but the connection has been assigned to the waiting request. I was also able to confirm the suggested fix, although I'm not yet entirely comfortable with it. It seems a little brittle, and I'd prefer to first understand whether we're able to avoid this state.

I wonder if we're able to reliably reproduce this state using a mocked-out network backend, similar to some of the cases in `tests/_async/test_connection_pool.py`?

Not stale.
An observation after I played with your test code locally, @nihilSup: we can reproduce this zombie situation without ReadTimeout, and even without the hanging backend, under `tests/_async/test_connection_pool.py`.

I think the reason this achieves the same outcome is that, since we set the timeout to 0, `status.wait_for_connection(timeout=0)` will immediately raise PoolTimeout, but before that it was assigned a connection in https://github.com/encode/httpcore/blob/7eb20224833b8165d0945d08ac8f714ccb6750b9/httpcore/_async/connection_pool.py#L221 This could be a more minimal reproduction of the scenario. The difference is that in the real situation the connection assignment happened in another request, via the ReadTimeout.

@tomchristie Over the weekend I prepared a draft PR with tests for both the sync and async implementations:
- `httpcore/backends/mock.py`
- `tests/_async/test_connection_pool.py`
- `tests/_sync/test_connection_pool.py`
`assert pool.connections[0].is_connecting()` can be skipped, because it relies on my helper method.

@tomchristie @florimondmanca apologies for the ping, but can you take a look at the comments here? It seems like an important issue.
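Returning to the `timeout=0` observation above: that deterministic shortcut can be modeled without httpcore at all. The sketch below uses a simplified stand-in (assumed names, not the real `RequestStatus` API) mimicking an anyio-style cancel scope, where a zero timeout cancels the wait even though the connection was assigned before the wait began:

```python
import threading

class RequestStatus:
    """Simplified stand-in; real httpcore uses anyio events and cancel scopes."""

    def __init__(self):
        self._event = threading.Event()
        self.connection = None

    def set_connection(self, connection):
        self.connection = connection
        self._event.set()

    def wait_for_connection(self, timeout=None):
        # Model of a cancel scope with a zero deadline: it fires before the
        # already-set event can be observed.
        if timeout is not None and timeout <= 0:
            raise TimeoutError("PoolTimeout")
        self._event.wait(timeout)
        return self.connection


status = RequestStatus()
status.set_connection("half-created connection")  # assignment happens first...

try:
    status.wait_for_connection(timeout=0)  # ...yet the request still gives up
except TimeoutError:
    pass

# Same poisoned end state as the ReadTimeout-driven race, reached
# deterministically: a connection was assigned, but no request will use it.
assert status.connection is not None
```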