redis-py: BaseException at I/O corrupts Connection
tl;dr repeat of https://github.com/redis/redis-py/issues/360
My codebase uses gevent (w/monkey-patching) for concurrent workers, and schedules a gevent.Timeout to force a deadline on the workers. Since a timeout causes an exception to be raised from an arbitrary “yield point”, there’s risk of corrupting shared state if code is not written to be exception-safe.
The code talks to redis over a client that’s shared between the workers. Socket I/O is a “yield point”. Sending a command successfully, but then failing to read the entire response off the socket, gets Connection into an undefined state: a subsequent command would try to parse a response intended for a previous command, and in many cases would fail. Here’s what a typical error would look like:
File "site-packages/redis/client.py", line 3401, in parse_response
result = Redis.parse_response(
File "site-packages/redis/client.py", line 774, in parse_response
return self.response_callbacks[command_name](response, **options)
File "site-packages/redis/client.py", line 528, in <lambda>
'SET': lambda r: r and nativestr(r) == 'OK',
File "site-packages/redis/_compat.py", line 135, in nativestr
return x if isinstance(x, str) else x.decode('utf-8', 'replace')
AttributeError: 'int' object has no attribute 'decode'
As you can see, we’re trying to parse an integer response to a previous command (e.g. ":12345\r\n") as a string response to a SET command.
The except Exception block in redis.connection.Connection.read_response is intended to handle all sorts of parsing errors by disconnecting the connection:
https://github.com/redis/redis-py/blob/6219574b042a6596b150ca8248441198f01f8c87/redis/connection.py#L819-L821
but perhaps it could be changed to except: since e.g. gevent.Timeout is intentionally a BaseException to suggest that it should not be suppressed (and we are indeed not suppressing it).
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 52 (36 by maintainers)
Digging through history:
PubSub.on_connectseems to re-establish subscriptions, but there’s probably a brief loss)@Chronial Definitely an attention-grabbing post mortem these days 😄
When it happened back in our servers at Lyft’s Bikes & Scooters, it was very intermittent — it took maybe 1 day for a single python node to get into a broken state (for a single connection in a pool, so even the same node didn’t always exhibit a problem), and often someone would term that node to “fix” it, and move on.
Then one day I took a deep dive, pinned py-redis < 4.4 and added a local regression test to prevent a kind soul from upgrading blindly, but not even a post mortem…
First, let’s agree that all things being equal, we’d rather execute the least code possible to handle a
BaseException. If a complex recovery is necessary, it’s best to “earmark” that connection as needing extra care, rather than perform that extra care in the error handler. Overall, writing interrupt-safe code is hard.The RESP protocol, while elegantly simple, doesn’t seem to address cancellations. It’s like the “touch-move” rule in chess.
When a
sendis interrupted:-1, it means nothing was sent over the wire and we’re in a defined state.0 < n < len(payload), then we’re in undefined state. There’s no way to signal a proverbial “^C” over RESP. For example, if we meant to send “EXISTS foo\r\n”, we might’ve only sent “EXI”, and next time the app code calls us, it might be “GET bar\r\n” and so Redis will effectively get “EXIGET bar\r\n”. We could address it by having a write buffer and using the next opportunity to first flush the commands we couldn’t send before, but that risks sending a command that’d be stale according to the app logic, so I’m not sure we’d be following the principle of least astonishment here.len(payload), then we’re in a defined state, but the client should be mindful to skip the response on the next operation. Generalized, a client (who sent the command) should keep a counter of expected responses for that connection, and skip all but the last one.(This is all made complicated by the fact we use
sendallwith a pipeline of commands, but let’s solve for the simple case before we even go there.)If we chose to add an ‘expected responses’ counter, it’d have to be per-connection rather than a single counter on
Redisitself. Architecturally it’d be probably easier to add a property toConnectionthan keep a separate mapping inRedis. But alas I don’t see it addressing the0 < send("command\r\n") < len("command\r\n")scenario, and the low overhead of reconnecting probably trumps the complexity.–
As for PubSub, the situation there is a bit better:
get_messageis probably interruptible, and I could see @kristjanvalur 's surprise at getting disconnected even though the connection is in defined state.0 < send("command\r\n") < len("command\r\n")is addressed there.–
P.S. I went through this with the assumption of interruption points being I/O & yields (in gevent) or awaits (in asyncio). I’m sure a
KeyboardInterruptthat can happen at any opcode boundary brings surprises of its own.