redis-py: BaseException at I/O corrupts Connection

tl;dr repeat of https://github.com/redis/redis-py/issues/360

My codebase uses gevent (w/monkey-patching) for concurrent workers, and schedules a gevent.Timeout to force a deadline on the workers. Since a timeout causes an exception to be raised from an arbitrary “yield point”, there’s risk of corrupting shared state if code is not written to be exception-safe.

The code talks to redis over a client that’s shared between the workers. Socket I/O is a “yield point”. Sending a command successfully, but then failing to read the entire response off the socket, gets Connection into an undefined state: a subsequent command would try to parse a response intended for a previous command, and in many cases would fail. Here’s what a typical error would look like:

  File "site-packages/redis/client.py", line 3401, in parse_response
    result = Redis.parse_response(
  File "site-packages/redis/client.py", line 774, in parse_response
    return self.response_callbacks[command_name](response, **options)
  File "site-packages/redis/client.py", line 528, in <lambda>
    'SET': lambda r: r and nativestr(r) == 'OK',
  File "site-packages/redis/_compat.py", line 135, in nativestr
    return x if isinstance(x, str) else x.decode('utf-8', 'replace')
AttributeError: 'int' object has no attribute 'decode'

As you can see, we’re trying to parse an integer response to a previous command (e.g. ":12345\r\n") as a string response to a SET command.

The except Exception block in redis.connection.Connection.read_response is intended to handle all sorts of parsing errors by disconnecting the connection: https://github.com/redis/redis-py/blob/6219574b042a6596b150ca8248441198f01f8c87/redis/connection.py#L819-L821 but perhaps it could be changed to except: since e.g. gevent.Timeout is intentionally a BaseException to suggest that it should not be suppressed (and we are indeed not suppressing it).

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 52 (36 by maintainers)

Most upvoted comments

Digging through history:

  • this was first reported in #360 in 2013
  • then regressed in 3.0.0
  • then again reported in #1128 and fixed in #1129 (cc @Chronial), released in 3.2.0
    • but alas without a regression test, which didn’t add any controls for it being reverted
  • then #2103 got reported
  • then got reverted in #2104 (cc @kristjanvalur), released in 4.4.0rc2
    • Apparent motivation: pubsub without disconnects (though PubSub.on_connect seems to re-establish subscriptions, but there’s probably a brief loss)

@Chronial Definitely an attention-grabbing post mortem these days 😄

When it happened back in our servers at Lyft’s Bikes & Scooters, it was very intermittent — it took maybe 1 day for a single python node to get into a broken state (for a single connection in a pool, so even the same node didn’t always exhibit a problem), and often someone would term that node to “fix” it, and move on.

Then one day I took a deep dive, pinned py-redis < 4.4 and added a local regression test to prevent a kind soul from upgrading blindly, but not even a post mortem…

First, let’s agree that all things being equal, we’d rather execute the least code possible to handle a BaseException. If a complex recovery is necessary, it’s best to “earmark” that connection as needing extra care, rather than perform that extra care in the error handler. Overall, writing interrupt-safe code is hard.

The RESP protocol, while elegantly simple, doesn’t seem to address cancellations. It’s like the “touch-move” rule in chess.

When a send is interrupted:

  • If it returns -1, it means nothing was sent over the wire and we’re in a defined state.
  • If it returns 0 < n < len(payload), then we’re in undefined state. There’s no way to signal a proverbial “^C” over RESP. For example, if we meant to send “EXISTS foo\r\n”, we might’ve only sent “EXI”, and next time the app code calls us, it might be “GET bar\r\n” and so Redis will effectively get “EXIGET bar\r\n”. We could address it by having a write buffer and using the next opportunity to first flush the commands we couldn’t send before, but that risks sending a command that’d be stale according to the app logic, so I’m not sure we’d be following the principle of least astonishment here.
  • If it returns len(payload), then we’re in a defined state, but the client should be mindful to skip the response on the next operation. Generalized, a client (who sent the command) should keep a counter of expected responses for that connection, and skip all but the last one.

(This is all made complicated by the fact we use sendall with a pipeline of commands, but let’s solve for the simple case before we even go there.)

If we chose to add an ‘expected responses’ counter, it’d have to be per-connection rather than a single counter on Redis itself. Architecturally it’d be probably easier to add a property to Connection than keep a separate mapping in Redis. But alas I don’t see it addressing the 0 < send("command\r\n") < len("command\r\n") scenario, and the low overhead of reconnecting probably trumps the complexity.

As for PubSub, the situation there is a bit better:

  • The PubSub RESP is designed in such a way that the responses “echo” the request, so you can tell if the response pertains to your request or not. This is obviously required since with PubSub you’d be getting pushed messages in addition to responses.
  • For the same reason, get_message is probably interruptible, and I could see @kristjanvalur 's surprise at getting disconnected even though the connection is in defined state.
  • I’m still not seeing how 0 < send("command\r\n") < len("command\r\n") is addressed there.

P.S. I went through this with the assumption of interruption points being I/O & yields (in gevent) or awaits (in asyncio). I’m sure a KeyboardInterrupt that can happen at any opcode boundary brings surprises of its own.