ioredis: Cluster: Fails to reconnect to node

I’m running a simple test with 3 masters + 1 slave per master. If I take down one of the slaves and restart it, ioredis fails to reconnect to that node.

  ioredis:redis status[127.0.0.1:6421]: ready -> close +7s
  ioredis:connection skip reconnecting because `retryStrategy` is not a function +0ms
  ioredis:redis status[127.0.0.1:6421]: close -> end +0ms
  Received -node: "127.0.0.1:6421"

Note, however, that retryStrategy is a function. I’ve reproduced this failure 1) with no options supplied to new Cluster(), 2) with options, but without setting options.redisOptions.retryStrategy, and 3) with options, setting options.redisOptions.retryStrategy to function() { return 3000; }.
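For reference, case 3 was created along these lines (the startup node address here is illustrative, not my exact code):

    const Redis = require('ioredis');

    // Cluster client with an explicit per-node retryStrategy (case 3 above).
    const cluster = new Redis.Cluster(
      [{ host: '127.0.0.1', port: 6421 }],
      {
        redisOptions: {
          // Retry a dropped node connection every 3 seconds.
          retryStrategy: function () {
            return 3000;
          },
        },
      }
    );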

Even long after the node has been restarted and has rejoined the cluster (verified with redis-cli cluster slots and cluster nodes), ioredis continues to report cluster.nodes().length as 5, i.e. the restarted node is never added back to its view of the cluster.
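For what it’s worth, this is roughly how I’m observing the node count from the client (a sketch; cluster is the client created above):

    // Even after the slave has rejoined (per redis-cli cluster nodes),
    // ioredis still reports only 5 of the 6 nodes:
    console.log(cluster.nodes().length);        // 5
    console.log(cluster.nodes('slave').length); // 2 instead of 3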

So the only way to recover from this very basic failure mode is for the application to monitor the ‘-node’ event and destroy and re-create the client (a sketch of that workaround follows below). This is a serious defect.
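A rough sketch of that workaround; createClusterClient() is a hypothetical factory wrapping new Redis.Cluster(...), not a real API:

    // Workaround sketch: recycle the whole Cluster client whenever a node is
    // dropped, so the full topology is re-discovered by the new client.
    let cluster;

    function startCluster() {
      cluster = createClusterClient(); // hypothetical factory around new Redis.Cluster(...)
      cluster.on('-node', () => {
        console.warn('lost a node; recycling the cluster client');
        cluster.disconnect();
        startCluster();
      });
    }

    startCluster();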

I would have hoped that ioredis periodically monitored the cluster configuration by issuing a cluster slots command. Without something like that, it is clearly impossible for ioredis to detect slaves that are added to the cluster in an expansion scenario.
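The closest I can do from the application today is something like the following sketch. Note that refreshSlotsCache() is an internal, undocumented ioredis method, so this is an assumption about the internals rather than supported API; it only illustrates the behaviour I would expect the library itself to provide:

    // Sketch: periodically re-fetch the slot/topology information so that
    // restarted or newly added nodes are picked up even when no command ever
    // triggers a MOVED redirection.
    setInterval(() => {
      // refreshSlotsCache() is internal ioredis API, not documented.
      cluster.refreshSlotsCache((err) => {
        if (err) {
          console.warn('periodic cluster slots refresh failed:', err.message);
        }
      });
    }, 15000); // the 15s polling interval is arbitrary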

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 1
  • Comments: 25 (1 by maintainers)

Most upvoted comments

@shaharmor, thanks for the response. I could try a slot change, but I believe that too is not detected immediately, only when an operation results in a MOVED response.

Note that I’ve raised a couple of other related issues regarding how ioredis manages cluster connections. In this particular issue there is no real change in the cluster topology: one of the nodes failed, and ioredis decided never to attempt to reconnect to that node. If you query any node in the cluster, the temporarily failed node was never removed from the cluster, yet ioredis decides it is gone forever. In this scenario ioredis should not unilaterally remove the node from its view of the cluster; it should keep retrying the connection to that node.

As I noted above, the error may be related to the following debug output:

  ioredis:connection skip reconnecting because `retryStrategy` is not a function +0ms

yet in my configuration of ioredis, retryStrategy IS a function. I suspect there is an error in how the options passed to Cluster() are handled.

We have major problems with Redis not reconnecting when Redis Labs does maintenance on our cluster; I believe this is the underlying issue. Why hasn’t any progress been made in 4 years?

So I think the discussion has gotten off the topic of the original issue. I did a bit more digging and found that in cluster mode, if a node disconnects (e.g., the redis server restarts), there is explicitly no attempt to reconnect to that node, even if the client specified a retryStrategy in options.redisOptions. Specifically, in connection_pool.js:

    redis = new Redis(_.defaults({
      // Never try to reconnect when a node is lose,
      // instead, waiting for a `MOVED` error and
      // fetch the slots again.
      retryStrategy: null,

I believe this is wrong. The connection could be lost due to a simple node restart (machine restart, redis server upgrade, etc.). In these scenarios there would never be a MOVED error, and so we never reconnect to this node.

Further, the code overrides the client-specified options and forces offline queuing to be enabled:

      // Offline queue should be enabled so that
      // we don't need to wait for the `ready` event
      // before sending commands to the node.
      enableOfflineQueue: true,

It’s not clear why this needs to be overridden. In my use case, because I am using Redis as a true cache, I have special handling for when operations that update the cache can’t complete. I may have other clients also writing to the cache, and if those commands get queued up and then later dequeued and executed, they could run out of order, leaving the cache with incorrect data.

The fact that these two options are being overridden is not documented. I’d like to see these overrides removed.
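For illustration, the kind of change I have in mind would turn these hard-coded values into defaults that a caller-supplied redisOptions can override. This is only a sketch; the surrounding names (node, this.redisOptions) are my assumptions about connection_pool.js, not verified against the actual source:

    // Hypothetical rework: options supplied by the caller win; the current
    // behaviour is kept only as a fallback when nothing was specified.
    redis = new Redis(_.defaults({}, node, this.redisOptions, {
      retryStrategy: null,      // default only; a caller-supplied retryStrategy now takes effect
      enableOfflineQueue: true, // default only; a caller can disable offline queuing
    }));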