ioredis: Failed to refresh slots cache

Hi,

We are running ioredis 4.0.0 against an AWS ElastiCache replication group with cluster mode on (2 nodes, 1 shard) and using the cluster configuration address to connect to the cluster.

client = new Redis.Cluster([config.redis]);

and the config part:

  redis: Config({
    host: {
      env: 'REDIS_HOST',
      ssm: '/elasticache/redis/host'
    },
    port: {
      env: 'REDIS_PORT',
      ssm: '/elasticache/redis/port'
    }
  }),

Every time we start the application, the logs show us:

{
    "stack": "Error: Failed to refresh slots cache.\n    at tryNode (/app/node_modules/ioredis/built/cluster/index.js:371:25)\n    at /app/node_modules/ioredis/built/cluster/index.js:383:17\n    at Timeout.<anonymous> (/app/node_modules/ioredis/built/cluster/index.js:594:20)\n    at Timeout.run (/app/node_modules/ioredis/built/utils/index.js:144:22)\n    at ontimeout (timers.js:486:15)\n    at tryOnTimeout (timers.js:317:5)\n    at Timer.listOnTimeout (timers.js:277:5)",
    "message": "Failed to refresh slots cache.",
    "lastNodeError": {
        "stack": "Error: timeout\n    at Object.exports.timeout (/app/node_modules/ioredis/built/utils/index.js:147:38)\n    at Cluster.getInfoFromNode (/app/node_modules/ioredis/built/cluster/index.js:591:34)\n    at tryNode (/app/node_modules/ioredis/built/cluster/index.js:376:15)\n    at Cluster.refreshSlotsCache (/app/node_modules/ioredis/built/cluster/index.js:391:5)\n    at Cluster.<anonymous> (/app/node_modules/ioredis/built/cluster/index.js:171:14)\n    at new Promise (<anonymous>)\n    at Cluster.connect (/app/node_modules/ioredis/built/cluster/index.js:125:12)\n    at new Cluster (/app/node_modules/ioredis/built/cluster/index.js:81:14)\n    at Object.<anonymous> (/app/node_modules/@myapp/myapp-core/core/redis.js:10:12)\n    at Module._compile (module.js:652:30)",
        "message": "timeout"
    },
    "level": "error"
}

What more information do you need so we can get this fixed?


About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 63 (3 by maintainers)

Most upvoted comments

Please enable debug mode with DEBUG=ioredis:* node yourapp.js and post the logs here.

Here is what I used to get a Redis cluster connection to AWS ElastiCache with auth.

const client = new Redis.Cluster(
      [
        {
          host: process.env.REDIS_HOST,
          port: process.env.REDIS_PORT,
        },
      ],
      {
        slotsRefreshTimeout: 2000,
        dnsLookup: (address, callback) => callback(null, address),
        redisOptions: {
          tls: {},
          password: process.env.REDIS_AUTH_TOKEN,
        },
      },
    )
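A note on the `dnsLookup: (address, callback) => callback(null, address)` line above: ioredis normally resolves the hostnames that the cluster announces before connecting, but with in-transit encryption the ElastiCache TLS certificates are issued for the node hostnames, so handing the address back unchanged keeps certificate verification working. A minimal sketch of the difference (the `fakeDns` table and hostnames below are made-up examples, not real endpoints):

```javascript
// Resolving lookup: maps a hostname to an IP before connecting (roughly
// what a default DNS lookup does). With TLS enabled this can break
// certificate verification, because the certificate is issued for the
// *.cache.amazonaws.com hostname, not for the IP address.
const fakeDns = { 'node-1.example.cache.amazonaws.com': '10.0.0.12' }; // hypothetical

function resolvingLookup(address, callback) {
  callback(null, fakeDns[address] || address);
}

// Identity lookup: hand the announced hostname straight back to ioredis,
// so the TLS handshake still sees the hostname the certificate was issued for.
function identityLookup(address, callback) {
  callback(null, address);
}

identityLookup('node-1.example.cache.amazonaws.com', (err, addr) => {
  console.log(addr); // the hostname is passed through unchanged
});
```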

I am also having this issue with the latest ioredis. I have tried setting slotsRefreshTimeout to 2000 and 5000, adding the cluster endpoints directly, a combination of both, and connecting without the cluster config, but nothing works for me. I am using Redis 5.0.4 on AWS ElastiCache. Has anyone gotten this working? I even tried the redis-clustr package, but it does not work either.

Same problem happening for me as well. ElastiCache engine 6.2 (cluster) and ioredis@4.28.2.

Our services use AWS ElastiCache with cluster mode and auto-failover enabled. The application is deployed as a Docker container in an AWS ECS cluster.

We upgraded ioredis from 2.4.2 to 4.16.1, after which two issues popped up:

1. ClusterAllFailedError: Failed to refresh slots cache.
2. High CPU utilization in the application.

Why didn't these issues happen in ioredis v2.4.2? In that version the cluster instance did not expose the configurable options below, and I do not know what the defaults were.

  • slotsRefreshTimeout
  • slotsRefreshInterval

ioredis v3.1.0+ introduced the configurable cluster option:

  • slotsRefreshTimeout

ioredis v4.0.0-0+ introduced the configurable cluster option:

  • slotsRefreshInterval

Per the ioredis v4.0.0-0 documentation, the default values are slotsRefreshTimeout: 1000 ms and slotsRefreshInterval: 5000 ms.

ClusterAllFailedError: Failed to refresh slots cache — when slotsRefreshInterval elapses, ioredis tries to refresh the slots, and the error occurs because the default timeout for the refresh is too short.

High CPU utilization in the application — the ioredis client tries to refresh the slots every 5 seconds because the default slotsRefreshInterval is 5 seconds (5000 ms), which drove up the application's CPU utilization.

So we configured the values below in our application to fix both issues: slotsRefreshTimeout set to 10 seconds instead of the default 1 second, and slotsRefreshInterval set to 5 minutes instead of the default 5 seconds.

let redisClient = new Redis.Cluster([{
		host: <Host Name>,
		port: <Port Number>
	}],
	{
		slotsRefreshTimeout: 10000,
		slotsRefreshInterval: 5 * 60 * 1000,
		redisOptions: {
			retryStrategy(times) {
				if (!times) times = 2;
				const delay = Math.min(times * 50, 2000);
				console.log(`Retrying ${times} times, delayInMilliSeconds ${delay}`);
				return delay;
			}
		}
	}
);

Hope this information would help you to fix the issue.

I was able to connect to AWS with an increased slotsRefreshTimeout:

const cacheCluster = new IORedis.Cluster(
    [
        {
            host: process.env.elasticache_host,
            port: process.env.elasticache_port,
        },
    ],
    {
        slotsRefreshTimeout: 2000,
        redisOptions: {
            tls: {},
            password: process.env.elasticache_password,
        },
    }
);

Elasticache engine: Clustered Redis
Elasticache engine version: 6.2.5
ioredis version: 4.28.3
Cluster: 3 shards, 9 nodes
Redis Auth: None

I have one cluster in two different environments that are nearly the same. One cluster has encryption at-rest and in-transit disabled. I have been able to connect to this cluster just fine with:

const connection = new Redis.Cluster([process.env.REDIS_CONFIG_ENDPOINT], {
  enableReadyCheck: false,
  maxRetriesPerRequest: null,
})

However, the other cluster with encryption enabled fails to establish a connection with this same code and displays the ClusterAllFailedError: Failed to refresh slots cache error. Interestingly, if I instead pass every Redis node in the cluster (or some of them) to the Redis.Cluster constructor as an array of { host, port } objects, it establishes a connection just fine.

I wanted to take a step back to see if encryption was the issue, so I re-provisioned this cluster with encryption disabled and, sure enough, it works.

We can still see this issue with ElastiCache engine 6.2 (cluster) and ioredis@4.28.2. Any workarounds?

@visitsb First of all, thanks for clarifying this. I have one question: for cluster mode disabled, do you mean having two separate clients and manually configuring reads and writes? For example:

const writer = new Redis({
  host: process.env.REDIS_PRIMARY_ENDPOINT,
})

const reader = new Redis({
  host: process.env.REDIS_READER_ENDPOINT,
})

and then

module.exports = {
    getAsync: reader.get.bind(reader),
    setAsync: writer.set.bind(writer),
}

Is this right?

@MatteoGioioso Yes, that is correct. The below blog has this outlined as well as the docs.

Redis Cluster mode: Use Cluster-aware Redis clients and connect to the cluster using the configuration endpoint. This allows the client to automatically discover the shard and slot mappings. Redis Cluster mode also provides online resharding (scale in/out) for resizing your cluster, and allows you to complete planned maintenance and node replacements without any write interruptions. The Redis Cluster client can discover the primary and replica nodes and appropriately direct client-specific read and write traffic.

Non-Redis Cluster mode: Use the primary endpoint for all write traffic. During any configuration changes or failovers, Amazon ElastiCache ensures that the DNS of the primary endpoint is updated to always point to the primary node. Use the reader endpoint to direct all read traffic. Amazon ElastiCache ensures that the reader endpoint is kept up-to-date with the cluster changes in real time as replicas are added or removed. Individual node endpoints are also available but using reader endpoint frees up your application from tracking any individual node endpoint changes. Hence, it’s best to use primary endpoint for writes and single reader endpoint for reads.

Configuring the Redis client https://aws.amazon.com/blogs/database/configuring-amazon-elasticache-for-redis-for-higher-availability/
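To sketch the reader/writer split concretely: one pitfall in the snippet above is that plain method references like `reader.get` lose their `this` binding when exported, so they should be bound. A hedged sketch, with the two clients passed in explicitly so the wiring is easy to see (the endpoint variable names in the comment are placeholders, not a definitive implementation):

```javascript
// Cluster-mode-disabled setup: writes go to the primary endpoint, reads go
// to the reader endpoint. makeCache takes the two clients as arguments so
// the wiring is explicit (and easy to stub in tests).
function makeCache(reader, writer) {
  return {
    // .bind keeps `this` pointing at the right client when these
    // functions are passed around or exported from a module.
    getAsync: reader.get.bind(reader),
    setAsync: writer.set.bind(writer),
  };
}

// In a real app these would be ioredis clients, e.g.:
//   const writer = new Redis({ host: process.env.REDIS_PRIMARY_ENDPOINT, port: 6379 });
//   const reader = new Redis({ host: process.env.REDIS_READER_ENDPOINT, port: 6379 });
//   module.exports = makeCache(reader, writer);
```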

@ion-willo your configuration appears to be with Cluster mode disabled. In that case, you should directly connect to the Reader Endpoint using new Redis(...) and not use new Redis.Cluster(...). That will do the split for you to evenly read from all read replicas. If you want to update the cache, then using Primary Endpoint is the correct one to use for write operations.

This information is available on the docs too- https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Endpoints.html

You can try setting up another ElastiCache Redis cluster but make sure to select Cluster Mode enabled (Scale Out). Then try with the Configuration Endpoint using new Redis.Cluster(...) and you should be good to go.

Tip: As @luin suggested in some other replies, you can try to connect to redis using redis-cli on a temporary EC2 instance within the same VPC, Security Group as your ElastiCache Redis. Then try to issue a CLUSTER SLOTS command. This command will work only if you use Configuration Endpoint available in Cluster Mode enabled. But this will not work on a Primary Endpoint nor Reader Endpoint since you are directly talking to a single redis node.

Hope this helps.

Works 👍

  • ElastiCache Redis in Cluster Mode
      Engine Version Compatibility: 5.0.6
      Encryption in-transit: yes (hence tls: true is needed in redisOptions, shown in the Lambda snippet below)
      Nodes: 9
      Shards: 3
      Primary Endpoint: not applicable (since this is a cluster, the Configuration Endpoint is the relevant one)

  • My Lambda runs on Node.js 12.x, ioredis@4.16.1

Lambda code used

const Redis = require('ioredis');
const client = new Redis.Cluster([{ 
    host: process.env.REDIS_CLUSTER_CONFIGURATION_ENDPOINT 
}], {
    dnsLookup: (address, callback) => callback(null, address),
    redisOptions: {
        tls: true,
        password: process.env.REDIS_CLUSTER_AUTH
    }
});

exports.handler = async (event) => client.ping('Hello from ElastiCache Redis');

I get the below response for my Lambda execution

Response:
"Hello from ElastiCache Redis"

Request ID:
"769caff3-159c-41df-932b-e9134eaf8f6a"

Function Logs:
START RequestId: 769caff3-159c-41df-932b-e9134eaf8f6a Version: $LATEST
END RequestId: 769caff3-159c-41df-932b-e9134eaf8f6a
REPORT RequestId: 769caff3-159c-41df-932b-e9134eaf8f6a	Duration: 483.84 ms	Billed Duration: 500 ms	Memory Size: 128 MB	Max Memory Used: 76 MB	Init Duration: 210.92 ms

Tip:

  • Ensure your ElastiCache and Lambda both use the same VPC and Security Group. Otherwise your Lambda will keep failing on timeouts (including the Failed to refresh slots cache error) since it cannot connect to ElastiCache at all. Needless to say, slotsRefreshTimeout was unnecessary in my case.
  • REDIS_CLUSTER_CONFIGURATION_ENDPOINT, REDIS_CLUSTER_AUTH passed as environment variables to Lambda. REDIS_CLUSTER_AUTH can be empty if no redis AUTH is set on ElastiCache Redis cluster.
  • I’ve used a Lambda Layer for the ioredis dependency in my snippet, so I can update to newer versions of ioredis without breaking my Lambda unintentionally.

Hope this helps!

I need help drilling down into the following issue. Please see the following information for more context.

ClusterAllFailedError: Failed to refresh slots cache.
at tryNode (/var/task/node_modules/ioredis/built/cluster/index.js:396:31)
at /var/task/node_modules/ioredis/built/cluster/index.js:413:21
at Timeout.<anonymous> (/var/task/node_modules/ioredis/built/cluster/index.js:671:24)
at Timeout.run (/var/task/node_modules/ioredis/built/utils/index.js:156:22)
at listOnTimeout (internal/timers.js:556:17)
at processTimers (internal/timers.js:497:7) {
lastNodeError: Error: timeout
at Object.timeout (/var/task/node_modules/ioredis/built/utils/index.js:159:38)
at Cluster.getInfoFromNode (/var/task/node_modules/ioredis/built/cluster/index.js:668:55)
at tryNode (/var/task/node_modules/ioredis/built/cluster/index.js:402:19)
at Cluster.refreshSlotsCache (/var/task/node_modules/ioredis/built/cluster/index.js:421:9)
at /var/task/node_modules/ioredis/built/cluster/index.js:192:22
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at runNextTicks (internal/process/task_queues.js:66:3)
at listOnTimeout (internal/timers.js:523:9)
at processTimers (internal/timers.js:497:7)
}

I’m facing the same issue: a new Lambda version release started throwing a massive number of errors.

ioredis version 4.27.6 AWS Lambda nodejs12.x Redis 6.0.5

The AWS Redis cluster metrics look good and healthy, which AWS tech support confirmed too. I couldn’t root-cause the issue.

Redis Cluster initialization:

this.client = new redis.Cluster([redisConfig], {
	dnsLookup: (address, callback) => callback(null, address),
	slotsRefreshTimeout: 5000,
	slotsRefreshInterval: 1 * 60 * 1000,
});

We were having intermittent issues with this error, using an ElastiCache Redis cluster and Lambdas. After trying some different config settings, we settled on this one, which has been working 100%, with no more intermittent slot-refresh errors.

{
    scaleReads: 'slave',
    lazyConnect: true,
    slotsRefreshInterval: 2 ** 31 - 1,
    slotsRefreshTimeout: 10000,
    enableOfflineQueue: false,
    dnsLookup: (address, callback) => callback(null, address),
    enableReadyCheck: true,
    redisOptions: {
      connectTimeout: 30000,
      retryStrategy: (times) => {debug({ times }); if (times > 10) return false; return 2 ** times; },
      dropBufferSupport: true,
      keepAlive: 10,
      enableOfflineQueue: false,
      showFriendlyErrorStack: process.env.NODE_ENV !== 'production',
      disconnectTimeout: 1000,
    },
  }

We essentially disabled the periodic refreshing of the slots cache since we only use each connection for a short period and any issues that would arise are less severe than the intermittent timeouts. Setting a high timeout value for connections and the initial slots cache fetching isn’t ideal but we haven’t had any significant spikes in runtime.

We re-use the same Cluster instance within each unique Lambda invocation by caching the instances by awsRequestId, which ensures instances aren’t re-used across invocations. We set lazyConnect and call connect manually, then await the ready event before using the instance. We have some retry logic of our own wrapped around this connection. We also ensure that every instance is disconnected at the end of every Lambda invocation by calling disconnect if the instance is still in the ready state.

Since implementing this setup we dropped the avg currConnections value per min down to <1% of what it was previously, which is what seemed to be the main cause of the intermittent issues.
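The lifecycle described above could be sketched roughly like this; `createCluster` stands in for the real `new Redis.Cluster(...)` factory, and all names here are illustrative rather than the actual implementation:

```javascript
// Cache one client per Lambda invocation, keyed by awsRequestId, so the
// same invocation reuses its connection but nothing leaks across invocations.
const clients = new Map();

function getClient(awsRequestId, createCluster) {
  if (!clients.has(awsRequestId)) {
    clients.set(awsRequestId, createCluster());
  }
  return clients.get(awsRequestId);
}

// Call at the end of every invocation: disconnect and forget the client.
function releaseClient(awsRequestId) {
  const client = clients.get(awsRequestId);
  if (client) {
    client.disconnect();
    clients.delete(awsRequestId);
  }
}
```

With lazyConnect set on the cluster options, the handler would call client.connect(), wait for the ready event, run its commands, and call releaseClient(awsRequestId) in a finally block.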

I found the solution 😄

Since AWS ElastiCache provides a single endpoint to connect to the cluster, we cannot use new Redis.Cluster([{ host: '', port: '' }, { host: '', port: '' }]) and also cannot use new Redis(ENDPOINT),

but I manage to get a connection by

new Redis.Cluster([`${redisAddress}:${redisPort}`], { scaleReads: 'slave' });

Just noticed that this fix will not work on Redis engine 5.0 on ElastiCache. Downgrade to 4.0.10 to make it work.

It works for me when I changed slotsRefreshTimeout to 5000.

Any updates on this? I’m still experiencing this issue with Redis version 5.0.5 on AWS ElastiCache Clustered, ioredis version: 4.16.1

Edit: Previously said I was using 4.16.0 version of ioredis while it was actually 4.16.1.

Still happening for me as well. Elasticache engine 5.0.3 (cluster) and ioredis@4.9.0

I think I found something that will help. I have been passing the single Configuration Endpoint URL to Redis.Cluster() (as the original author of this issue has been doing).

{ host: 'clustercfg.mycluster.usw2.cache.amazonaws.com', port: 6379 }

I tried instead passing the endpoints directly and it appears to work.

const nodes = [
  { host: 'my-cluster-0001-001.xxx.usw2.cache.amazonaws.com', port: 6379 },
  { host: 'my-cluster-0001-002.xxx.usw2.cache.amazonaws.com', port: 6379 },
  { host: 'my-cluster-0001-003.xxx.usw2.cache.amazonaws.com', port: 6379 },
]

I obtained the above by running the following (I only have one cluster with three nodes).

aws elasticache describe-cache-clusters --show-cache-node-info | jq -r '.CacheClusters[].CacheNodes[].Endpoint'

Funny thing is that it (Redis.Cluster) sometimes does work with the configuration endpoint; it just seems to have some race condition that prevents it from always working.

Out of curiosity I tried using new Redis() passing the cluster configuration endpoint and got:

ReplyError: MOVED 11958 my-cluster-0001-001.xxx.usw2.cache.amazonaws.com:6379
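For context, that MOVED reply is the cluster node telling the client which node owns hash slot 11958. A cluster-aware client like Redis.Cluster parses the redirection and retries the command on the node it names, which a plain new Redis(...) connection does not do. A minimal sketch of that parsing step (the reply format follows the Redis Cluster specification; this is an illustration, not ioredis internals):

```javascript
// Parse a Redis Cluster MOVED redirection error, e.g.
//   "MOVED 11958 my-cluster-0001-001.xxx.usw2.cache.amazonaws.com:6379"
// A cluster-aware client uses this to retry the command on the right node.
function parseMovedError(message) {
  const match = /^MOVED (\d+) (.+):(\d+)$/.exec(message);
  if (!match) return null;
  return { slot: Number(match[1]), host: match[2], port: Number(match[3]) };
}
```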