redisson: RLock can only be obtained by single redisson node after failover

When utilizing RLock in a redis cluster, all redisson nodes are able to obtain a lock (as expected). After performing a redis CLUSTER FAILOVER command on the slave within the shard that holds the lock, the only redisson instance able to obtain that lock going forward is the one which held it at the time of the failover. The same problem exists with the base RLock as well as with FairLock and FencedLock.

Redis configuration: 3-shard cluster, 1 slave per shard. The problem occurs on a simple local redis cluster (via redis-cli, grokzen/redis-cluster:latest) and is also reproducible using AWS ElastiCache.

Expected behavior

After redis cluster failover is complete, all redisson clients are still able to obtain an RLock.

Actual behavior

Only the redisson instance which held the lock at the time of failover is ever able to obtain the named lock. Only killing the redisson node which originally held the lock allows other nodes to obtain the lock in the future.

Steps to reproduce or test case

  1. Create 2 instances of a redisson application, have them both attempt to obtain and release the same named RLock
  2. Call “redis-cli CLUSTER FAILOVER” on the SLAVE node of the MASTER which holds the lock

Observe that before step 2, both redisson nodes are able to obtain the lock. Observe that after step 2, only the redisson instance which held the lock at the time of step 2 will ever obtain the lock again.

Redis version: 6.2, 7
Redisson version: 3.19.1

Redisson configuration (rest of the settings are default):

    Config config = new Config();
    ClusterServersConfig clusterServers = config.useClusterServers()
            .setRetryInterval(3000)
            .setTimeout(30000);
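
For context, a minimal sketch of a complete client built from this configuration; the seed node address below is a placeholder for the local grokzen cluster, not taken from the report:

    import org.redisson.Redisson;
    import org.redisson.api.RedissonClient;
    import org.redisson.config.Config;

    Config config = new Config();
    config.useClusterServers()
            .addNodeAddress("redis://127.0.0.1:7000") // placeholder seed node
            .setRetryInterval(3000)                    // ms between retry attempts
            .setTimeout(30000);                        // ms to wait for a command response
    RedissonClient redisson = Redisson.create(config);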

Simple test case to reproduce (run this with 2 different apps).

    LockFailoverFailApp() {
        LOG.info("Starting up test...");
        RLock distributedLock = redissonConnection.getRedisson().getLock("distributed_lock");

        LOG.info("starting main loop in " + this.getClass().getName());

        while (true) {
            try {
                LOG.info("My node ID: {}\tgetting lock, is currently locked: {}",
                        redissonConnection.getRedisson().getId(), distributedLock.isLocked());
                // Wait up to 30 s for the lock; no lease time is passed, so
                // Redisson's watchdog keeps the lock alive while it is held.
                if (!distributedLock.tryLock(30, TimeUnit.SECONDS)) {
                    LOG.info("unable to get lock within 30 sec, will try again");
                    continue;
                }

                LOG.info("My node ID: {}\tobtained lock, beginning sleep to emulate work",
                        redissonConnection.getRedisson().getId());
                Thread.sleep(5000); // emulate work while holding the lock
                LOG.info("My node ID: {}\treleasing lock", redissonConnection.getRedisson().getId());
                distributedLock.unlock();
                LOG.info("released lock");
            } catch (Exception ex) {
                LOG.error("Caught exception", ex);
            }
        }
    }
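
Note that tryLock(30, TimeUnit.SECONDS) above passes only a wait time; with no lease time, Redisson's lock watchdog keeps extending the lock while the owning client is alive. An alternative with an explicit lease time (the 10-second value here is illustrative, not from the report) would be:

    // Wait up to 30 s to acquire; lock auto-expires after 10 s if never unlocked.
    if (distributedLock.tryLock(30, 10, TimeUnit.SECONDS)) {
        try {
            // ... emulate work ...
        } finally {
            distributedLock.unlock();
        }
    }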

The entire repo for this test application is here (please see class LockFailoverFailApp): https://github.com/servionsolutions/redisson-testbed/blob/main/app/src/main/java/org/example/redissonfailover/LockFailoverFailApp.java

I am happy to perform any testing and provide any logs desired. This happens the majority of the time and is very quick and easy to reproduce.

Sample output (screenshot): notice that only one node is able to actually obtain the lock after failover completes.

Output of a simple script calling TTL and HGETALL on the lock via redis-cli (screenshot): notice that after failover, only one redisson node ever gets the lock.
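
The same state can also be inspected from a Redisson client. A minimal sketch, assuming the lock name from the test app and a plain string codec (the helper method name is hypothetical); Redisson stores a non-fair RLock as a hash whose field is "<client-uuid>:<thread-id>" and whose value is the reentrancy count:

    import java.util.Map;
    import java.util.concurrent.TimeUnit;
    import org.redisson.api.RMap;
    import org.redisson.api.RedissonClient;
    import org.redisson.client.codec.StringCodec;

    // Hypothetical helper mirroring the redis-cli TTL/HGETALL loop.
    static void dumpLockState(RedissonClient redisson) throws InterruptedException {
        while (true) {
            long ttlMs = redisson.getKeys().remainTimeToLive("distributed_lock"); // like TTL
            RMap<String, String> lockHash =
                    redisson.getMap("distributed_lock", StringCodec.INSTANCE);
            Map<String, String> holders = lockHash.readAllMap();                  // like HGETALL
            System.out.printf("ttl=%d ms, holders=%s%n", ttlMs, holders);
            TimeUnit.SECONDS.sleep(1);
        }
    }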

DEBUG level logs (trace available on request): app01.log, app02.log

In these test logs, I had to perform the “CLUSTER FAILOVER” command twice to trigger the error, but the vast majority of the time it only takes one FAILOVER command to induce the problem. This is a problem for us because we have a use case where all redisson nodes must be able to obtain a given RLock at some point in time.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 49 (22 by maintainers)

Most upvoted comments

@servionsolutions

By the end of this month. Thank you for testing!

@servionsolutions

Thanks for update.

Please try attached version.

redisson-3.20.1-SNAPSHOT.jar.zip

Resolved.

@servionsolutions

Many thanks for testing!

@servionsolutions

The problem is that for ~5 minutes (from 19:55:55 to 19:20:07), from the beginning of the unlock attempt, and after the app02 record in redis expires via TTL, until the retry completes, both nodes think they hold the lock at the same time.

It happened because the INFO_REPLICATION command was explicitly bound to a single Redis node.

Can you try version attached?

redisson-3.20.1-SNAPSHOT.jar.zip

@servionsolutions

Hm - so perhaps there is another case where it is not being retried?

Can you try version attached?

redisson-3.20.1-SNAPSHOT.jar.zip

@servionsolutions

Thanks much for the testing!

Perhaps the command that received the RedisNodeNotFoundException is not being retried properly?

Yeah, I’m working on it.

@servionsolutions

Thanks for the report. That’s weird. It seems like even if CLUSTERDOWN is thrown, the write operation is still made, but it shouldn’t be.

Can you try attached version?

redisson-3.19.4-SNAPSHOT.jar.zip

Could this be related to observing lock states internally using pub/sub?