DistributedLock: ZooKeeperNetEx connection loss issue: an acquired lock does not seem to be released
I've implemented a lock with ZooKeeper using this configuration:
- DistributedLock.ZooKeeper - Version="1.0.0"
- dotnet version 6.0
- Hosted on K8s (one pod, there is no concurrent request)
- ZooKeeper server configuration on K8s:

```yaml
version: "3.9"
services:
  zk1:
    container_name: zk1
    hostname: zk1
    image: bitnami/zookeeper:3.8.0-debian-11-r57
    ports:
      - 2181:2181
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
      - ZOO_SERVER_ID=1
      - ZOO_SERVERS=0.0.0.0:2888:3888
      - ZOO_MAX_CLIENT_CNXNS=500
```
There are several worker services inside the application, each of them working with a different lock key.
Each worker periodically tries to acquire its lock and do some processing. They seem to work without problems, but after a while I get this exception:
```
Locking failed. Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
```
It seems the lock cannot be acquired because it has not been released, although there is no concurrent request for the lock key.
The LockService code in .NET:
```csharp
private TimeSpan _connectionTimeoutInSecond = TimeSpan.FromSeconds(30);
private TimeSpan _waitingForLockInSecond = TimeSpan.FromSeconds(30);

public async Task<LockProcessResult> DoActionWithLockAsync(string lockKey, Func<Task> func)
{
    var processResult = new LockProcessResult();
    try
    {
        var @lock = new ZooKeeperDistributedLock(lockKey, _configuration.ConnectionString, opt =>
        {
            opt.ConnectTimeout(_connectionTimeoutInSecond);
        });
        await using (var handle = await @lock.TryAcquireAsync(timeout: _waitingForLockInSecond))
        {
            if (handle != null)
            {
                // I have the lock
                await func();
            }
            else
            {
                processResult.SetException(new LockAcquisitionFailedException(lockKey));
            }
        }
    }
    catch (Exception ex)
    {
        // I get the exceptions here
        processResult.SetException(ex);
    }
    return processResult;
}
```
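Since the connection loss reported later in this thread appears transient ("it will be gone in a minute"), one common mitigation while the root cause is investigated is to retry the whole acquire-and-run operation with backoff. Below is a minimal sketch that wraps the `DoActionWithLockAsync` method above. The retry counts and delays are illustrative assumptions, and it assumes `LockProcessResult` exposes the stored exception via an `Exception` property (the thread only shows `SetException`):

```csharp
using System;
using System.Threading.Tasks;
using org.apache.zookeeper; // ZooKeeperNetEx namespace (KeeperException)

public static class LockRetry
{
    // Retries the lock-protected action when ZooKeeper reports a transient
    // connection loss. maxAttempts and baseDelay are assumed values.
    public static async Task<LockProcessResult> WithRetryAsync(
        Func<Task<LockProcessResult>> doActionWithLock,
        int maxAttempts = 3,
        TimeSpan? baseDelay = null)
    {
        var delay = baseDelay ?? TimeSpan.FromSeconds(2);
        for (var attempt = 1; ; attempt++)
        {
            var result = await doActionWithLock();

            // Only retry on ConnectionLossException; any other failure
            // (or success) is returned to the caller immediately.
            if (result.Exception is not KeeperException.ConnectionLossException
                || attempt >= maxAttempts)
            {
                return result;
            }

            // Exponential backoff: 2s, 4s, 8s, ... between attempts.
            await Task.Delay(delay * Math.Pow(2, attempt - 1));
        }
    }
}
```

A caller would then replace `await DoActionWithLockAsync(key, func)` with `await LockRetry.WithRetryAsync(() => DoActionWithLockAsync(key, func))`. This does not fix the underlying drop, but it keeps a briefly unreachable ZooKeeper from failing the worker's whole cycle.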
I'd appreciate any suggestions.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 50 (22 by maintainers)
Unfortunately we are still getting Connection Loss sometimes, but it goes away within a minute.
Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
@madelson we have just tested your change locally and in a K8s cluster, and that code change fixed the issue - could you open a PR for this change against the main repo?
No problem, I'll follow that approach, but unfortunately I don't have access to the server right now. I'll keep you posted as soon as I apply the changes.
@devlnull @MajeQafouri I thought it might make sense for me to add some additional verbose logging to the underlying ZooKeeper .NET package; then you could test your apps with the additional logging and we can see if that helps point out the root cause of the issue. Would either/both of you be willing to test in this way?
@devlnull excellent. Thanks for testing! Feel free to use the prerelease version for now and keep me posted on any issues you encounter. The only change is the swap to the alternative ZooKeeperNetEx package.
If @MajeQafouri also comes back with a good test result I’ll publish it as a stable version.
Hi, we were finally able to test the alpha package. I don't know if it helps, but we tested it and compared it with the old package; the result was acceptable: no connection loss anymore. Good luck.
Sorry for the late answer; unfortunately I'm swamped these days and couldn't test the new package. BTW, thanks for the effort. As soon as I manage to test it, I'll keep you posted.
@MajeQafouri I’m looking into whether we can work around this problem within DistributedLock itself.
I tried using the ZooKeeper library directly (https://www.nuget.org/profiles/shayhatsor2) and faced the same issue in the dockerized environment. I then traced the NuGet package code and got an exception in the WriteLock method. Maybe the author can help us: @shayhatsor
Thanks, I’ll check it.
Have you checked your ZooKeeper? Made sure that there aren't any pod/container restarts, and checked its logs for indications of why it would drop a connection?