etcd-java: Lease expiration managed by the PersistentLease?
Hello! I’m seeing spurious expiration of leases managed by PersistentLease, even though there is no network partition or hardware overload.
The problem is that, sometimes, all leases managed by PersistentLease instances expire on the etcd server side and never become active again (or get re-granted) until the etcd server restarts. When the issue hits, a persistent lease instance is not notified of the LeaseState.EXPIRED state even though the lease has actually expired at the etcd server. Interestingly, an EXPIRED event is fired, immediately followed by an ACTIVE event, once the client reconnects to the restarted etcd server.
I believe (from some observation and code inspection) that the persistent lease monitors the lease state, re-creates an expired (not closed) lease, and exposes its id through PersistentLease.getLeaseId() once the lease is renewed. So I send a TTL request to make sure the lease is OK to be attached to an entity (the "if ttl > 0" part). Here’s roughly what I’m doing to create/refresh a PersistentLease-tied entity.
long getValidLease(PersistentLease lease) {
    long validLease = -1;
    // lightly spin until I get a valid ttl response and id.
    // normally the body gets executed exactly once.
    do {
        // omitted: throw if the lease is CLOSED
        // since lease.getLeaseId() does not guarantee the validity of the lease id,
        // I chose to use a direct TTL request to query its state.
        ttlResp = etcdLease.ttl(lease.getLeaseId()); // lease id is updated by the event loop
        if (ttlResp.getTTL() > 0)
            validLease = ttlResp.getID();
    } while (lease.getCurrentTtlSecs() < 1); // also gets updated by the event loop
    return validLease;
}

long count(ByteString key) {
    return etcdKV.get(key).countOnly().async()
            .get(1, TimeUnit.SECONDS).getCount(); // 1 second timed wait-and-get
}

// operation PUT
long validLease = getValidLease(persistentLease);
etcdKV.put(key, data, validLease);

// operation REFRESH
if (count(key) == 0) {
    PUT_OPERATION(key, data); // put operation right above
}
All the entities (not many, < 20) get refreshed every 5 seconds. But after the spurious lease expiration, all operations either hang in the do-while loop in getValidLease or get expired lease ids from it, and the following operations fail because the given lease id has already expired.
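One way to keep callers from hanging in such a loop is to bound the wait and pause between probes. The sketch below only illustrates that idea under stated assumptions: BoundedLeaseWait and awaitValidLease are hypothetical names, not part of etcd-java, and the ttlProbe function stands in for whatever TTL request the application already issues. It treats a lease id of 0 as "no lease" and a non-positive TTL as "not valid".

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;
import java.util.function.LongUnaryOperator;

/** Hypothetical helper for illustration; not part of etcd-java. */
final class BoundedLeaseWait {
    private BoundedLeaseWait() {}

    /**
     * Polls until the probe reports a positive TTL for the current lease id,
     * or until the timeout elapses.
     *
     * @param leaseId       supplies the current lease id (assumed: 0 means "no lease")
     * @param ttlProbe      maps a lease id to its remaining TTL in seconds (assumed stand-in
     *                      for a TTL request; a value <= 0 means expired or unknown)
     * @param timeoutMillis upper bound on the total wait
     * @param pauseMillis   pause between probes, so the server is not hammered
     * @return a lease id with a positive TTL, or -1 if none was seen in time
     */
    static long awaitValidLease(LongSupplier leaseId, LongUnaryOperator ttlProbe,
                                long timeoutMillis, long pauseMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            long id = leaseId.getAsLong();
            if (id != 0 && ttlProbe.applyAsLong(id) > 0) {
                return id; // lease looks valid right now
            }
            if (System.nanoTime() - deadline >= 0) {
                return -1; // give up instead of spinning forever
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return -1;
            }
        }
    }
}
```

A caller would pass the persistent lease's id getter as leaseId and its own TTL lookup as ttlProbe, then treat -1 as a failed refresh rather than blocking the other entities. This does not address why the lease stops being renewed; it only limits the damage while that is investigated.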
The etcd server looks OK: at that moment the etcd debug log shows that TTL requests arrive and get answered at a very high rate (due to the do-while loop), and further requests from other clients (like the etcdctl shipped with the server distribution) are handled properly. Even granting a new lease from the same etcd-java client and making it persistent succeeds! It seems that the internal gRPC client and event loop assigned to a persistent lease fail to handle responses from the server for some reason.
The issue appears randomly, regardless of the server load. As mentioned earlier, one simple fix is to restart the etcd server. After etcd-java reconnects to the restarted server, all the operations work as expected again.
The etcd server (single-instance configuration) is deployed in a small testbed, and a Spring Boot application using etcd-java runs on the same host, which means the client connects to the etcd server using localhost as the address.
Is there any recommended way of dealing with the validity of a persistent lease, or am I missing something crucial?
About this issue
- State: closed
- Created 5 years ago
- Comments: 21 (11 by maintainers)
Thanks @hsyhsw, no need to include a jar, just (preferably minimal) source code, e.g. just a class with main method would be great.
Great, thanks @hsyhsw! (though I know it took a long time to show up last time you tried so maybe it’s not definite yet…)
@hsyhsw great! Thanks again and fingers crossed…
@njhill yep. I can test with the upgraded lib and will let you know. It may take some time because I have vacation plans next week. Happy May Day!
@njhill Thanks for your effort. The while loop repeatedly granting and closing the lease was just to accelerate(?) a reproduction. It is reproducible without it. I think the PersistentLeaseKey is not applicable, since putting a key-value entity involves some transactions, not simply maintaining a lease-kv relationship. (Also, it is reproducible without transactions.) I applied your suggestion from the code snippet. Thanks!
Currently I’m just periodically checking the lease states and re-granting if needed as a workaround. This is working fine for now, but the issue should be resolved anyway… Hopefully it will be investigated and fixed in the next release. Thanks again for your effort!
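The workaround code itself is not shown in the thread. As a rough sketch of what "periodically checking the lease states and re-granting if needed" could look like, the example below uses a hypothetical LeaseOps interface as a stand-in for the application's own lease calls; neither LeaseOps nor LeaseWatchdog is part of etcd-java, and keeping the re-granted lease alive and re-attaching keys to it are left to the caller.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the lease calls the application already makes. */
interface LeaseOps {
    /** Remaining TTL in seconds; a value <= 0 means expired or unknown. */
    long ttlSecs(long leaseId);

    /** Grants a fresh lease with the given TTL and returns its id. */
    long grant(long ttlSecs);
}

/** Hypothetical watchdog for illustration; not part of etcd-java. */
final class LeaseWatchdog {
    private final LeaseOps ops;
    private final long ttlSecs;
    private final AtomicLong currentId;

    LeaseWatchdog(LeaseOps ops, long initialLeaseId, long ttlSecs) {
        this.ops = ops;
        this.ttlSecs = ttlSecs;
        this.currentId = new AtomicLong(initialLeaseId);
    }

    /** The lease id that new writes should use. */
    long currentLeaseId() {
        return currentId.get();
    }

    /** Runs one check; returns true if the lease had expired and was re-granted. */
    boolean checkOnce() {
        if (ops.ttlSecs(currentId.get()) > 0) {
            return false; // still alive, nothing to do
        }
        currentId.set(ops.grant(ttlSecs));
        return true;
    }

    /** Schedules checkOnce() every periodSecs; the caller shuts the executor down. */
    ScheduledExecutorService start(long periodSecs) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Swallow so one failed check does not cancel all future runs.
            }
        }, periodSecs, periodSecs, TimeUnit.SECONDS);
        return ses;
    }
}
```

Keys attached to the expired lease are deleted by the server along with it, so a REFRESH step like the one in the original report would still be needed to re-put them under the new lease id.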
Thanks a lot @hsyhsw, and sorry for the delay; I was on vacation last week. I will play with it this week and try to get to the bottom of it.
OK. I’ll post an executable jar along with all source files when it is available. It may take some time to test whether it is the right reproducer.