azure-cosmos-dotnet-v3: Lease token was taken over by owner exception
Some of our change feed processors are stuck, and we’re seeing some OperationCanceledException errors as well as CosmosException errors like this one in the logs:
Response status code does not indicate success: PreconditionFailed (412); Substatus: 0; ActivityId: ; Reason: (796 lease token was taken over by owner something-6c082e98-54b1-4fe9-9486-fc51ce2be403
What does it mean for a lease token to be taken by another owner? I was under the impression that a single lease is owned by a single compute instance. The change feed processors also seem to start running again from time to time and then halt again seemingly randomly.
Our configuration involves:
- A single monitored container
- Multiple change feed processors doing different things
- All of them use the same lease configuration
- Multiple instances of the processors on multiple hosts, where every instance name is suffixed with a GUID so that it is unique
About this issue
- State: closed
- Created a year ago
- Comments: 18 (11 by maintainers)
Here’s what I found. The issue is indeed on our side and it occurred when I was migrating to Cosmos DB v3. I’ll describe it here for future reference.
We have manual checkpointing logic with a lot of other things built on top of change feed processors. In our v2 code this is how things would play out during an unhandled exception:
If the ProcessChangesAsync method gets called again and this field (the private field holding the exception) is set, it logs that it’s in a faulted state and throws the exception again.
When an exception occurs in our initial v3 code:
Now any instance that had an exception would halt and perpetually retry acquiring a lease and attempting to process. Ergo infinite “Lease was taken over by owner” logs.
So for us, what used to be instances of IChangeFeedObserver that got discarded every time they had an unhandled exception (v2) were now delegates that continued to be reused in a perpetual faulted state (v3).
I just removed the parts about keeping the exception in a private field.
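The difference between the two behaviours can be modelled without the SDK. What follows is a minimal, language-neutral sketch in Python. It is not our actual code and not the SDK's API. FaultedHandler, run_with_fresh_handlers and run_with_reused_handler are hypothetical names, and the two loops only imitate a handler that is re-created after a failure (like the v2 observers) versus one that is reused for the lifetime of the processor (like the v3 delegates).

```python
class FaultedHandler:
    """Hypothetical handler that remembers its first failure in a private field."""

    def __init__(self):
        self._fault = None  # the private field that caused the problem

    def process_changes(self, batch):
        if self._fault is not None:
            # Already faulted: refuse to process and rethrow the stored exception.
            raise self._fault
        try:
            if "poison" in batch:
                raise ValueError("unhandled exception while processing")
            return len(batch)
        except Exception as exc:
            self._fault = exc
            raise


def run_with_fresh_handlers(batches):
    """v2-like: the failed handler is discarded and a new one is created."""
    processed = 0
    handler = FaultedHandler()
    for batch in batches:
        try:
            processed += handler.process_changes(batch)
        except ValueError:
            handler = FaultedHandler()  # the stored fault goes away with the old handler
    return processed


def run_with_reused_handler(batches):
    """v3-like: the same handler, with its captured state, is called again."""
    processed = 0
    handler = FaultedHandler()
    for batch in batches:
        try:
            processed += handler.process_changes(batch)
        except ValueError:
            pass  # processing is retried, but the handler stays faulted
    return processed


batches = [["a", "b"], ["poison"], ["c"], ["d", "e"]]
print(run_with_fresh_handlers(batches))  # 5
print(run_with_reused_handler(batches))  # 2
```

In the reused case every batch after the first failure is rejected, which matches the endless retry loop we saw.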
Thank you for all the help! I wouldn’t have been able to find my mistake without it.
You can close this issue.
@ealsur We’re investigating this, but for the time being I do not believe it to be related to Cosmos DB v3. We have a bunch of things built on top of the change feed processor SDK, and some of them, like the batching, are probably the crux of it. As soon as I get a better understanding of what’s happening I’ll close this issue 😃
The callbacks are async and they are awaited. NotifyLeaseAcquireAsync is triggered when a lease is acquired by the instance; conceptually it is the same as OpenAsync -> https://github.com/Azure/azure-cosmos-dotnet-v3/blob/e41eea5ad6db54e51e953552bfccf751c0de291f/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedManagement/PartitionControllerCore.cs#L68
If you throw in that method, however, it won’t stop the lease from being acquired -> https://github.com/Azure/azure-cosmos-dotnet-v3/blob/e41eea5ad6db54e51e953552bfccf751c0de291f/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/Monitoring/ChangeFeedProcessorHealthMonitorCore.cs#L40-L51 because the main idea was for these to be monitoring hooks.
You could use it to initialize your buffer; I don’t see why not (unless your initialization can throw exceptions and you were relying on that behavior from V2).
Once the notification for Acquire is sent, the Supervisor is started -> https://github.com/Azure/azure-cosmos-dotnet-v3/blob/e41eea5ad6db54e51e953552bfccf751c0de291f/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedManagement/PartitionSupervisorCore.cs#L32
Which starts the Processor -> https://github.com/Azure/azure-cosmos-dotnet-v3/blob/e41eea5ad6db54e51e953552bfccf751c0de291f/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedProcessing/FeedProcessorCore.cs#L38
Which is the one that calls the delegate.
Any failure during processing would release the lease, which would result in a NotifyLeaseReleaseAsync.
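The sequence described above can be summarised as a small model. This is only an illustrative Python sketch of the order of events as described in this comment, not the SDK's implementation. run_lease and its parameters are hypothetical names, and a finite list of batches stands in for the continuous feed.

```python
def run_lease(lease_token, on_acquire, handle_changes, on_release, batches):
    """Simplified model of one lease lifecycle; returns the ordered event log."""
    events = ["acquired " + lease_token]
    try:
        on_acquire(lease_token)
    except Exception:
        # A failing acquire notification does not undo the acquisition.
        events.append("acquire notification failed (ignored)")
    try:
        for batch in batches:
            handle_changes(batch)  # the user delegate
            events.append("processed %d changes" % len(batch))
    except Exception:
        # Any processing failure releases the lease and notifies the release hook.
        events.append("processing failed")
        events.append("released " + lease_token)
        on_release(lease_token)
    return events


def failing_acquire(token):
    raise RuntimeError("buffer initialisation failed")


def delegate(batch):
    if not batch:
        raise ValueError("empty batch")


released = []
log = run_lease("0", failing_acquire, delegate, released.append, [[1, 2], []])
print(log)
print(released)  # ['0']
```

The point of the model is that an exception in the acquire hook is not a way to refuse a lease, while an exception in the delegate is what leads to the release notification.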
The exception means what the message says: the lease was rebalanced to another host.
Please see https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/change-feed-processor?tabs=dotnet
In a steady state, each lease is maintained by a single host, but there are periods of rebalancing. Check the dynamic scaling section of the article.
When CFP starts, it will do rebalancing until it reaches a steady point where the leases are evenly distributed. If you increase or decrease instances, rebalancing will occur too.
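As a rough illustration of why leases change owner when instances are added or removed, assume the even-distribution target is ceil(leases / hosts) per host. The helpers below are a simplified Python sketch under that assumption, not the SDK's balancing code. target_lease_count and leases_over_target are hypothetical names.

```python
import math


def target_lease_count(total_leases, active_hosts):
    """Leases each host would hold if they were spread evenly (assumed rule)."""
    return math.ceil(total_leases / active_hosts)


def leases_over_target(owned, total_leases, active_hosts):
    """How many of a host's leases are candidates to be taken over by others."""
    return max(0, owned - target_lease_count(total_leases, active_hosts))


# Ten leases on two hosts: each host settles at five.
print(target_lease_count(10, 2))     # 5
# A third host joins: the target drops to four, so a host holding five
# can expect one of its leases to be taken over.
print(leases_over_target(5, 10, 3))  # 1
```

During that window the previous owner may still try to update a lease that another host now holds, which is when the "taken over by owner" message shows up.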