azure-cosmos-dotnet-v3: Lease token was taken over by owner exception

Some of our change feed processors are stuck, and we’re seeing OperationCanceledExceptions as well as CosmosExceptions like the following in the logs:

Response status code does not indicate success: PreconditionFailed (412); Substatus: 0; ActivityId: ; Reason: (796 lease token was taken over by owner something-6c082e98-54b1-4fe9-9486-fc51ce2be403

What does it mean for a lease token to be taken over by another owner? I was under the impression that a single lease is owned by a single compute instance. The change feed processors also seem to start running again from time to time and then halt again, seemingly at random.

Our configuration involves:

  1. A single monitored container
  2. Multiple change feed processors doing different things
  3. All of them use the same lease configuration
  4. Multiple instances of the processors on multiple hosts, where every instance name is suffixed with a GUID so it has a unique name (see the sketch below)
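
A minimal sketch of that setup, assuming hypothetical database, container, and processor names and a placeholder `MyDocument` type:

```csharp
// Minimal sketch of the described setup; all names are illustrative.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

CosmosClient client = new CosmosClient("<connection-string>");
Container monitored = client.GetContainer("db", "monitored"); // 1. the single monitored container
Container leases = client.GetContainer("db", "leases");       // 3. same lease container for all processors

// 2./4. Each processor doing a different job gets its own processorName;
// every host instance gets a unique instance name via a GUID suffix.
ChangeFeedProcessor processor = monitored
    .GetChangeFeedProcessorBuilder<MyDocument>("projection-builder", HandleChangesAsync)
    .WithInstanceName($"host-{Guid.NewGuid()}")
    .WithLeaseContainer(leases)
    .Build();

await processor.StartAsync();

static Task HandleChangesAsync(IReadOnlyCollection<MyDocument> changes, CancellationToken cancellationToken)
{
    // Real processing goes here.
    return Task.CompletedTask;
}

public class MyDocument
{
    public string id { get; set; }
}
```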

Most upvoted comments

Here’s what I found. The issue is indeed on our side, and it was introduced when I was migrating to the Cosmos DB v3 SDK. I’ll describe it here for future reference.

We have manual checkpointing logic, with a lot of other things built on top of the change feed processors. In our v2 code, this is how things would play out during an unhandled exception (sketched after the list):

  1. The observer that processes the change captures the exception in a private field. If that observer’s ProcessChangesAsync method gets called again and this field is set, it logs that it’s in a faulted state and throws it again.
  2. When this exception is thrown the lease is released. The observer is no longer invoked.
  3. The lease is eventually acquired by another instance, and a new observer gets instantiated by the factory.
  4. Everything continues working.
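
A rough sketch of that v2 pattern; the observer types come from the v2 library (Microsoft.Azure.Documents.ChangeFeedProcessor), and the fault-tracking details are a reconstruction of what’s described above:

```csharp
// Reconstruction of the v2 observer pattern described above.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.ChangeFeedProcessor.FeedProcessing;

public class FaultTrackingObserver : IChangeFeedObserver
{
    private Exception faulted; // captured on the first unhandled exception

    public Task OpenAsync(IChangeFeedObserverContext context) => Task.CompletedTask;

    public Task CloseAsync(IChangeFeedObserverContext context, ChangeFeedObserverCloseReason reason)
        => Task.CompletedTask;

    public Task ProcessChangesAsync(
        IChangeFeedObserverContext context,
        IReadOnlyList<Document> docs,
        CancellationToken cancellationToken)
    {
        if (this.faulted != null)
        {
            // Step 1: a faulted observer logs and rethrows on the next invocation.
            throw this.faulted;
        }

        try
        {
            // Manual checkpointing and processing of docs go here.
            return Task.CompletedTask;
        }
        catch (Exception ex)
        {
            this.faulted = ex;
            throw; // Step 2: the lease is released; v2 never invokes this observer again.
        }
    }
}
```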

When an exception occurs in our initial v3 code:

  1. Same thing as before, although the private field is now inside a class that wraps the delegate that processes changes
  2. When this exception is thrown the lease is released
  3. After the lease renewal interval, the same delegate is invoked again to process something, but because the exception field is still set, it logs that it’s in a faulted state and throws it again
  4. Go to 1

Now any instance that had hit an exception would halt, perpetually reacquiring a lease and immediately failing to process it. Ergo the infinite “Lease was taken over by owner” logs.

So for us, what used to be IChangeFeedObserver instances that got dumped every time they hit an unhandled exception (v2) were now delegates that kept being reused in a perpetual faulted state (v3).
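
A hypothetical reconstruction of that faulty v3 wrapper (names are illustrative; the real code also carries our batching and manual checkpointing):

```csharp
// Reconstruction of the bug: this object lives as long as the processor, so
// the faulted state survives lease releases and re-acquisitions.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public class ChangeHandler<T>
{
    private Exception faulted; // BUG: never cleared, and the instance is never replaced

    // Passed to GetChangeFeedProcessorBuilder<T> as the onChangesDelegate.
    public Task HandleChangesAsync(IReadOnlyCollection<T> changes, CancellationToken cancellationToken)
    {
        if (this.faulted != null)
        {
            // Step 3: every time the lease is re-acquired, this throws again
            // right away, releasing the lease once more. Go to 1.
            throw this.faulted;
        }

        try
        {
            // Processing and manual checkpointing go here.
            return Task.CompletedTask;
        }
        catch (Exception ex)
        {
            this.faulted = ex;
            throw; // Step 2: the lease is released, but this object sticks around.
        }
    }
}
```

Unlike the v2 observer, nothing ever disposes of this object, so the faulted state is permanent.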

I just removed the parts that kept the exception in a private field.

Thank you for all the help! I wouldn’t have been able to find my mistake without it.

You can close this issue.

@ealsur We’re investigating this, but for the time being I do not believe it to be related to Cosmos DB v3. We have a bunch of things built on top of the change feed processor SDK, and one of them, like the batching, is probably the crux of it. As soon as I get more understanding of what’s happening I’ll close this issue 😃

The callbacks are async and they are awaited. NotifyLeaseAcquireAsync is triggered when a lease is acquired by the instance; conceptually it is the same as OpenAsync -> https://github.com/Azure/azure-cosmos-dotnet-v3/blob/e41eea5ad6db54e51e953552bfccf751c0de291f/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedManagement/PartitionControllerCore.cs#L68

If you throw in that method, however, it won’t stop the lease from being acquired -> https://github.com/Azure/azure-cosmos-dotnet-v3/blob/e41eea5ad6db54e51e953552bfccf751c0de291f/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/Monitoring/ChangeFeedProcessorHealthMonitorCore.cs#L40-L51, because the main idea was for these to be monitoring hooks.

You could use it to initialize your buffer; I don’t see why not (unless you might throw exceptions there and were relying on that behavior from v2).
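
For example, a sketch of initializing per-lease state from the acquire notification, reusing the container and MyDocument placeholders from the earlier sketch:

```csharp
// Sketch: using the lease acquire notification as an initialization hook.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class AcquireNotificationExample
{
    public static ChangeFeedProcessor Build(Container monitored, Container leases)
    {
        return monitored
            .GetChangeFeedProcessorBuilder<MyDocument>("projection-builder", HandleChangesAsync)
            .WithInstanceName($"host-{Guid.NewGuid()}")
            .WithLeaseContainer(leases)
            .WithLeaseAcquireNotification(leaseToken =>
            {
                // Conceptually the v2 OpenAsync: runs when this instance acquires
                // the lease. Throwing here does not prevent the acquisition.
                Console.WriteLine($"Acquired lease {leaseToken}; initializing buffer.");
                return Task.CompletedTask;
            })
            .Build();
    }

    private static Task HandleChangesAsync(IReadOnlyCollection<MyDocument> changes, CancellationToken cancellationToken)
        => Task.CompletedTask;
}
```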

Once the notification for Acquire is sent, the Supervisor is started -> https://github.com/Azure/azure-cosmos-dotnet-v3/blob/e41eea5ad6db54e51e953552bfccf751c0de291f/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedManagement/PartitionSupervisorCore.cs#L32

Which starts the Processor -> https://github.com/Azure/azure-cosmos-dotnet-v3/blob/e41eea5ad6db54e51e953552bfccf751c0de291f/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/FeedProcessing/FeedProcessorCore.cs#L38

Which is the one that calls the delegate.

Any failure during processing would release the lease, which would result in a NotifyLeaseReleaseAsync notification.
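
The counterpart hooks can be chained onto the same builder to observe releases and processing failures; a sketch:

```csharp
// Sketch: monitoring hooks for lease releases and processing errors.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class ReleaseNotificationExample
{
    public static ChangeFeedProcessorBuilder WithMonitoring(ChangeFeedProcessorBuilder builder)
    {
        return builder
            .WithLeaseReleaseNotification(leaseToken =>
            {
                // Fires when this instance releases the lease, e.g. after a
                // processing failure or during rebalancing.
                Console.WriteLine($"Released lease {leaseToken}.");
                return Task.CompletedTask;
            })
            .WithErrorNotification((leaseToken, exception) =>
            {
                // Fires when the delegate or the feed pipeline throws for a lease.
                Console.WriteLine($"Lease {leaseToken} faulted: {exception.Message}");
                return Task.CompletedTask;
            });
    }
}
```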

The exception means what the Message says: the lease was rebalanced to another host.

Please see https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/change-feed-processor?tabs=dotnet

In a steady state, each lease is maintained by a single host, but there are periods of rebalancing. Check the dynamic scaling section of the article.

When the CFP starts, it will rebalance until it reaches a steady state where the leases are evenly distributed. If you increase or decrease the number of instances, rebalancing will occur too.