azure-functions-host: Cosmos Change Feed Input Trigger - Lease Stops Sending Triggers
Investigative information
Please provide the following:
- Timestamp: 2017-12-28T15:42:22.000Z
- Function App version (1.0 or 2.0-beta): 1.0
- Invocation ID: 17bb63ca-ab9c-45bc-9b77-41c8e9a21dee
- Region: East US 2
Repro steps
It seems that after our change feeds have been running for a while, they intermittently stop working. This could be related to RU/s being exhausted, but there is no recovery process. The DB being monitored logged 429s while it was under occasional load.
Expected behavior
Even when the DB being monitored is under load or receives 429s, the lease for the change feed should not ever get into a hung state. It should recover when the lease gets frozen. The input trigger should never stop firing.
Actual behavior
The input trigger stops firing and no new change feed docs are processed despite there still being traffic/activity in the lease db collection.
Known workarounds
The only known workaround I’ve been able to identify is to delete the lease collection and restart the function host. Restarting the function host recreates the lease collection (assuming createLeaseCollectionIfNotExists=true in function.json), and input triggers begin flowing in again. This is a very weak solution, as we lose any change feed activity since the feed stopped responding.
Related information
- Language: JavaScript
- AzureWebJobsHostLogs: stops writing trigger events when the change feed is hung/expires, even though change feed events are occurring and the lease collection has activity. The function host either stops checking for new events or is hung. Restarting the function app has no effect.
Lease Collection Partition State
Upon further inspection of the lease collection, all other partition documents contain state=2 and a valid Owner GUID. One partition does not contain an owner and has state=1.
Invalid Lease Collection Document
{
"state": 1,
"PartitionId": "7",
"Owner": null,
"ContinuationToken": "\"245895\"",
"SequenceNumber": 4726
}
Other partitions are sending triggers to the function app, just not this partition, which is stuck in state=1. It appears this issue is related to a partition state being stuck.
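For anyone wanting to check their own lease collection for the same symptom, here is a minimal sketch of the comparison described above. It assumes the lease documents have already been read into an array (how you query them is up to you), and it treats state=2 with a non-null Owner as "healthy" purely because that is what the working partitions showed here; the meaning of the numeric states is an assumption, not documented behavior. The function name findStuckLeases and the partition 6 sample are hypothetical.

```javascript
// Healthy partitions in this issue reported state=2. What each numeric
// state actually means is an assumption based on observation only.
const HEALTHY_STATE = 2;

// Return the lease documents that look stuck: they have a PartitionId but
// either no Owner or a state different from the healthy one.
function findStuckLeases(leaseDocs, healthyState = HEALTHY_STATE) {
  return leaseDocs.filter(
    (doc) =>
      doc.PartitionId !== undefined &&
      (doc.Owner == null || doc.state !== healthyState)
  );
}

// Sample data: partition 7 is copied from this issue; partition 6 is a
// made-up healthy lease with a placeholder Owner GUID.
const sampleLeases = [
  {
    state: 2,
    PartitionId: "6",
    Owner: "00000000-0000-0000-0000-000000000000",
    ContinuationToken: "\"1\"",
    SequenceNumber: 1,
  },
  {
    state: 1,
    PartitionId: "7",
    Owner: null,
    ContinuationToken: "\"245895\"",
    SequenceNumber: 4726,
  },
];

const stuck = findStuckLeases(sampleLeases);
for (const lease of stuck) {
  console.log(
    `Partition ${lease.PartitionId}: state=${lease.state}, owner=${lease.Owner}`
  );
}
```

This only flags suspicious lease documents; it does not repair them, and it says nothing about why the owner was dropped.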
About this issue
- State: closed
- Created 7 years ago
- Comments: 51 (15 by maintainers)
Since this PR is still pending review, I would guess the ETA for this to be merged and then deployed to prod to be around mid June.
@alohaninja Yes, large transaction, different account (now test). Should be mitigated now.
Please try again. This is a slightly different issue: with maxItemCount=1 there is a long transaction that doesn’t fit into the max response size. That must come from a script modifying lots of big documents as part of one transaction, and currently a continuation token cannot address part of a transaction. This is going to be supported soon.
@alohaninja Right now in Azure Functions the setting is not settable, but it is coming shortly.
@alohaninja - no, this means “use default”, which is 100. Can you try reducing it and see which value works? A change to CFP to take care of this is coming soon (on GitHub first; the Functions integration would follow some time later).
@ealsur - this probably needs to be in the extensions repo that has cosmos db binding in it. Could you investigate?