azure-functions-durable-extension: Orchestrations freezing mid-execution
Description
We are still occasionally experiencing large gaps in time between the scheduling and the starting of durable orchestrations. In this case, all durable orchestrations got “stuck” and would not start until a complete restart of the Function App. There is virtually no load on the Function App when this occurs, and it’s not clear that scaling events are related to the issue.
Expected behavior
We expect that, under no load, there should be no more than a few seconds between the time an orchestration or sub-orchestration is scheduled and the time it starts.
Actual behavior
We saw a 2-minute gap for one part of a sub-orchestration and then a complete freeze of the rest of the orchestration that appeared to be indefinite and was only resolved by completely restarting the Function App.
Relevant source code snippets
This is my host.json. If any of these parameters are ill-advised, please let me know; we were guessing on some of them.
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "EgressHub",
      "storageProvider": {
        "useLegacyPartitionManagement": false,
        "maxQueuePollingInterval": "00:00:01"
      }
    }
  },
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    },
    "logLevel": {
      "Default": "None"
    }
  }
}
Known workarounds
Restarting the Function App (3.5 hours later) immediately resolved the backlogged work.
App Details
- Durable Functions extension version (e.g. v1.8.3): 2.7.1
- Azure Functions runtime version (1.0 or 2.0): 4
- Programming language used: .NET (C#)
Screenshots
This screenshot of our Kibana instance (all times are US Central, i.e. 5 hours behind Zulu) shows a 1.5-minute gap between a sub-orchestration being scheduled and being run. (Note that “Orchestration started” appears a few ms after “Orchestration completed”. The entries are out of order, but you can ignore that: we log to Kibana from an activity function, and that sometimes happens. I checked to make sure the two correlate.)
This sub-orchestration had three nested sub-orchestrations, each labeled here as an Action orchestration; they were executed in parallel.
- Timeframe issue observed: June 9th, 2022, 08:00:00 GMT through 11:30:00 GMT
- Function App name: connectx-egress-stg
- Function name(s): RunWorkflowOrchestration
- Azure region: US East
- Orchestration instance ID(s): b1da195c78254fec9bcc60e11f74b803
- Azure storage account name: strgecnctxstgegress
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 22 (5 by maintainers)
@cgillum had suggested that a workaround may be to minimize host scale-out events. We are on a Premium plan (EP1) and, since this incident occurred, have rolled out a 4-instance minimum for this app on that plan. We’ve also upped the partition count to 8 in the hope that a scale-out event will be better tolerated.
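For reference, this is a minimal sketch of how we set the partition count, assuming the Azure Storage provider’s partitionCount setting under storageProvider in host.json (unrelated sections omitted; 8 is simply the value we chose over the default of 4):

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "EgressHub",
      "storageProvider": {
        "partitionCount": 8,
        "useLegacyPartitionManagement": false,
        "maxQueuePollingInterval": "00:00:01"
      }
    }
  }
}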
It’s hard to say if that will make a difference, as this is hard to reproduce. But I will certainly update the thread if it happens again after having done that.
@a3y3 - Chris isn’t available at the moment (he’s on vacation). But we believe the MSSQL provider doesn’t use partition management, so it would avoid the shutdown errors caused by worker changes.
Since we originally started reporting this issue (back when we were using 2.6.1), we have had two complete freezes that required a full restart: one in one of our production environments and one in our staging environment. These freezes are very worrisome, as our application’s durable tasks need to run in a timely way to be at all relevant. And it sounds like there is little to nothing we can do about them.
What mitigation steps can we take that we haven’t already? Should we be moving to a Windows-based plan so that we can take a memory dump, should we see the issue again?
To be honest, we never expected we would have these kinds of issues when we first implemented the solution using Durable Tasks. Should we be considering an alternative at this point? If so, what would you recommend? If not, should we be considering a different Durable Functions storage backend (such as Netherite)? Would that even help in this case, or wouldn’t that matter?
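For context on what that switch would entail, my rough understanding (based on the Netherite samples, so treat the setting names as an approximation rather than verified configuration) is that it is primarily a storage-provider change in host.json plus an Event Hubs namespace whose connection string is referenced by an app setting:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "EgressHub",
      "storageProvider": {
        "type": "Netherite",
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      }
    }
  }
}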
Has this fallen from the radar?
It looks like this might be a new version of some of the lease management issues we’ve been seeing. FYI @amdeel / @davidmrdavid.
In particular, your sub-orchestration (2798b7e982a44f4b981da5d2a69eb736:4) was scheduled to run on the queue/partition egresshub-control-00, but for some reason we were unable to correctly acquire a lease on egresshub-control-00, which caused a ~90 second delay between the time this sub-orchestration was scheduled and when it started running. It seems to me that it self-healed in this case, but the problem in general is consistent with your description of occasionally needing to restart your function app. We’ll need to dive deeper to figure out how this happened.