azure-functions-durable-extension: Orchestrations freezing mid-execution
Description
We are still occasionally experiencing large gaps in time between the scheduling and the starting of durable orchestrations. In this case, all durable orchestrations got “stuck” and would not start until a complete restart of the Function App. There is virtually no load on the Function App when this occurs, and it’s not clear that scaling events are related to the issue.
Expected behavior
We expect that, under no load, there should be no more than a few seconds between the time an orchestration or sub-orchestration is scheduled and the time it starts.
Actual behavior
We saw a 2-minute gap for one part of a sub-orchestration and then a complete freeze of the rest of the orchestration that appeared to be indefinite and was only resolved by completely restarting the Function App.
Relevant source code snippets
This is my host.json. If any of these parameters are ill-advised, please let me know; we were guessing on some of them.
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "EgressHub",
      "storageProvider": {
        "useLegacyPartitionManagement": false,
        "maxQueuePollingInterval": "00:00:01"
      }
    }
  },
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    },
    "logLevel": {
      "Default": "None"
    }
  }
}
Known workarounds
Restarting the Function App (3.5 hours later) immediately resolved the backlogged work.
App Details
- Durable Functions extension version (e.g. v1.8.3): 2.7.1
- Azure Functions runtime version (1.0 or 2.0): 4
- Programming language used: .NET (C#)
Screenshots
This screenshot of our Kibana instance (all times are US Central, i.e. 5 hours behind Zulu) shows a 1.5-minute gap between a sub-orchestration being scheduled and being run. (Note that “Orchestration started” appears a few ms after “Orchestration completed”. The entries are out of order, but you can ignore that: we log to Kibana from an activity function, and that sometimes happens. I checked to make sure the two correlate.)
This sub-orchestration had three nested sub-orchestrations, each labeled here as an Action orchestration; they were executed in parallel.
- Timeframe issue observed: June 9th, 2022, 08:00:00 GMT through 11:30:00 GMT
- Function App name: connectx-egress-stg
- Function name(s): RunWorkflowOrchestration
- Azure region: US East
- Orchestration instance ID(s): b1da195c78254fec9bcc60e11f74b803
- Azure storage account name: strgecnctxstgegress
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 22 (5 by maintainers)
@cgillum had suggested that a workaround may be to minimize host scale-out events. We are on a Premium plan (EP1) and, since this incident occurred, have rolled out a 4-instance minimum for this app on that plan. We’ve also upped the partition count to 8 in the hope that a scale-out event will be better tolerated.
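For reference, this is a minimal sketch of how we set the partition count, assuming the Azure Storage provider’s partitionCount setting under storageProvider in host.json (unrelated sections omitted; 8 is simply the value we chose over the default of 4):

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "EgressHub",
      "storageProvider": {
        "partitionCount": 8,
        "useLegacyPartitionManagement": false,
        "maxQueuePollingInterval": "00:00:01"
      }
    }
  }
}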
It’s hard to say if that will make a difference, as this is hard to reproduce. But I will certainly update the thread if it happens again after having done that.
@a3y3 - Chris isn’t available at the moment (he’s on vacation). But we believe the MSSQL provider doesn’t use partition management, so it would avoid the shutdown errors caused by worker changes.
Since we originally started reporting this issue (back when we were using 2.6.1), we have had two complete freezes that required a full restart: one in one of our production environments and one in our staging environment. These freezes are very worrisome, as our application’s durable tasks need to run in a timely way to be at all relevant. And it sounds like there is little to nothing we can do about them.
What mitigation steps can we take that we haven’t already? Should we be moving to a Windows-based plan so that we can take a memory dump, should we see the issue again?
To be honest, we never expected we would have these kinds of issues when we first implemented the solution using Durable Tasks. Should we be considering an alternative at this point? If so, what would you recommend? If not, should we be considering a different Durable Functions storage backend (such as Netherite)? Would that even help in this case, or wouldn’t that matter?
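For context on what that switch would entail, my rough understanding (based on the Netherite samples, so treat the setting names as an approximation rather than verified configuration) is that it is primarily a storage-provider change in host.json plus an Event Hubs namespace whose connection string is referenced by an app setting:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "EgressHub",
      "storageProvider": {
        "type": "Netherite",
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      }
    }
  }
}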
Has this fallen from the radar?
It looks like this might be a new version of some of the lease management issues we’ve been seeing. FYI @amdeel / @davidmrdavid.
In particular, your sub-orchestration (2798b7e982a44f4b981da5d2a69eb736:4) was scheduled to run on the queue/partition egresshub-control-00, but for some reason we were unable to correctly acquire a lease on egresshub-control-00, which caused a ~90 second delay between the time this sub-orchestration was scheduled and when it started running. It seems to me that it self-healed in this case, but the problem in general is consistent with your description of occasionally needing to restart your function app. We’ll need to dive deeper to figure out how this happened.