azure-functions-durable-extension: Instances are getting stuck Running even though execution has been completed. ExecutionCompleted is no longer happening.

Resolved

I can now confirm that the issue was caused by adding the “Dynatrace OneAgent” site extension to our function app in Azure. How exactly it breaks is unclear, but the process seems to do something that results in illegal awaits in the durable task framework. See more here: https://github.com/Azure/azure-functions-durable-extension/issues/2133#issuecomment-1343509726
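
To illustrate the failure mode (not the exact mechanism of the OneAgent extension, which we never pinned down): an "illegal await" is an await on work that was not scheduled through the orchestration context. A purely illustrative sketch of that pattern, which the Durable Task Framework reports as multithreaded execution, might look like this (none of this is our actual code):

    using System;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;

    public class IllegalAwaitExample
    {
        // Illustrative only: shows the difference between a legal durable await
        // and an "illegal" await on a task the framework did not schedule.
        [FunctionName("IllegalAwaitExample")]
        public async Task Run(
            [OrchestrationTrigger] IDurableOrchestrationContext context)
        {
            // Fine: a durable task created through the orchestration context.
            await context.CallActivityAsync("Calculate", context.GetInput<object>());

            // Illegal: a task the dispatcher did not schedule. The orchestrator resumes
            // on a thread outside the framework's control and replay can get out of sync.
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }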

Original post below

Description

My Durable Functions app has been running fine in Azure for a while but has suddenly started to misbehave. Instances complete their entire execution of the orchestrator function, but they no longer finish with a completed instance status. The statuses stay as “Running” and the “ExecutionCompleted” event is not added to the instance history table.

Expected behavior

At the end of an orchestration function, the instance should complete, fail or restart (ContinueAsNew).

Actual behavior

The orchestrator runs all the way through logically, but the instance does not end correctly. Since the issue appeared, I have found only a single instance that has been marked as Completed.
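
A minimal sketch of how the status can be checked from a client function (the HTTP trigger and names here are hypothetical; GetStatusAsync is the relevant API). For the affected instances, RuntimeStatus stays Running even after the last activity has finished:

    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Http;
    using Microsoft.AspNetCore.Mvc;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.Http;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;
    using Microsoft.Extensions.Logging;

    public class StatusCheck
    {
        // Hypothetical status-check function, shown only to illustrate reading the
        // RuntimeStatus that is stuck at "Running".
        [FunctionName("GetInstanceStatus")]
        public async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "get", Route = "status/{instanceId}")] HttpRequest req,
            [DurableClient] IDurableOrchestrationClient client,
            string instanceId,
            ILogger log)
        {
            // Returns null when the instance ID is unknown to the current task hub.
            var status = await client.GetStatusAsync(instanceId);
            if (status == null)
            {
                return new NotFoundResult();
            }

            // For the affected instances this logs "Running" even after the orchestrator returned.
            log.LogInformation("Instance {Id} is {Status} (last updated {Updated})",
                instanceId, status.RuntimeStatus, status.LastUpdatedTime);
            return new OkObjectResult(status);
        }
    }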

Relevant source code snippets

(Chopped up and simplified example to make sense of the screenshots)

        [FunctionName("Orchestrator")]
        public async Task RunOrchestrator(
            [OrchestrationTrigger] IDurableOrchestrationContext context,
            ILogger log)
        {
            var data = context.GetInput<OrchestratorInputModel>();
            try
            {
                if (data.Something)
                {
                    await context.CallActivityAsync("Calculate", data);
                    context.ContinueAsNew(data);
                    return;
                }
                await context.CallActivityAsync("Publish", data);
            }
            catch (SomeException exception)
            {
                await context.CallActivityAsync("UnrecoverableErrorHandler", exception);
            }
        }
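
For completeness, the orchestration is kicked off from a client function along these lines. This is a simplified, hypothetical starter (trigger type and names are not our exact code); StartNewAsync and CreateCheckStatusResponse are the relevant APIs:

    using System.IO;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Http;
    using Microsoft.AspNetCore.Mvc;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.Http;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;
    using Microsoft.Extensions.Logging;
    using Newtonsoft.Json;

    public class OrchestratorStarter
    {
        // Hypothetical starter, included only to show how "Orchestrator" gets invoked.
        [FunctionName("Orchestrator_HttpStart")]
        public async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req,
            [DurableClient] IDurableOrchestrationClient starter,
            ILogger log)
        {
            var body = await new StreamReader(req.Body).ReadToEndAsync();
            var input = JsonConvert.DeserializeObject<OrchestratorInputModel>(body);

            // Schedules a new instance of the orchestrator shown above.
            var instanceId = await starter.StartNewAsync("Orchestrator", input);
            log.LogInformation("Started orchestration {InstanceId}", instanceId);

            // Returns the standard management URLs (status query, terminate, etc.).
            return starter.CreateCheckStatusResponse(req, instanceId);
        }
    }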

Known workarounds

Redeploying to point to a new task hub worked for a few weeks before the problem returned.
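
For reference, pointing the app at a new task hub is just a host.json change (the hub name below is only an example):

    {
      "version": "2.0",
      "extensions": {
        "durableTask": {
          "hubName": "MyNewTaskHub"
        }
      }
    }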

App Details

  • Microsoft.Azure.WebJobs.Extensions.DurableTask: 2.6.0
  • Azure Functions runtime version: 3
  • Programming language used: C#
  • Framework: .NET Core 3.1

Screenshots

Columns (screenshot)

Correctly ending instance. (screenshot)

Instances that have run all the way through but are still “Running” and missing the ExecutionCompleted line. (screenshots)

If deployed to Azure

  • Timeframe issue observed: 2022-04-05T11:00:00Z onwards
  • Azure storage account name: can share privately if requested
  • Function App name: can share privately if requested
  • Function name(s): Orchestrator
  • Azure region: Australia East
  • Orchestration instance ID(s):
    1. 00083b81c88b47db8e5829fe88858e3a
    2. 00f41cc833ce44d781e5b857a2fd8381
    3. 699e9886aea3416fbcdd92b11b8dbc84

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 30 (17 by maintainers)

Most upvoted comments

@davidmrdavid I also think it’s probably doing something that’s affecting the threading based on the fact that we start to see those “Multithreaded execution” logs.

If you do want to give it a try on your end, you can follow this guide here. Just note that you’ll need an actual Dynatrace environment to connect the extension to. It looks like there is a free trial available, but I’ve never set one up myself, so all I can say is good luck. 😃

Hi @davidmrdavid

I’m just writing to give an update on our findings.

So first to answer your questions: Yes, we saw the errors on our side and yes, they were showing as Running.

Major news

I’m about 95% confident that we’ve found the actual culprit! We’ve been able to trace the start of the issue to when we added the “Dynatrace OneAgent” extension to our application.

Our deployments go to a new slot (without the extension) and then perform a slot swap, which explains why new deployments fixed the issue for us in the past.


We’re simply not going to install it on this particular app going forward, and hopefully we’ll be set. Thanks so much for your help with this, and hopefully this can help others.

Hi @philwhere - apologies for the late response, I thought I had replied already.

I did take a look at the code and did not find any immediate issues with it. To help us make progress in this investigation, it would be most helpful if you could repro this with DF Extension 2.7.2, so we can see whether our new telemetry flags this issue as being caused by an “illegal await”.

Thanks @kirk-marple. I’ve actually been using your small repro to try to debug this 😃 . I’ll reply shortly in our thread with a .zip of my test project. I’ll need a few minutes.

Yes, there have been no code changes. The last deployment finished 2022-03-14T20:56:17Z. I’ll let my team know about the Functions Host action point and I’ll let you know if I see any changes 🤞 Thanks @davidmrdavid

@philwhere @davidmrdavid Just confirmed this is an issue with 2.6.1; rolling back to 2.6.0 made my similar issue go away.

Hi @philwhere,

Thanks for reaching out. I’ve created an internal thread where we’re investigating your incident, I’ll get back to you as soon as we make progress.

In the meantime, a preliminary look at our telemetry does suggest that your orchestration is somehow getting stuck when receiving its last Activity result. For some reason, your orchestration’s last invocation logs a FunctionAwaited message, implying that it still has work to wait for, even though by then it should have received all results. I don’t have a good understanding of what’s going on just yet, but I’ll keep you updated.