azure-functions-durable-python: Orchestrator child functions get stuck in running state due to scheduled state not updating on activity functions
🐛 Describe the bug
We are seeing extremely long delays between task creation, scheduling, and completion when running Durable Functions in Python. In particular, we see it in our sub-orchestrators that are fanned out from a parent orchestrator. Below is a sample from the debug tool for orchestrators, retrieved at 2022-07-06T12:46:
| Timestamp (UTC) | EventType | EventId | Name | ScheduledTime (UTC) | Result | Details |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-07-06T10:52:11.8570597Z | ExecutionStarted | | func_insights_update_single_provider_orchestrator | | | |
| 2022-07-06T10:52:31.0964031Z | TaskScheduled | | func_insights_get_update_accounts_response_url_activity | | | |
As you can see, the task has been sitting in a scheduled state for hours, with no scheduled time recorded (the field is simply blank) and no errors or completions reported.
An example that shows the gap between scheduled and completed can also be seen here:
| Timestamp (UTC) | EventType | EventId | Name | ScheduledTime (UTC) | Result | Details |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-07-06T12:08:30.4798419Z | TaskCompleted | 3 | func_insights_check_update_response_url_and_save_activity | 2022-07-06T11:42:13.6764145Z | `{"status":"Succeeded","transactionsCount":2}` | |
This behaviour is occurring in a function app that uses create_timer to wait two minutes for an asynchronous process to complete in a third-party system. At first I believed this to be the root cause; however, when I run the functions on their own, outside of the fan-out/fan-in, I see the expected behaviour described below.
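For clarity, the sub-orchestrator follows roughly the shape sketched below. This is illustrative rather than our production code (the input handling and exact control flow are simplified), but the activity names are the real ones from the history above, and it shows where create_timer sits in the flow.

```python
# Sketch of the sub-orchestrator shape (not production code).
import azure.durable_functions as df
from datetime import timedelta


def orchestrator_function(context: df.DurableOrchestrationContext):
    provider = context.get_input()

    # Kick off the asynchronous update in the third-party system.
    response_url = yield context.call_activity(
        "func_insights_get_update_accounts_response_url_activity", provider
    )

    # Durable timer: wait two minutes for the third-party process to finish.
    fire_at = context.current_utc_datetime + timedelta(minutes=2)
    yield context.create_timer(fire_at)

    # Check the result and persist it.
    result = yield context.call_activity(
        "func_insights_check_update_response_url_and_save_activity", response_url
    )
    return result


main = df.Orchestrator.create(orchestrator_function)
```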
In order to rule out storage account issues I cleared the queues and history tables, but the issue is unfortunately still persisting. Prior to clearing the history tables I could see that some of these orchestrations had been stuck in a scheduled state since 2022-07-03.
In case it helps with your investigation, I have included my host.json below:
```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "datasciencefn",
      "maxConcurrentActivityFunctions": 8,
      "maxConcurrentOrchestratorFunctions": 8
    }
  },
  "functionTimeout": "12:00:00",
  "logging": {
    "logLevel": {
      "Function": "Information"
    },
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": false,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[2.*, 2.8.3)"
  }
}
```
🤔 Expected behavior
When running the functions I would expect them to be scheduled and completed in a reasonable time frame given the level of complexity. For example, in our .NET function app the entire fan-out/fan-in completes in about 20 minutes, with each orchestrator taking only a minute or two to complete after being scheduled. For example:
| Timestamp (UTC) | EventType | EventId | Name | ScheduledTime (UTC) | Result | Details | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 04a2ef749821414fbeb3d85fdf1f0c3f:0 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:00:40Z | 30s944ms | Completed | 88fef2c0-9681-11ec-800b-fd76d223f577 | Completed |
| 04a2ef749821414fbeb3d85fdf1f0c3f:1 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:00:20Z | 10s778ms | Completed | ea4b52d0-e5e1-11ec-973d-21ae3454ae9c | Completed |
| 04a2ef749821414fbeb3d85fdf1f0c3f:10 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:01:03Z | 53s921ms | Completed | de432590-73da-11ec-a577-d72d607b06df | Completed |
| 04a2ef749821414fbeb3d85fdf1f0c3f:100 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:01:27Z | 1m17s | Completed | 883b2ed0-e3f7-11ec-bc34-ef840eea0f91 | Completed |
| 04a2ef749821414fbeb3d85fdf1f0c3f:1000 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:02:45Z | 2m36s | Completed | e6061840-bff3-11ec-a10b-c7b644a8ed8f | Completed |
| 04a2ef749821414fbeb3d85fdf1f0c3f:1001 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:02:36Z | 2m26s | Completed | 0e398e70-f06c-11ec-b1f7-436ba0158194 | Completed |
☕ Steps to reproduce
- What Durable Functions patterns are you using, if any? The function apps use fan-out/fan-in as well as durable timers (a minimal sketch of the pattern's shape follows this list).
- Any minimal reproducer we can use? I do not have a project that recreates this issue at this time.
- Are you running this locally or on Azure? I am running this on Azure.
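A minimal sketch of the parent's fan-out/fan-in shape, for reference. The provider-listing activity name func_get_providers_activity is a placeholder, not one of our real functions; the sub-orchestrator name is the real one reported above.

```python
# Sketch of the fan-out/fan-in parent orchestrator (illustrative only).
import azure.durable_functions as df


def orchestrator_function(context: df.DurableOrchestrationContext):
    # Placeholder activity that returns the list of providers to process.
    providers = yield context.call_activity("func_get_providers_activity", None)

    # Fan out: one sub-orchestration per provider.
    tasks = [
        context.call_sub_orchestrator(
            "func_insights_update_single_provider_orchestrator", provider
        )
        for provider in providers
    ]

    # Fan in: wait for all sub-orchestrations to complete.
    results = yield context.task_all(tasks)
    return results


main = df.Orchestrator.create(orchestrator_function)
```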
⚡ If deployed to Azure
We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!
- Timeframe issue observed: latest run started at 2022-07-06T10:52:00Z UTC
- Function App name: data-science-live-fn
- Function name(s): func_insights_all_provider_update_orchestrator and func_insights_update_single_provider_orchestrator
- Azure region: North Europe
- Orchestration instance ID(s): a5717aa6769f4ddcab658c4cd7c6092c
- Azure storage account name: ***
If you don’t want to share your Function App or storage account name on GitHub, please at least share the orchestration instance ID. Otherwise it’s extremely difficult to look up information.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 35 (1 by maintainers)
Hi @davidmrdavid. Apologies for the delay; I’ve been busy migrating our service into our .NET function app. Everything is running as expected now that we’ve done some batching and moved it to .NET. When I rewrote it to do batching in Python I was still seeing stalling, hence the migration to .NET as well.
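For context, "batching" here means roughly the following: instead of fanning out every sub-orchestration at once, the list is split into fixed-size chunks and each chunk is awaited before the next is scheduled. The sketch below is illustrative only; the batch size and the provider-listing activity are placeholders, not our actual code.

```python
# Sketch of a batched fan-out (illustrative only, not production code).
import azure.durable_functions as df

BATCH_SIZE = 8  # hypothetical batch size, not the production value


def orchestrator_function(context: df.DurableOrchestrationContext):
    # Placeholder activity that returns the list of providers to process.
    providers = yield context.call_activity("func_get_providers_activity", None)

    results = []
    for start in range(0, len(providers), BATCH_SIZE):
        batch = providers[start:start + BATCH_SIZE]
        # Fan out only the current chunk of sub-orchestrations...
        tasks = [
            context.call_sub_orchestrator(
                "func_insights_update_single_provider_orchestrator", provider
            )
            for provider in batch
        ]
        # ...and fan in before scheduling the next chunk.
        results.extend((yield context.task_all(tasks)))
    return results


main = df.Orchestrator.create(orchestrator_function)
```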
I don’t have a reproducer readily available, I’m afraid; that isn’t our actual Python function, and I’m not sure I’ll have time to write one this week. If you’re okay to wait another week, I should have some time to write one for you?
@davidmrdavid
I had to jump back onto my computer for a minute in the end, so I double-checked the version number and it is indeed on SDK 1.1.5. In terms of trends, it occurs on 3 of the last 4 instances, with the exception of the most recent parent function I ran. At a quick glance I can’t see anything that jumps out at me that would indicate a difference between these. I think it might make sense to let the function run on its own over the weekend now that the children are at least updating again as expected. Then, once we have some more examples, I can provide additional feedback and we can decide next steps, if that sounds good to you?
@davidmrdavid
Noted, thanks so much for your assistance with this. Hopefully this is something simple that I’m just missing, but I feel like I’ve followed the steps required.
With regard to the non-determinism, here are a couple of example instance IDs:
8e8499cb2da146bd9dcd83822ad41e14
70f4e6b5053b42348716048d37268034
In the meantime I’m going to take a look at the high failure rate on that single activity, just to rule out anything stupid on our side that we’ve missed.
@davidmrdavid I’ve reviewed the actual repository for the durable extension bundles and realised my misunderstanding of how they are versioned internally; apologies, that’s my bad for not fully grasping that.
The `2.8.3` maximum version above was added while I was debugging this issue initially, as I thought that perhaps the issue seen by another user in the GitHub issues around try/catch blocks might be causing aberrant behaviour. Initially our host.json had the value `[2.*, 3.0.0)`. Looking at the GitHub repo for the extension bundles, the latest `2.*` release is `2.10.0`, which only targets Durable Task extension version `2.6.1`.
I’m going to try to manually force a Durable Task extension update to version `2.7.2` using the Core Tools and I’ll report back on how I get on once that update has been made, since it would appear that the app has never run on `2.7.2` before if it was targeting a maximum version of `3.0.0` in the extension bundles.
Thanks for your help so far, and I’ll try to get some confirmation for you regarding the outcome within the next day.
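For anyone following along, the manual pin would look roughly like this (as I understand it, and not yet verified in this app): remove the extensionBundle section from host.json and install the extension explicitly with the Core Tools, e.g. `func extensions install --package Microsoft.Azure.WebJobs.Extensions.DurableTask --version 2.7.2`, which generates a local extensions project pinned to that version.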