azure-functions-durable-python: Orchestrator child functions get stuck in running state due to scheduled state not updating on activity functions
🐛 Describe the bug
We are seeing extremely long delays between task creation, scheduling, and completion when running Durable Functions in Python. In particular, we see it in our sub-orchestrators that are fanned out from a parent orchestrator. Below is a sample from the debug tool for orchestrators, retrieved at 2022-07-06T12:46:
| Timestamp (UTC) | EventType | EventId | Name | ScheduledTime (UTC) | Result | Details |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-07-06T10:52:11.8570597Z | ExecutionStarted | | func_insights_update_single_provider_orchestrator | | | |
| 2022-07-06T10:52:31.0964031Z | TaskScheduled | | func_insights_get_update_accounts_response_url_activity | | | |
As you can see, the task has been sitting in a scheduled state for hours, with no scheduled time recorded (the field is simply blank) and no errors or completions reported.
An example that shows the gap between scheduled and completed can also be seen here:
| Timestamp (UTC) | EventType | EventId | Name | ScheduledTime (UTC) | Result | Details |
| --- | --- | --- | --- | --- | --- | --- |
| 2022-07-06T12:08:30.4798419Z | TaskCompleted | 3 | func_insights_check_update_response_url_and_save_activity | 2022-07-06T11:42:13.6764145Z | `{"status":"Succeeded","transactionsCount":2}` | |
This behaviour is occurring in a function app that uses create_timer to wait two minutes for an asynchronous process to complete in a third-party system. At first I believed this to be the root cause; however, when I run the functions on their own, outside of the fan-out/fan-in, I see the expected behaviour described below.
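For clarity, the sub-orchestrator follows roughly the shape sketched below. This is illustrative rather than our production code (the input handling and exact control flow are simplified), but the activity names are the real ones from the history above, and it shows where create_timer sits in the flow.

```python
# Sketch of the sub-orchestrator shape (not production code).
import azure.durable_functions as df
from datetime import timedelta


def orchestrator_function(context: df.DurableOrchestrationContext):
    provider = context.get_input()

    # Kick off the asynchronous update in the third-party system.
    response_url = yield context.call_activity(
        "func_insights_get_update_accounts_response_url_activity", provider
    )

    # Durable timer: wait two minutes for the third-party process to finish.
    fire_at = context.current_utc_datetime + timedelta(minutes=2)
    yield context.create_timer(fire_at)

    # Check the result and persist it.
    result = yield context.call_activity(
        "func_insights_check_update_response_url_and_save_activity", response_url
    )
    return result


main = df.Orchestrator.create(orchestrator_function)
```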
In order to rule out storage account issues I cleared the queues and history tables, but the issue is unfortunately still persisting. Prior to clearing the history tables I could see that some of these orchestrations had been stuck in a scheduled state since 2022-07-03.
In case it helps with your investigation, I have included my host.json below:
```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "datasciencefn",
      "maxConcurrentActivityFunctions": 8,
      "maxConcurrentOrchestratorFunctions": 8
    }
  },
  "functionTimeout": "12:00:00",
  "logging": {
    "logLevel": {
      "Function": "Information"
    },
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": false,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[2.*, 2.8.3)"
  }
}
```
🤔 Expected behavior
When running the functions I would expect them to be scheduled and completed in a reasonable time frame given the level of complexity. For example, in our .NET function app the entire fan-out/fan-in completes in about 20 minutes, with each orchestrator taking only a minute or two to complete after being scheduled. For example:
| Timestamp (UTC) | EventType | EventId | Name | ScheduledTime (UTC) | Result | Details | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 04a2ef749821414fbeb3d85fdf1f0c3f:0 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:00:40Z | 30s944ms | Completed | 88fef2c0-9681-11ec-800b-fd76d223f577 | Completed |
| 04a2ef749821414fbeb3d85fdf1f0c3f:1 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:00:20Z | 10s778ms | Completed | ea4b52d0-e5e1-11ec-973d-21ae3454ae9c | Completed |
| 04a2ef749821414fbeb3d85fdf1f0c3f:10 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:01:03Z | 53s921ms | Completed | de432590-73da-11ec-a577-d72d607b06df | Completed |
| 04a2ef749821414fbeb3d85fdf1f0c3f:100 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:01:27Z | 1m17s | Completed | 883b2ed0-e3f7-11ec-bc34-ef840eea0f91 | Completed |
| 04a2ef749821414fbeb3d85fdf1f0c3f:1000 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:02:45Z | 2m36s | Completed | e6061840-bff3-11ec-a10b-c7b644a8ed8f | Completed |
| 04a2ef749821414fbeb3d85fdf1f0c3f:1001 | ProcessSingleInvestorLimit | 2022-06-27T03:00:09Z | 2022-06-27T03:02:36Z | 2m26s | Completed | 0e398e70-f06c-11ec-b1f7-436ba0158194 | Completed |
☕ Steps to reproduce
- What Durable Functions patterns are you using, if any? The function apps use fan-out/fan-in as well as durable timers (a minimal sketch of the pattern's shape follows this list).
- Any minimal reproducer we can use? I do not have a project that recreates this issue at this time.
- Are you running this locally or on Azure? I am running this on Azure.
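A minimal sketch of the parent's fan-out/fan-in shape, for reference. The provider-listing activity name func_get_providers_activity is a placeholder, not one of our real functions; the sub-orchestrator name is the real one reported above.

```python
# Sketch of the fan-out/fan-in parent orchestrator (illustrative only).
import azure.durable_functions as df


def orchestrator_function(context: df.DurableOrchestrationContext):
    # Placeholder activity that returns the list of providers to process.
    providers = yield context.call_activity("func_get_providers_activity", None)

    # Fan out: one sub-orchestration per provider.
    tasks = [
        context.call_sub_orchestrator(
            "func_insights_update_single_provider_orchestrator", provider
        )
        for provider in providers
    ]

    # Fan in: wait for all sub-orchestrations to complete.
    results = yield context.task_all(tasks)
    return results


main = df.Orchestrator.create(orchestrator_function)
```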
⚡ If deployed to Azure
We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!
- Timeframe issue observed: latest run started at 2022-07-06T10:52:00Z UTC
- Function App name: data-science-live-fn
- Function name(s): func_insights_all_provider_update_orchestrator and func_insights_update_single_provider_orchestrator
- Azure region: North Europe
- Orchestration instance ID(s): a5717aa6769f4ddcab658c4cd7c6092c
- Azure storage account name: ***
If you don’t want to share your Function App or storage account name on GitHub, please at least share the orchestration instance ID. Otherwise it’s extremely difficult to look up information.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 35 (1 by maintainers)
Hi @davidmrdavid. Apologies for the delay; I’ve been busy migrating our service into our .NET function app. Everything is running as expected now that we’ve done some batching and moved it to .NET. When I rewrote it to do batching in Python I was still seeing stalling, hence the migration to .NET as well.
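For context, "batching" here means roughly the following: instead of fanning out every sub-orchestration at once, the list is split into fixed-size chunks and each chunk is awaited before the next is scheduled. The sketch below is illustrative only; the batch size and the provider-listing activity are placeholders, not our actual code.

```python
# Sketch of a batched fan-out (illustrative only, not production code).
import azure.durable_functions as df

BATCH_SIZE = 8  # hypothetical batch size, not the production value


def orchestrator_function(context: df.DurableOrchestrationContext):
    # Placeholder activity that returns the list of providers to process.
    providers = yield context.call_activity("func_get_providers_activity", None)

    results = []
    for start in range(0, len(providers), BATCH_SIZE):
        batch = providers[start:start + BATCH_SIZE]
        # Fan out only the current chunk of sub-orchestrations...
        tasks = [
            context.call_sub_orchestrator(
                "func_insights_update_single_provider_orchestrator", provider
            )
            for provider in batch
        ]
        # ...and fan in before scheduling the next chunk.
        results.extend((yield context.task_all(tasks)))
    return results


main = df.Orchestrator.create(orchestrator_function)
```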
I don’t have a reproducer readily available, I’m afraid; that isn’t our actual Python function, and I’m not sure I’ll have time to write one this week. If you’re okay to wait another week, I should have some time to write one for you?
@davidmrdavid
I had to jump back onto my computer for a minute in the end, so I double-checked the version number and it is indeed on SDK 1.1.5. In terms of trends, it occurs on 3 of the last 4 instances, with the exception of the most recent parent function I ran. At a quick glance I can’t see anything that jumps out at me that would indicate a difference between these. I think it might make sense to let the function run on its own over the weekend now that the children are at least updating again as expected. Then, once we have some more examples, I can provide additional feedback and we can decide next steps, if that sounds good to you?
@davidmrdavid
Noted, thanks so much for your assistance with this. Hopefully this is something simple that I’m just missing, but I feel like I’ve followed the steps required.
With regard to the non-determinism, here are a couple of example instance IDs:
8e8499cb2da146bd9dcd83822ad41e14
70f4e6b5053b42348716048d37268034
In the meantime I’m going to take a look at the high failure rate on that single activity, just to rule out anything stupid on our side that we’ve missed.
@davidmrdavid I’ve reviewed the actual repository for the durable extension bundles and realised my misunderstanding of how they are versioned internally; apologies, that’s my bad for not fully grasping that.
The `2.8.3` maximum version above was added while I was debugging this issue initially, as I thought that perhaps the issue seen by another user in the GitHub issues around try/catch blocks might be causing aberrant behaviour. Initially our host.json had the value `[2.*, 3.0.0)`. Looking at the GitHub repo for the extension bundles, the latest `2.*` release is `2.10.0`, which only targets Durable Task extension version `2.6.1`.
I’m going to try to manually force a Durable Task extension update to version `2.7.2` using the Core Tools and I’ll report back on how I get on once that update has been made, since it would appear that the app has never run on `2.7.2` before if it was targeting a maximum version of `3.0.0` in the extension bundles.
Thanks for your help so far, and I’ll try to get some confirmation for you regarding the outcome within the next day.
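For anyone following along, the manual pin would look roughly like this (as I understand it, and not yet verified in this app): remove the extensionBundle section from host.json and install the extension explicitly with the Core Tools, e.g. `func extensions install --package Microsoft.Azure.WebJobs.Extensions.DurableTask --version 2.7.2`, which generates a local extensions project pinned to that version.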