azure-functions-durable-python: Orchestrations stuck in running state
Description
I am using fan-out/fan-in to do a two-step calculation. The first step, “CalculateRouteProfile”, calls a third-party API that returns data used in the second step, “CalculateCo2FromRouteProfile”.
The calculation is triggered by an HTTPStarter.
When running this for about 15’000 inputs, some of the calculations never seem to finish (most of them do, though). The ones that do not finish are stuck in the “Running” state.
Example of a finished calculation: c553159a6c994173bf6c4381376954dc. Examples of not finished calculations follow below.
Expected behavior
All ~15’000 inputs are processed.
Actual behavior
Out of 754 orchestrations, all but 14 eventually complete. The remaining 14 stay in the “Running” state.
Relevant source code snippets
import logging

import azure.durable_functions as df
from azure.durable_functions.models.RetryOptions import RetryOptions

MAX_ROUTES_PER_POST_REQUEST = 100
FAN_OUT_CHUNK_SIZE = 25


def chunks(lst, n):
    """Yield successive n-sized chunks from list."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def orchestrator_function(context: df.DurableOrchestrationContext):
    """Fan out over chunks: calculate route profiles first, then CO2 per profile."""
    list_of_route_params = context.get_input()
    logging.info(
        f"client input in orchestrator function: {len(list_of_route_params)} items."
    )
    if len(list_of_route_params) > 0:
        logging.info(
            f"first item: {list_of_route_params[0]}."
        )
    custom_status = {
        "msg": "just started",
        "total_number_of_routes": len(list_of_route_params),
        "current_chunk": None,
        "routes_in_current_chunk": None
    }
    context.set_custom_status(custom_status)

    out = []  # this will contain our final output

    # We need to avoid having one call immediately try to calculate thousands
    # of routes without limits. So we split the total list of input routes
    # into chunks, with FAN_OUT_CHUNK_SIZE as the limit per chunk.
    chunk_list = list(chunks(list_of_route_params, FAN_OUT_CHUNK_SIZE))
    logging.info(
        (f"I have {len(chunk_list)} chunks with chunk size {FAN_OUT_CHUNK_SIZE} "
         f"for calculating {len(list_of_route_params)} route items")
    )
    custom_status["number_of_chunks"] = len(chunk_list)
    context.set_custom_status(custom_status)

    for idx, chunk_of_route_params in enumerate(chunk_list):
        # this is a chained fan-out/fan-in call: first get the profiles,
        # then calculate the emissions from each profile
        custom_status["msg"] = "processing route profiles"
        custom_status["routes_in_current_chunk"] = len(chunk_of_route_params)
        custom_status["current_chunk"] = idx
        context.set_custom_status(custom_status)

        parallel_tasks = []
        retry_options = RetryOptions(1000, 3)
        for route_params in chunk_of_route_params:
            # first calculate the routes including the profile
            parallel_tasks.append(
                context.call_activity_with_retry("CalculateRouteProfile",
                                                 retry_options, route_params)
            )
        chunk_profiles = yield context.task_all(parallel_tasks)

        custom_status["msg"] = "processing from route profiles"
        context.set_custom_status(custom_status)

        parallel_tasks = []
        for profile in chunk_profiles:
            # calculate emission based on the profile
            parallel_tasks.append(
                context.call_activity(
                    "CalculateCo2FromRouteProfile",
                    {
                        "profile": profile["serialized_profile"],
                        "route_params": profile["route_params"]
                    },
                )
            )
        out += yield context.task_all(parallel_tasks)

    custom_status["msg"] = "done with processing"
    custom_status["routes_in_current_chunk"] = len(chunk_list[-1])
    custom_status["current_chunk"] = len(chunk_list)
    context.set_custom_status(custom_status)
    return out


main = df.Orchestrator.create(orchestrator_function)
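The HTTP starter mentioned in the description is not shown above. For context, below is a minimal sketch of a typical Durable Functions Python HTTP starter; the function and route names are placeholders, not necessarily the exact ones used in this app.

```python
# HttpStarter/__init__.py (sketch): starts the orchestrator with the POSTed
# list of route parameters and returns the standard status-check response.
import azure.functions as func
import azure.durable_functions as df


async def main(req: func.HttpRequest, starter: str) -> func.HttpResponse:
    client = df.DurableOrchestrationClient(starter)
    # The request body is the list of route parameters handed to the orchestrator.
    list_of_route_params = req.get_json()
    instance_id = await client.start_new(
        req.route_params["functionName"], None, list_of_route_params
    )
    return client.create_check_status_response(req, instance_id)
```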
host.json:
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[2.*, 3.0.0)"
  },
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 3,
      "maxConcurrentOrchestratorFunctions": 1
    }
  }
}
Known workarounds
Using small batches of input (around 100 routes) seems to work reliably.
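As a rough sketch (the URL, helper name, and use of the requests library are illustrative, not the actual client code), the calling side simply keeps each POST to the HTTP starter at or below MAX_ROUTES_PER_POST_REQUEST routes:

```python
# Sketch of the client-side workaround: pre-batch the full input so each
# orchestration only receives a small number of routes.
import requests  # assumption: any HTTP client would do

MAX_ROUTES_PER_POST_REQUEST = 100
STARTER_URL = "https://<app>.azurewebsites.net/api/orchestrators/<OrchestratorName>"  # placeholder


def submit_in_batches(all_route_params):
    """Start one orchestration per batch and collect the status URLs."""
    status_urls = []
    for i in range(0, len(all_route_params), MAX_ROUTES_PER_POST_REQUEST):
        batch = all_route_params[i:i + MAX_ROUTES_PER_POST_REQUEST]
        resp = requests.post(STARTER_URL, json=batch)
        resp.raise_for_status()
        status_urls.append(resp.json()["statusQueryGetUri"])
    return status_urls
```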
App Details
- Durable Functions extension version (e.g. v1.8.3): 1.3.0 (just called “Azure Functions” in VS Code)
- Azure Functions runtime version (1.0 or 2.0): 2.0
- Programming language used: Python
If deployed to Azure
- Timeframe issue observed: “createdTime”: “2021-03-31T15:29:19Z”
- Function App name: co2-route-calculator-dev
- Function name(s): CalculateCo2FromRouteProfile, CalculateRouteProfile
- Azure region: Switzerland North
- Orchestration instance ID(s): a9ee8c72713644619b3e1d1eaeed7811, 1d986a195d1f4708be368e86c9e5cd06, adacc094cf934a68a5ae45f4562ec961, …
- Azure storage account name: co2routecalculatordev
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 33
@scherand, we have found this is a race condition if a crash happens during the last replay of an orchestration between updating the orchestration history and updating the state table. We have filed an issue here: https://github.com/Azure/durabletask/issues/561.
This bug has been around for a couple of years, so I am surprised we haven’t seen it until now. I’m presuming it is incredibly rare, but we will make sure to fix this for our next Durable Functions extension release.
Great! In that case, it appears to me that we’ve fixed the issue originally described in this thread 😃 . As Connor mentioned, the final issue you’re seeing is now being tracked in another thread, and we’ll look to patch it as soon as possible.
I’ll be closing this thread for now, but do feel free to re-open it if the original issue re-occurs. Thank you for your collaboration and patience with us throughout this process. We’ve fixed some major correctness concerns in the course of investigating this incident, so this was really productive on our end. Thank you, and reach out again if you need anything!
Hi @davidmrdavid
Thanks heaps for following up with the questions about reverting the change and GitHub Actions for deployment. Sounds great to me 😃.
I agree about the warnings: since they are gone and I have not run into any issues yet, I assume all is good. I need to better understand the build process at some point; that might provide additional clues. Somehow it seems that the build process (Oryx?) is using a different environment (venv?) than the one configured in VS Code. For a long time (I believe since before I even opened this issue) the build output has been reporting that a newer version of pip is available, even though the Python shell in VS Code shows that the newest version is already installed in the venv it uses.
Now for the fun part 😃:
This is about the “stuck” orchestration 458aa5e1e7b04f0096ac072d251f4f89 again. It is correct that I believe the execution actually succeeded. I deduce this from the entries in the history table of the durable function, mainly the two entries stating “ExecutionCompleted” and “OrchestratorCompleted”. And it is also true that the information in Application Insights does not seem to reflect everything from the history table: mainly, I only see twelve calls to CalculationCleanup instead of the expected 25.
BUT, and that seems to be a bit more serious, when calling the statusQueryGetUri (the one like https://...azurewebsites.net/runtime/webhooks/durabletask/instances/458aa5e1e7b04f0096ac072d251f4f89?taskHub=...&connection=Storage&code=0e...w==) it comes back with a runtimeStatus of Running. So client code assumes the function is not done (yet). At the end of the day I do not know if this is “just” related to a failing propagation of information to Application Insights or if this is an issue on its own.
Cheers
Hi @davidmrdavid
Thank you so much for the update and the detailed instructions. I will look into this as soon as I can. I might have to wait a couple of days, though, before I can run another batch of calculations. I will update this issue as soon as I have had a chance to tackle this.
Hi @davidmrdavid
Thanks heaps for this update! This is exciting news. Of course I hope that the bug(s) fixed were the root cause of the issue we experienced, but I can only guess at the complexity of the system and understand that only testing will determine this for certain.
For now, let me tell you that I am very impressed with the support you guys provide. The effort and transparency are exceptional. I would not have dreamed this when I created this issue. Keep that spirit up!
Looking forward to hearing from you again.
Hi @scherand!
Great to hear that Application Insights is enabled, that’s half the battle! My understanding is that logging to Application Insights from a Python application requires just a tad more setup. There might be an easier way to do this, but let me give you my way and hopefully that’s enough to fast-track us towards getting this runtime data.
So we’re going to follow the general setup outlined here. In particular, you’ll want to:
(1) Install the OpenCensus handler for Azure Functions. To do this, please add the following dependency to your requirements.txt: opencensus-ext-azure
As an example, a full requirements.txt for an app like this might look roughly like the snippet below:
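```
# Hypothetical minimal example; the exact packages and version pins will vary per project.
azure-functions
azure-functions-durable
opencensus-ext-azure
```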
(2) Next, we’ll set up the OpenCensus handler in your orchestrator file. You just need to import the handler and provide your Application Insights InstrumentationKey when attaching it to a logger. So, for example, the top of your orchestrator file might now look roughly as follows (a sketch assuming the standard AzureLogHandler from opencensus-ext-azure):
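```python
# Sketch of the top of the orchestrator file; replace <your-instrumentation-key>
# with the key from your Application Insights resource.
import logging

import azure.durable_functions as df
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(
    AzureLogHandler(connection_string="InstrumentationKey=<your-instrumentation-key>")
)
```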
(3) Now we can start logging! The API is just the standard logging calls on the logger configured above, so a simple orchestrator that logs to Application Insights might look roughly like this sketch:
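```python
# Minimal sketch, not the app's real orchestrator: "MyPrefix" and the "Hello"
# activity are placeholders used only to illustrate the logging calls.
def orchestrator_function(context: df.DurableOrchestrationContext):
    logger.info("MyPrefix: orchestrator started")
    result = yield context.call_activity("Hello", "world")
    logger.info(f"MyPrefix: activity returned {result}")
    return result


main = df.Orchestrator.create(orchestrator_function)
```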
(4) Now we can find your logs by going to your Application Insights resource in the Azure portal, then to the Logs option, and finally writing a query against the traces table. For the example above, the query would look roughly like:
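```
// "MyPrefix" is whatever prefix you used in your log messages (see the sketch above)
traces
| where message startswith "MyPrefix"
| order by timestamp desc
```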
And that’s about it!
Try this out with a simple test orchestration and let us know if you’re able to locate your own logs in Application Insights. Afterwards, we can make your logs more granular by including, in the prefix, the instanceID of your orchestration. You can access this value via context._instance_id. With that, we should have enough to find the data we’re looking for! Please let me know if anything about this was unclear, I realize this comment is rather long by now ⚡. Thanks!

@davidmrdavid, it’s a bit hard to tell, but it appears that a failure in the fan-out call_activity_with_retry created something weird with the Python implementation. We may want to see if we can repro this ourselves.
@pcopley, each of these stuck orchestration bugs can have fairly independent root causes, so I would recommend opening a separate ticket for your issue.