azure-functions-durable-python: Orchestrations stuck in running state
Description
I am using fan-out/fan-in to do a two-step calculation. The first step, “CalculateRouteProfile”, calls a third-party API that returns data used in the second step, “CalculateCo2FromRouteProfile”.
The calculation is triggered by an HTTPStarter.
When running this for about 15’000 inputs, some of the calculations never seem to finish (most of them do, though). The ones that do not finish are stuck in the “Running” state.
Example of a finished calculation: c553159a6c994173bf6c4381376954dc. Examples of not finished calculations follow below.
Expected behavior
All ~15’000 inputs are processed.
Actual behavior
Out of 754 orchestrations, all but 14 eventually complete. The remaining 14 stay in the “Running” state.
Relevant source code snippets
import logging

import azure.durable_functions as df
from azure.durable_functions.models.RetryOptions import RetryOptions

MAX_ROUTES_PER_POST_REQUEST = 100
FAN_OUT_CHUNK_SIZE = 25


def chunks(lst, n):
    """Yield successive n-sized chunks from list."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def orchestrator_function(context: df.DurableOrchestrationContext):
    """Fan out over chunks: calculate route profiles first, then CO2 per profile."""
    list_of_route_params = context.get_input()
    logging.info(
        f"client input in orchestrator function: {len(list_of_route_params)} items."
    )
    if len(list_of_route_params) > 0:
        logging.info(
            f"first item: {list_of_route_params[0]}."
        )
    custom_status = {
        "msg": "just started",
        "total_number_of_routes": len(list_of_route_params),
        "current_chunk": None,
        "routes_in_current_chunk": None
    }
    context.set_custom_status(custom_status)

    out = []  # this will contain our final output

    # We need to avoid having one call immediately try to calculate thousands
    # of routes without limits. So we split the total list of input routes
    # into chunks, with FAN_OUT_CHUNK_SIZE as the limit per chunk.
    chunk_list = list(chunks(list_of_route_params, FAN_OUT_CHUNK_SIZE))
    logging.info(
        (f"I have {len(chunk_list)} chunks with chunk size {FAN_OUT_CHUNK_SIZE} "
         f"for calculating {len(list_of_route_params)} route items")
    )
    custom_status["number_of_chunks"] = len(chunk_list)
    context.set_custom_status(custom_status)

    for idx, chunk_of_route_params in enumerate(chunk_list):
        # this is a chained fan-out/fan-in call: first get the profiles,
        # then calculate the emissions from each profile
        custom_status["msg"] = "processing route profiles"
        custom_status["routes_in_current_chunk"] = len(chunk_of_route_params)
        custom_status["current_chunk"] = idx
        context.set_custom_status(custom_status)

        parallel_tasks = []
        retry_options = RetryOptions(1000, 3)
        for route_params in chunk_of_route_params:
            # first calculate the routes including the profile
            parallel_tasks.append(
                context.call_activity_with_retry("CalculateRouteProfile",
                                                 retry_options, route_params)
            )
        chunk_profiles = yield context.task_all(parallel_tasks)

        custom_status["msg"] = "processing from route profiles"
        context.set_custom_status(custom_status)

        parallel_tasks = []
        for profile in chunk_profiles:
            # calculate emission based on the profile
            parallel_tasks.append(
                context.call_activity(
                    "CalculateCo2FromRouteProfile",
                    {
                        "profile": profile["serialized_profile"],
                        "route_params": profile["route_params"]
                    },
                )
            )
        out += yield context.task_all(parallel_tasks)

    custom_status["msg"] = "done with processing"
    custom_status["routes_in_current_chunk"] = len(chunk_list[-1])
    custom_status["current_chunk"] = len(chunk_list)
    context.set_custom_status(custom_status)
    return out


main = df.Orchestrator.create(orchestrator_function)
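The HTTP starter mentioned in the description is not shown above. For context, below is a minimal sketch of a typical Durable Functions Python HTTP starter; the function and route names are placeholders, not necessarily the exact ones used in this app.

```python
# HttpStarter/__init__.py (sketch): starts the orchestrator with the POSTed
# list of route parameters and returns the standard status-check response.
import azure.functions as func
import azure.durable_functions as df


async def main(req: func.HttpRequest, starter: str) -> func.HttpResponse:
    client = df.DurableOrchestrationClient(starter)
    # The request body is the list of route parameters handed to the orchestrator.
    list_of_route_params = req.get_json()
    instance_id = await client.start_new(
        req.route_params["functionName"], None, list_of_route_params
    )
    return client.create_check_status_response(req, instance_id)
```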
host.json:
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[2.*, 3.0.0)"
  },
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 3,
      "maxConcurrentOrchestratorFunctions": 1
    }
  }
}
Known workarounds
Using small batches of input (around 100 routes) seems to work reliably.
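As a rough sketch (the URL, helper name, and use of the requests library are illustrative, not the actual client code), the calling side simply keeps each POST to the HTTP starter at or below MAX_ROUTES_PER_POST_REQUEST routes:

```python
# Sketch of the client-side workaround: pre-batch the full input so each
# orchestration only receives a small number of routes.
import requests  # assumption: any HTTP client would do

MAX_ROUTES_PER_POST_REQUEST = 100
STARTER_URL = "https://<app>.azurewebsites.net/api/orchestrators/<OrchestratorName>"  # placeholder


def submit_in_batches(all_route_params):
    """Start one orchestration per batch and collect the status URLs."""
    status_urls = []
    for i in range(0, len(all_route_params), MAX_ROUTES_PER_POST_REQUEST):
        batch = all_route_params[i:i + MAX_ROUTES_PER_POST_REQUEST]
        resp = requests.post(STARTER_URL, json=batch)
        resp.raise_for_status()
        status_urls.append(resp.json()["statusQueryGetUri"])
    return status_urls
```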
App Details
- Durable Functions extension version (e.g. v1.8.3): 1.3.0 (just called “Azure Functions” in VS Code)
- Azure Functions runtime version (1.0 or 2.0): 2.0
- Programming language used: Python
If deployed to Azure
- Timeframe issue observed: “createdTime”: “2021-03-31T15:29:19Z”
- Function App name: co2-route-calculator-dev
- Function name(s): CalculateCo2FromRouteProfile, CalculateRouteProfile
- Azure region: Switzerland North
- Orchestration instance ID(s): a9ee8c72713644619b3e1d1eaeed7811, 1d986a195d1f4708be368e86c9e5cd06, adacc094cf934a68a5ae45f4562ec961, …
- Azure storage account name: co2routecalculatordev
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 33
@scherand, we have found this is a race condition if a crash happens during the last replay of an orchestration between updating the orchestration history and updating the state table. We have filed an issue here: https://github.com/Azure/durabletask/issues/561.
This bug has been around for a couple of years, so I am surprised we haven’t seen it until now. I’m presuming it is incredibly rare, but we will make sure to fix this for our next Durable Functions extension release.
Great! In that case, it appears to me that we’ve fixed the issue originally described in this thread 😃 . As Connor mentioned, the final issue you’re seeing is now being tracked in another thread, and we’ll look to patch it as soon as possible.
I’ll be closing this thread for now, but do feel free to re-open it if the original issue re-occurs. Thank you for your collaboration and patience with us throughout this process. We’ve fixed some major correctness concerns in the course of investigating this incident, so this was really productive on our end. Thank you, and reach out again if you need anything!
Hi @davidmrdavid
Thanks heaps for following up with the questions about reverting the change and GitHub Actions for deployment. Sounds great to me 😃.
I agree about the warnings: since they are gone and I have not run into any issues yet, I assume all is good. I need to better understand the build process at some point; that might provide additional clues. Somehow it seems that the build process (Oryx?) is using a different environment (venv?) than the one configured in VS Code. For a long time (I believe since before I even opened this issue) the build output has been reporting that a newer version of pip is available, even though the Python shell in VS Code shows that the newest version is already installed in the venv it uses.
Now for the fun part 😃:
This is about the “stuck” orchestration 458aa5e1e7b04f0096ac072d251f4f89 again. It is correct that I believe the execution actually succeeded. I deduce this from the entries in the history table of the durable function, mainly the two entries stating “ExecutionCompleted” and “OrchestratorCompleted”. And it is also true that the information in Application Insights does not seem to reflect everything from the history table: mainly, I only see twelve calls to CalculationCleanup instead of the expected 25.
BUT, and that seems to be a bit more serious, when calling the statusQueryGetUri (the one like https://...azurewebsites.net/runtime/webhooks/durabletask/instances/458aa5e1e7b04f0096ac072d251f4f89?taskHub=...&connection=Storage&code=0e...w==) it comes back with a runtimeStatus of Running. So client code assumes the function is not done (yet). At the end of the day I do not know if this is “just” related to a failing propagation of information to Application Insights or if this is an issue on its own.
Cheers
Hi @davidmrdavid
Thank you so much for the update and the detailed instructions. I will look into this as soon as I can. I might have to wait a couple of days, though, before I can run another batch of calculations. I will update this issue as soon as I have had a chance to tackle this.
Hi @davidmrdavid
Thanks heaps for this update! This is exciting news. Of course I hope that the bug(s) fixed were the root cause of the issue we experienced, but I can only guess at the complexity of the system and understand that only testing will determine this for certain.
For now, let me tell you that I am very impressed with the support you guys provide. The effort and transparency are exceptional. I would not have dreamed this when I created this issue. Keep that spirit up!
Looking forward to hearing from you again.
Hi @scherand!
Great to hear that Application Insights is enabled, that’s half the battle! My understanding is that logging to Application Insights from a Python application requires just a tad more setup. There might be an easier way to do this, but let me give you my way and hopefully that’s enough to fast-track us towards getting this runtime data.
So we’re going to follow the general setup outlined here. In particular, you’ll want to:
(1) Install the OpenCensus handler for Azure Functions. To do this, please add the following dependency to your requirements.txt: opencensus-ext-azure
As an example, a full requirements.txt for an app like this might look roughly like the snippet below:
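```
# Hypothetical minimal example; the exact packages and version pins will vary per project.
azure-functions
azure-functions-durable
opencensus-ext-azure
```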
(2) Next, we’ll set up the OpenCensus handler in your orchestrator file. You just need to import the handler and provide your Application Insights InstrumentationKey when attaching it to a logger. So, for example, the top of your orchestrator file might now look roughly as follows (a sketch assuming the standard AzureLogHandler from opencensus-ext-azure):
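```python
# Sketch of the top of the orchestrator file; replace <your-instrumentation-key>
# with the key from your Application Insights resource.
import logging

import azure.durable_functions as df
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(
    AzureLogHandler(connection_string="InstrumentationKey=<your-instrumentation-key>")
)
```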
(3) Now we can start logging! The API is just the standard logging calls on the logger configured above, so a simple orchestrator that logs to Application Insights might look roughly like this sketch:
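```python
# Minimal sketch, not the app's real orchestrator: "MyPrefix" and the "Hello"
# activity are placeholders used only to illustrate the logging calls.
def orchestrator_function(context: df.DurableOrchestrationContext):
    logger.info("MyPrefix: orchestrator started")
    result = yield context.call_activity("Hello", "world")
    logger.info(f"MyPrefix: activity returned {result}")
    return result


main = df.Orchestrator.create(orchestrator_function)
```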
(4) Now we can find your logs by going to your Application Insights resource in the Azure portal, then to the Logs option, and finally writing a query against the traces table. For the example above, the query would look roughly like:
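```
// "MyPrefix" is whatever prefix you used in your log messages (see the sketch above)
traces
| where message startswith "MyPrefix"
| order by timestamp desc
```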
And that’s about it!
Try this out with a simple test orchestration and let us know if you’re able to locate your own logs in Application Insights. Afterwards, we can make your logs more granular by including, in the prefix, the instanceID of your orchestration. You can access this value via context._instance_id. With that, we should have enough to find the data we’re looking for! Please let me know if anything about this was unclear, I realize this comment is rather long by now ⚡. Thanks!

@davidmrdavid, it’s a bit hard to tell, but it appears that a failure in the fan-out call_activity_with_retry created something weird with the Python implementation. We may want to see if we can repro this ourselves.
@pcopley, each of these stuck orchestration bugs can have fairly independent root causes, so I would recommend opening a separate ticket for your issue.