azure-functions-durable-python: DurableOrchestrationClient.start_new raises an exception when an orchestration instance with the same instance ID already exists, instead of silently replacing it.

🐛 Describe the bug

In the Python azure-functions-durable 1.1.6 SDK, the DurableOrchestrationClient.start_new() method's documentation says:

Start a new instance of the specified orchestrator function. If an orchestration instance with the specified ID already exists, the existing instance will be silently replaced by this new instance.

Instead of silently replacing a Pending instance, it raises an exception:

Exception: Exception: {
    'Message': 'Something went wrong while processing your request',
    'ExceptionMessage': 'An Orchestration instance with the status Pending already exists.',
    'ExceptionType': 'System.InvalidOperationException',
    'StackTrace': '
        at DurableTask.AzureStorage.AzureStorageOrchestrationService.CreateTaskOrchestrationAsync(TaskMessage creationMessage, OrchestrationStatus[] dedupeStatuses) in /_/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1570
        at DurableTask.Core.TaskHubClient.InternalCreateOrchestrationInstanceWithRaisedEventAsync(String orchestrationName, String orchestrationVersion, String orchestrationInstanceId, Object orchestrationInput, IDictionary`2 orchestrationTags, OrchestrationStatus[] dedupeStatuses, String eventName, Object eventData, Nullable`1 startAt) in /_/src/DurableTask.Core/TaskHubClient.cs:line 614
        at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient.Microsoft.Azure.WebJobs.Extensions.DurableTask.IDurableOrchestrationClient.StartNewAsync[T](String orchestratorFunctionName, String instanceId, T input) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:line 210
        at Microsoft.Azure.WebJobs.Extensions.DurableTask.HttpApiHandler.HandleStartOrchestratorRequestAsync(HttpRequestMessage request, String functionName, String instanceId) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\HttpApiHandler.cs:line 875
        at Microsoft.Azure.WebJobs.Extensions.DurableTask.HttpApiHandler.HandleRequestAsync(HttpRequestMessage request) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\HttpApiHandler.cs:line 306'
}

🤔 Expected behavior

What should have happened?

Based on the documentation, I would expect it to replace the existing instance with a new one without raising an exception.
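
For illustration, a minimal sketch of the documented contract as I read it, using a hypothetical HTTP-triggered client function and instance ID (not the actual app from this issue):

    import azure.functions as func
    import azure.durable_functions as df

    async def main(req: func.HttpRequest, starter: str) -> func.HttpResponse:
        client = df.DurableOrchestrationClient(starter)

        # First call creates the instance (it sits in Pending until picked up).
        await client.start_new("orchestrate_parse", instance_id="my-singleton")

        # Per the docstring, this second call with the same instance ID should
        # silently replace the existing instance; instead it surfaces the
        # System.InvalidOperationException shown above while the first instance
        # is still Pending.
        await client.start_new("orchestrate_parse", instance_id="my-singleton")

        return func.HttpResponse("started")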

☕ Steps to reproduce

An orchestration client triggers an orchestration function. In the orchestration client it checks the status:

    AZURE_INACTIVE_ORCHESTRATOR_STATUS_SET = {
        OrchestrationRuntimeStatus.Completed,
        OrchestrationRuntimeStatus.Failed,
        OrchestrationRuntimeStatus.Terminated,
        OrchestrationRuntimeStatus.Pending,
        None,
    }

    if instance.runtime_status in AZURE_INACTIVE_ORCHESTRATOR_STATUS_SET:
        instance_id = await client.start_new(
            INSTANCE_ID, instance_id=INSTANCE_ID, client_input={'time': time_list[i]}
        )
        logging.info(f"Started orchestration with ID = '{instance_id}'.")

So if the orchestration instance is Pending, it should create a new one. The status of the orchestration instance can be flipped in the corresponding storage account table (see the sketch below).
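
For reference, a hedged sketch of flipping that status with the azure-data-tables package, assuming the default Azure Storage backend where instance metadata lives in the <TaskHubName>Instances table keyed by instance ID (the connection string, table name, and instance ID below are placeholders, not values from this issue):

    from azure.data.tables import TableClient, UpdateMode

    # Placeholders -- substitute your storage connection string, task hub name,
    # and the affected instance ID.
    CONN_STR = "<storage-account-connection-string>"
    TABLE_NAME = "<TaskHubName>Instances"
    INSTANCE_ID = "orchestrate_parse_order_da"

    table = TableClient.from_connection_string(CONN_STR, table_name=TABLE_NAME)

    # In the Instances table the PartitionKey is the instance ID and the RowKey is empty.
    entity = table.get_entity(partition_key=INSTANCE_ID, row_key="")
    entity["RuntimeStatus"] = "Pending"  # flip the status to set up the repro
    table.update_entity(entity, mode=UpdateMode.MERGE)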

It can be reproduced locally in VS Code as well.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 44 (1 by maintainers)

Most upvoted comments

The discrepancy you’re seeing might be because there are multiple processes involved which track time differently. For Python, there is the functions host process, which is where the triggers run, and then there’s the python process, which is where your code executes.

[functions_host_process] --> (gRPC) --> [python_process]

The process for invoking a function involves the following steps:

  1. The functions host process receives a message from one of the task hub queues.
  2. The functions host process invokes the Python worker to execute the corresponding function code.
  3. The functions host receives the result of the function execution and updates the orchestration history accordingly.

For your specific example, steps 1-3 are taking 17 minutes. The 230ms is probably the time that your function code actually runs. The remaining time is most likely spent waiting for a Python thread to become available inside the Python process to execute your code. Depending on your version of Python and your app configuration, you might be running on a single-threaded Python worker, in which case any concurrent operations will be blocked waiting for the single Python thread to finish other function executions.

This is actually a really common problem that’s documented here. There are several workarounds that you could consider:

  1. Increase the number of python threads as described in the article I linked to.
  2. Increase the number of python worker processes, also described in that article.
  3. Decrease the function concurrency in your app to match the python concurrency by changing the Durable Functions-specific concurrency throttles described here and here. Decreasing concurrency allows orchestration and activity invocations to be more effectively load-balanced to other app instances, and can help create a signal to the scale infrastructure (if you’re using the Consumption or Elastic Premium SKUs) that your app needs to scale out.

Try adjusting some of these configuration values to see if that improves the reported end-to-end latencies of your functions.
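
For reference, a hedged sketch of where those knobs typically live (the numbers are illustrative placeholders, not recommendations). The Durable Functions concurrency throttles go in host.json under extensions/durableTask:

    {
      "version": "2.0",
      "extensions": {
        "durableTask": {
          "maxConcurrentActivityFunctions": 4,
          "maxConcurrentOrchestratorFunctions": 4
        }
      }
    }

The Python-side parallelism is controlled by the FUNCTIONS_WORKER_PROCESS_COUNT and PYTHON_THREADPOOL_THREAD_COUNT app settings.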

Normally the “locks” on these messages are never released, but instead the message is deleted when the processing is complete. Even an entity failing with an error counts as a success, so the message will be deleted. Only in the event of a major failure, like a process restart, will the message become visible again (it takes 5 minutes before the default visibility timeout expires).

If the number of queue messages increases, it means there are more events to be processed by your orchestrations and/or entities.
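
A hedged sketch of watching those counts with the azure-storage-queue package, assuming the default Azure Storage backend where the control queues are named <taskhub>-control-NN and the work-item queue <taskhub>-workitems (the connection string and task hub name are placeholders):

    from azure.storage.queue import QueueServiceClient

    # Placeholders -- substitute your storage connection string and task hub name.
    CONN_STR = "<storage-account-connection-string>"
    TASK_HUB = "<taskhubname>"

    service = QueueServiceClient.from_connection_string(CONN_STR)

    # Print the approximate backlog of each control/work-item queue for the task hub.
    for queue in service.list_queues(name_starts_with=TASK_HUB):
        props = service.get_queue_client(queue.name).get_queue_properties()
        print(queue.name, props.approximate_message_count)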

Hi @davidmrdavid ,

Managed to capture the issue this week. In the list below, the first column is the instance ID and the second is the name of the actual function.

prod-epex-scraper-func:

    Instance ID                        Function            Timestamp             Duration  Status
    orchestrate_parse_order_da         orchestrate_parse   2023-02-18T23:10:00Z  20ms      Pending
    orchestrate_parse_order_history    orchestrate_parse   2023-02-18T23:07:00Z  13ms      Pending
    orchestrate_parse_result_da        orchestrate_parse   2023-02-18T23:07:00Z  14ms      Pending
    orchestrate_parse_result_history   orchestrate_parse   2023-02-18T23:05:00Z  16ms      Pending
    orchestrate_scrape_archive_order   orchestrate_scrape  2023-02-18T23:10:00Z  13ms      Pending
    orchestrate_scrape_archive_result  orchestrate_scrape  2023-02-18T23:05:00Z  20ms      Pending
    orchestrate_scrape_current_order   orchestrate_scrape  2023-02-18T23:05:00Z  31ms      Pending
    orchestrate_scrape_current_result  orchestrate_scrape  2023-02-18T23:17:00Z  14ms      Pending

test-epex-scraper-func:

    Instance ID                        Function            Timestamp             Duration  Status
    orchestrate_parse_order_da         orchestrate_parse   2023-02-16T12:40:00Z  26ms      Pending
    orchestrate_parse_order_history    orchestrate_parse   2023-02-16T12:37:00Z  42ms      Pending
    orchestrate_parse_result_da        orchestrate_parse   2023-02-16T12:37:00Z  26ms      Pending
    orchestrate_parse_result_history   orchestrate_parse   2023-02-16T12:35:00Z  130ms     Pending
    orchestrate_scrape_archive_order   orchestrate_scrape  2023-02-16T12:40:00Z  48ms      Pending
    orchestrate_scrape_archive_result  orchestrate_scrape  2023-02-16T12:35:00Z  53ms      Pending
    orchestrate_scrape_current_order   orchestrate_scrape  2023-02-16T12:35:00Z  22ms      Pending
    orchestrate_scrape_current_result  orchestrate_scrape  2023-02-16T12:32:00Z  138ms     Pending

Hi @davidmrdavid ,

Will provide the instance ID and the time stamp when it happens next.