azure-functions-durable-extension: Activity functions fire multiple times for same input and don't all complete

Description

Occasionally, multiple instances of an activity function start at the same time for the same input. Also occasionally, an activity function stops running without returning or throwing an exception.

Our function app loads large amounts of data from a file and stores it in a database. We separate the records into batches and fan out. Because of our database writes, our activity functions are not idempotent (though we attempted to simulate idempotency with one workaround).
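For context on what an idempotent write could look like here, the following is a minimal sketch, written in Python rather than the app's C# and using an in-memory SQLite database. The table name `clients`, the columns `client_id` and `name`, and the function `save_batch` are all hypothetical, and the sketch assumes each record has a natural unique key, which may not hold for the real data.

```python
import sqlite3


def save_batch(conn, batch):
    """Write a batch keyed by client_id so that repeating the call is harmless.

    Hypothetical schema: clients(client_id PRIMARY KEY, name).
    """
    with conn:  # one transaction per batch; rolled back if any row fails
        conn.executemany(
            # INSERT OR REPLACE overwrites a row with the same primary key
            # instead of adding a duplicate or raising a constraint error.
            "INSERT OR REPLACE INTO clients (client_id, name) VALUES (?, ?)",
            batch,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (client_id TEXT PRIMARY KEY, name TEXT)")

batch = [("c1", "Acme"), ("c2", "Globex")]
save_batch(conn, batch)
save_batch(conn, batch)  # simulated duplicate execution of the same activity

row_count = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
```

With a write shaped like this, a second execution for the same batch leaves the table unchanged, so duplicate activity executions would no longer produce duplicate rows or constraint violations.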

Expected behavior

We expected that only one instance of an activity function would be running for a given input at any one time (i.e., if a second instance were created, it would not start at almost exactly the same time as the first).

We also expected that if an activity function is started, it would run to completion. This is the assumption that we based at least one of our workarounds on.

Actual behavior

Original actual behavior: we observed duplicate records being added to the database (or exceptions when unique constraints were violated). This is what alerted us to the duplicate activity functions.

We added code that uses table storage to ensure that only one activity function instance would add records (the “primary instance” of the activity function for the batch). Duplicate instances of that activity function for the batch would return immediately. In this case, we observed that not all the records were added to the database. Our hypothesis was that the framework stops still-running instances with the same input once another instance completes: because the primary instance took longer to complete than the duplicates that returned immediately, it would have been the one terminated.

Then we added code so that duplicate instances of an activity function for a batch would wait for the primary instance to complete (i.e., simulating idempotency, sort of). In this case, we observed several cases in which the primary instance never completed but also did not throw any exceptions. The duplicate instance eventually timed out and threw an exception.
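The guard-and-wait workaround described above might look roughly like the following. This is a sketch in Python rather than the app's C#, and it is not the code from the function app. `ClaimStore` is a hypothetical in-memory stand-in for a table that supports an atomic insert-if-absent; `save_clients`, `timeout_s`, and `poll_s` are likewise illustrative names and values.

```python
import threading
import time


class ClaimStore:
    """Hypothetical in-memory stand-in for a table with atomic insert-if-absent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}  # batch_id -> "running" or "done"

    def try_claim(self, batch_id):
        # Only the first caller for a batch gets True and becomes the primary.
        with self._lock:
            if batch_id in self._state:
                return False
            self._state[batch_id] = "running"
            return True

    def mark_done(self, batch_id):
        with self._lock:
            self._state[batch_id] = "done"

    def is_done(self, batch_id):
        with self._lock:
            return self._state.get(batch_id) == "done"


def save_clients(store, batch_id, write_batch, timeout_s=5.0, poll_s=0.01):
    """The primary writes the batch; duplicates wait for it or time out."""
    if store.try_claim(batch_id):
        write_batch()
        store.mark_done(batch_id)
        return "primary"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if store.is_done(batch_id):
            return "duplicate"
        time.sleep(poll_s)
    raise TimeoutError(f"primary instance for batch {batch_id} never completed")


# Two concurrent executions for the same batch: exactly one performs the write.
store, writes, results = ClaimStore(), [], []


def slow_write():
    time.sleep(0.05)
    writes.append("batch-1")


def run():
    results.append(save_clients(store, "batch-1", slow_write))


threads = [threading.Thread(target=run) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A primary that claimed the batch and then stalled: the duplicate times out.
stalled = ClaimStore()
stalled.try_claim("batch-2")
try:
    save_clients(stalled, "batch-2", slow_write, timeout_s=0.05)
    timed_out = False
except TimeoutError:
    timed_out = True
```

The second scenario mirrors the failure reported here: if the primary stops without marking the batch done, the claim stays in the "running" state and every duplicate can only time out. A guard of this shape depends entirely on the primary running to completion.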

Our standard test splits the client data (where we see the issues) into 40 batches. For the orchestration instance referenced below, we saw 45 executions of SaveClients.

Attempted workarounds

Upgraded to 2.3.0 (seemed somewhat better)
Added a guard to limit to one writing instance per batch
Caused duplicate instances to wait on the primary instance to complete

App Details

Durable Functions extension version (e.g. v1.8.3): 2.3.0 (also observed with 2.2.2, 2.3.0 seems somewhat better)
Azure Functions runtime version (1.0 or 2.0): 2.0
Programming language used: C#

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

Timeframe issue observed: 2020-09-03 8:20-8:30 US Central time
Function App name: GlobalDataProd
Function name(s): LoadMarketJson -> OrchestrateDataLoad -> SaveClients
Azure region: CentralUS
Orchestration instance ID(s): c2a41d9d35f14bccb3b806e0a99da556
Azure storage account name: globaldatafunctionprod

(We have other instances of failures if needed.)

If you don’t want to share your Function App or storage account name on GitHub, please at least share the orchestration instance ID. Otherwise it’s extremely difficult to look up information.

About this issue

State: closed
Created 4 years ago
Comments: 15 (1 by maintainers)

Most upvoted comments

Thanks @mattschoutends for these details. I found your orchestration instance in our internal telemetry and can confirm that it did run into some really odd behavior. I’ll continue to take a look and see if I can determine the root cause and a possible fix/workaround.

cgillum on Sep 3, 2020