azure-functions-durable-extension: Activity functions fire multiple times for same input and don't all complete
Description
Occasionally, multiple instances of an activity function start at the same time for the same input. Occasionally, activity functions also stop running without returning or throwing an exception.
Our function app loads large amounts of data from a file and stores it to a database. We separate the records into batches and fan out. Because of our database writes, our activity functions are not idempotent (though we attempted to simulate idempotency with one workaround).
Expected behavior
We expected that only one instance of an activity function would be started for a given input at any one time (i.e. if a second instance were created, it would not start at almost exactly the same time).
We also expected that if an activity function is started, it would run to completion. This is the assumption that we based at least one of our workarounds on.
Actual behavior
Original actual behavior: we observed duplicate records being added to the database (or exceptions when unique constraints were violated). This is what alerted us to the duplicate activity functions.
We added code to use table storage to ensure that only one activity function would add records (the “primary instance” of the activity function for the batch). Duplicate instances of that activity function for the batch would return immediately. In this case, we observed that not all the records would be added to the database. We assumed that the framework would stop still-running instances with the same input when another instance completed. The primary instance took longer to complete, so it would be terminated.
Then we added code so duplicate instances of activity functions for a batch would wait for the primary instance to complete (i.e. simulating idempotency, sort of). In this case, we observed several instances where the primary instance never completed, but the primary instance also did not throw any exceptions. The duplicate instance eventually timed out and threw an exception.
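For readers trying to picture the two guard workarounds above, here is a minimal sketch of the claim-then-wait pattern. It is not our production code. An in-memory `ConcurrentDictionary` stands in for the table storage row, and `BatchGuard`, `SaveClientsSketch`, the batch key, the timeout values and the `saveBatchAsync` delegate are all hypothetical names introduced for illustration. Durable Functions documents at-least-once execution for activity functions, so some guard of this kind is needed whenever the activity body is not idempotent.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical states for a batch row; the real guard used table storage.
public enum BatchState { Claimed, Completed }

// In-memory stand-in (assumption) for the table-storage guard.
// TryAdd plays the role of an insert that fails if the row already exists.
public sealed class BatchGuard
{
    private readonly ConcurrentDictionary<string, BatchState> _rows =
        new ConcurrentDictionary<string, BatchState>();

    // True only for the first caller for this batch (the "primary instance").
    public bool TryClaim(string batchKey) =>
        _rows.TryAdd(batchKey, BatchState.Claimed);

    // Called by the primary after its database writes have finished.
    public void MarkCompleted(string batchKey) =>
        _rows[batchKey] = BatchState.Completed;

    private bool IsCompleted(string batchKey) =>
        _rows.TryGetValue(batchKey, out var state) && state == BatchState.Completed;

    // Duplicates poll until the primary completes or the timeout elapses.
    public async Task<bool> WaitForCompletionAsync(
        string batchKey, TimeSpan timeout, TimeSpan pollInterval)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            if (IsCompleted(batchKey)) return true;
            await Task.Delay(pollInterval);
        }
        return IsCompleted(batchKey);
    }
}

public static class SaveClientsSketch
{
    // Assumed shape of the activity body; saveBatchAsync stands in for the database write.
    public static async Task<string> RunAsync(
        BatchGuard guard, string batchKey, Func<Task> saveBatchAsync)
    {
        if (guard.TryClaim(batchKey))
        {
            await saveBatchAsync();
            guard.MarkCompleted(batchKey);
            return "primary";
        }

        // Duplicate instance: wait for the primary instead of writing again.
        bool done = await guard.WaitForCompletionAsync(
            batchKey, TimeSpan.FromSeconds(5), TimeSpan.FromMilliseconds(50));
        if (!done)
        {
            throw new TimeoutException(
                $"Primary instance for batch '{batchKey}' did not complete.");
        }
        return "duplicate";
    }
}
```

Note the weakness this sketch shares with our workaround: if the primary instance stops after claiming the batch but before marking it completed, the claim is never released, and every duplicate times out. That matches the failure we describe above.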
Our standard test splits the client data (where we see the issues) into 40 batches. For the instance we reference below, we saw 45 instances of SaveClients executed.
Attempted workarounds
- Upgraded to 2.3.0 (seemed somewhat better)
- Added a guard to limit to one writing instance per batch
- Caused duplicate instances to wait on the primary instance to complete
App Details
- Durable Functions extension version (e.g. v1.8.3): 2.3.0 (also observed with 2.2.2, 2.3.0 seems somewhat better)
- Azure Functions runtime version (1.0 or 2.0): 2.0
- Programming language used: C#
If deployed to Azure
We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!
- Timeframe issue observed: 2020-09-03 8:20-8:30 US Central time
- Function App name: GlobalDataProd
- Function name(s): LoadMarketJson -> OrchestrateDataLoad -> SaveClients
- Azure region: CentralUS
- Orchestration instance ID(s): c2a41d9d35f14bccb3b806e0a99da556
- Azure storage account name: globaldatafunctionprod
(We have other instances of failures if needed.)
If you don’t want to share your Function App or storage account name on GitHub, please at least share the orchestration instance ID. Otherwise it’s extremely difficult to look up information.
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (1 by maintainers)
Thanks @mattschoutends for these details. I found your orchestration instance in our internal telemetry and can confirm that it did run into some really odd behavior. I’ll continue to take a look and see if I can determine the root cause and a possible fix/workaround.