azure-functions-durable-extension: Message loss due to race conditions with ContinueAsNew

Summary

Message loss or instance termination may be observed if an orchestrator completes after calling ContinueAsNew and subsequently processes any other message before restarting.

Details

When the orchestrator completes execution after ContinueAsNew is called, the status is internally marked as “completed” and a new message is enqueued to restart it. During this window between completion and restart, it’s possible for other messages to arrive (for example, raised events or termination messages). Because the internal state of the orchestration is completed, those messages will be dropped. It’s also possible for the DTFx runtime to terminate the instance claiming possible state corruption.
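To make the window concrete, here is a minimal toy model of the behavior described above. It is not the DTFx implementation: `simulate`, `RESTART`, and the single in-memory queue are hypothetical names and simplifications, and the sketch assumes every processed event triggers a ContinueAsNew.

```python
from collections import deque

# Hypothetical marker for the internally enqueued "restart" message.
RESTART = "__restart__"


def simulate(events):
    """Toy model: process queued messages in order and report what was lost.

    Returns (counter, dropped), where dropped lists the events that arrived
    while the instance was marked "completed".
    """
    queue = deque(events)
    status = "running"
    counter = 0
    dropped = []
    while queue:
        message = queue.popleft()
        if message == RESTART:
            # The restart message brings the instance back to life.
            status = "running"
        elif status == "completed":
            # Anything arriving between completion and restart is dropped.
            dropped.append(message)
        else:
            # Handle the event, then "ContinueAsNew": mark the instance
            # completed and enqueue the restart behind pending messages.
            counter += 1
            status = "completed"
            queue.append(RESTART)
    return counter, dropped


print(simulate(["incr", "incr", "incr"]))  # (1, ['incr', 'incr'])
```

In this model, three events sent back to back yield a counter of 1 and two dropped events, because the restart message is queued behind the events that were already waiting.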

Repro

  1. Start with the counter sample
  2. Create a new instance
  3. Call RaiseEventAsync("operation", "incr") multiple times without waiting

Expected: Calling ContinueAsNew multiple times in quick succession should never be an issue. Many workloads may require this, especially actor-style workloads.

Workaround: Have the client wait a few seconds before sending events that may cause the orchestrator to do a ContinueAsNew. This gives the instance time to get a new execution ID.
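A minimal sketch of that client-side pacing, under stated assumptions: `send_spaced` is a hypothetical helper, `send` stands in for whatever call actually raises the event (for example a wrapper around RaiseEventAsync), and the 3-second default is an arbitrary reading of "a few seconds", not a value from the source.

```python
import time


def send_spaced(send, events, delay_seconds=3.0, sleep=time.sleep):
    """Call send(event) for each event, pausing between consecutive sends.

    `send` is a hypothetical callable supplied by the caller. `sleep` is
    injectable so the pacing can be checked without real waiting.
    """
    for index, event in enumerate(events):
        if index > 0:
            # Give the instance time to restart before the next event.
            sleep(delay_seconds)
        send(event)
```

For example, `send_spaced(print, ["incr", "incr"], delay_seconds=2.0)` prints the first event, waits two seconds, then prints the second.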

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 2
  • Comments: 16

Most upvoted comments

Resolved in v1.8.0 release.

I’ve done a quick analysis based on some experiments I ran a few months back and have updated the description at the top of this thread with my thoughts on how we could potentially fix this issue.

FYI @gled4er

Yes, this one needs to be fixed. Was just playing around and ran into this immediately when creating a singleton function 😦

YOU, SIR, ARE THE BEST! \o/

Thanks for the info Chris. A bit sad to see as I thought the actor pattern was the most interesting one. I ran into the problem for a small testing/learning e-shop project. When I “spam” the “add product to cart”-button some really strange things happen. Hope you find a solution 😃

This one definitely needs to be solved, and it’s high on the list since it impacts reliability for an important scenario. The change needs to be made in the Azure Storage extension of the Durable Task Framework: https://github.com/Azure/durabletask/blob/azure-functions/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs