azure-functions-durable-extension: Message loss due to race conditions with ContinueAsNew

Summary

Message loss or instance termination may be observed if an orchestrator completes after calling ContinueAsNew and subsequently processes any other message before restarting.

Details

When the orchestrator completes execution after ContinueAsNew is called, its status is internally marked as “completed” and a new message is enqueued to restart it. In the window between completion and restart, other messages can still arrive (for example, raised events or termination messages). Because the internal state of the orchestration is completed, those messages are dropped. It’s also possible for the DTFx runtime to terminate the instance, claiming possible state corruption.
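The window described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: a single in-memory queue, one instance, and a hypothetical `run` function. It is not the DTFx or Azure Storage implementation.

```python
from collections import deque


def run(messages):
    """Toy model of one instance that calls ContinueAsNew after every event.

    Returns (events_handled, dropped_messages). Hypothetical helper, not DTFx code.
    """
    queue = deque(messages)
    status = "running"
    handled = 0
    dropped = []
    while queue:
        msg = queue.popleft()
        if msg == "restart":
            # The restart message finally arrives: a new execution begins.
            status = "running"
        elif status == "completed":
            # Message arrived between completion and restart: it is lost.
            dropped.append(msg)
        else:
            handled += 1
            # ContinueAsNew: mark as completed and enqueue the restart
            # message behind whatever is already waiting in the queue.
            status = "completed"
            queue.append("restart")
    return handled, dropped


# Three events sent without waiting: only the first one is handled.
print(run(["incr", "incr", "incr"]))  # (1, ['incr', 'incr'])
# A single event with nothing queued behind it is handled normally.
print(run(["incr"]))  # (1, [])
```

In this model the loss comes purely from ordering: the restart message is queued behind events that were already pending, so those events are seen while the status is still completed.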

Repro

1. Start with the counter sample
2. Create a new instance
3. Call RaiseEventAsync("operation", "incr") multiple times without waiting

Expected: Calling ContinueAsNew multiple times in quick succession should never be an issue. Many workloads may require this, especially actor-style workloads.

Workaround: Have the client wait a few seconds before sending events that may cause the orchestrator to do a ContinueAsNew. This gives the instance time to get a new execution ID.
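A minimal sketch of that client-side pacing follows. The `send_paced` helper, its `send` callable, and the 3-second default are all assumptions for illustration: `send` stands in for whatever call the client actually uses to raise an event, and the right delay depends on the workload.

```python
import time


def send_paced(send, events, delay_seconds=3.0, sleep=time.sleep):
    """Call send(event) for each event, pausing between consecutive sends.

    `send` is a hypothetical callable wrapping the client's raise-event call.
    `sleep` is injectable so the pacing can be checked without really waiting.
    """
    for index, event in enumerate(events):
        if index > 0:
            # Give the instance time to restart with a new execution ID.
            sleep(delay_seconds)
        send(event)


# Usage with stand-ins: record the sends and the pauses instead of doing them.
sent, pauses = [], []
send_paced(sent.append, ["incr", "incr", "incr"], sleep=pauses.append)
print(sent)    # ['incr', 'incr', 'incr']
print(pauses)  # [3.0, 3.0]
```

This only narrows the window and does not close it, which is why the issue treats it as a workaround rather than a fix.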

About this issue

State: closed
Created 7 years ago
Reactions: 2
Comments: 16

Commits related to this issue

Adding some warning for eternal orchestration Until https://github.com/Azure/azure-functions-durable-extension/issues/67 is fixed. — committed to cfe84/azure-docs by cfe84 6 years ago

Most upvoted comments

Resolved in v1.8.0 release.

cgillum on Mar 16, 2019

I’ve done a quick analysis based on some experiments I ran a few months back and have updated the description at the top of this thread with my thoughts on how we could potentially fix this issue.

FYI @gled4er

cgillum on Apr 16, 2018

Yes, this one needs to be fixed. Was just playing around and ran into this immediately when creating a singleton function 😦

christiansparre on Nov 25, 2017

YOU, SIR, ARE THE BEST! \o/

cfe84 on Mar 17, 2019

Thanks for the info Chris. A bit sad to see as I thought the actor pattern was the most interesting one. I ran into the problem for a small testing/learning e-shop project. When I “spam” the “add product to cart”-button some really strange things happen. Hope you find a solution 😃

jedjohan on Dec 22, 2017

This one definitely needs to be solved, and it’s high on the list since it impacts reliability for an important scenario. The change needs to be made in the Azure Storage extension of the Durable Task Framework: https://github.com/Azure/durabletask/blob/azure-functions/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs

cgillum on Nov 11, 2017