azure-functions-durable-extension: JobHost stops when durable function is running
We have a durable function that runs an activity function 100 times, one activity function at a time. Each activity function takes approximately 30 seconds to complete. We’ve noticed that after running for around 20 minutes the JobHost is stopped and all activity is suspended for anywhere from 10 minutes to an hour, after which the JobHost starts up with a different HostInstanceId and the durable function continues.
This happens every time with 100 activities. If we reduce the number of times the activity function is called to 10 the issue does not occur. The issue also does not occur when running on a fixed S1 App Service plan. It only occurs on the consumption plan.
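For illustration, the orchestration is essentially a sequential loop over the activity. A minimal sketch using the Durable Functions 1.x API might look like the following; the input value and loop body are placeholders, not the actual code — only the 100-call, one-at-a-time pattern is taken from the description above:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class GraphOrchestration
{
    [FunctionName("GraphOrchestratorFn")]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] DurableOrchestrationContext context)
    {
        // Call the activity 100 times, one at a time (no fan-out); each call takes ~30 seconds.
        for (int i = 0; i < 100; i++)
        {
            await context.CallActivityAsync("GraphActivityFn", i);
        }
    }
}
```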
The activity function uses a single instance of HttpClient to call an API 100 times, one request at a time.
The activity function takes a CancellationToken which is passed on to the HttpClient requests, but the token does not appear to be cancelled when the JobHost is stopped. If the activity function exceeds its maximum runtime, the CancellationToken does appear to be cancelled and the function is gracefully aborted. The token is also not cancelled when running on a fixed App Service plan and the Azure Portal UI is used to stop the app service. I see there is an existing fixed bug (4251) for a similar issue but I’m not sure which release the fix is in.
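To make the shape concrete, here is a minimal sketch of that activity; the URL, input type, and method names are placeholders rather than the real code — the relevant parts are the shared HttpClient and the CancellationToken flowing into the HTTP call:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class GraphActivity
{
    // Single shared HttpClient instance, as described above.
    private static readonly HttpClient httpClient = new HttpClient();

    [FunctionName("GraphActivityFn")]
    public static async Task<string> RunActivity(
        [ActivityTrigger] int item,
        CancellationToken cancellationToken)
    {
        // The token is expected to be signalled when the function exceeds its timeout;
        // per the report above, it is NOT signalled when the JobHost is stopped.
        HttpResponseMessage response = await httpClient.GetAsync(
            $"https://example.com/api/items/{item}", cancellationToken);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```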
2019-06-25T14:14:25.158 Stopping JobHost "Category":"Microsoft.Azure.WebJobs.Hosting.JobHostService","HostInstanceId":"98c7727a-d36f-4788-9b43-f31dd4a6c519"
2019-06-25T14:32:03.491 Starting JobHost "Category":"Microsoft.Azure.WebJobs.Hosting.JobHostService","HostInstanceId":"1d604326-f138-421a-85d3-3042c17a1ef3"
Investigative information
- Durable Functions extension version: 1.0.29
- Function App version (1.0 or 2.0): 2.0
- Programming language used: C#
- Timeframe issue observed: 2019-06-25T14:14:25.158 UTC
- Function App name: Starling-IdentityDataIngestion-dev
- Function name(s): GraphOrchestratorFn & GraphActivityFn
- Region: West US
- Orchestration instance ID(s): 6c5f3c2bd674454c91976e60249540a1
Regarding using timer-triggered functions as a workaround: this only works if your app is running on a single instance. If it’s scaled out to multiple instances, the timer-triggered function will only protect one of them.
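For context, the workaround being discussed is a trivial keep-alive timer function along these lines; the function name and the 10-minute schedule are assumptions based on the test mentioned later in the thread:

```csharp
using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class KeepAlive
{
    // Hypothetical keep-alive function: it does no real work, it only generates periodic
    // activity so the scale controller does not treat the app as idle. Per the caveat
    // above, when the app is scaled out only the instance running the timer is protected.
    [FunctionName("KeepAliveFn")]
    public static void Run([TimerTrigger("0 */10 * * * *")] TimerInfo timer, ILogger log)
    {
        log.LogInformation("Keep-alive tick at {Time}", DateTime.UtcNow);
    }
}
```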
For folks still running into this, it would be helpful to mention which plan type and OS you’re using since there are slightly different behaviors across Consumption, Elastic Premium, Windows, and Linux. For example, Elastic Premium apps have a bit more protection against this behavior since EP needs to support longer-running executions. Work is being done by the App Service and Functions teams to help solve the “idle” issue for everyone, but progress on deploying updates everywhere has been slowed by various issues that some customers are running into when we allow functions to keep running before shutting down the host. We on the DF team don’t have a lot of great visibility into the progress of this work, however, so if you need a more immediate or official answer, then I suggest you open a support case with Azure.
@HobbsB I’ll leave this issue open until a fix to Azure App Service is in place. In the meantime, your workaround to use an App Service plan sounds like a good one.
Interesting. I think @olitomlinson’s explanation is correct regarding the 20 minutes of inactivity. The queue messages are probably getting picked up so fast that the scale controller never notices that any messages enter the queue, and therefore thinks the app is idle. We’ll need to think of a good way to account for this.
Regarding the reported delay of 10 minutes to an hour before the JobHost restarts: this long delay surprises me. Ideally it should resume immediately, or in 5 minutes at the longest (this is the default visibility timeout for queue messages). In host.json, if you decrease `workItemQueueVisibilityTimeout` to a smaller value, does that change how long it takes to recover?
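For anyone trying this, a sketch of where that setting could go is below; this assumes the Functions v2 host.json layout with Durable Task settings under `extensions/durableTask`, and the one-minute value is only an example, not a recommendation:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "workItemQueueVisibilityTimeout": "00:01:00"
    }
  }
}
```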
I checked about the `CancellationToken` fix for https://github.com/Azure/azure-functions-host/issues/4251, and I was told that it’s currently deploying as part of Azure Functions v2.0.125549. Most likely that will finish deploying everywhere by early next week. I’m wondering if that would fix the delay problem as well. Can you let me know if you still have this problem after your function app gets upgraded?

@olitomlinson Thank you, that looks like it could be the issue. I’ve tested running a simple function every 10 minutes while the Durable Function is active and, for the first time, the JobHost did not stop. I’ll run a few more tests in the morning to confirm.
Previously the JobHost would be stopped even though it was actively processing an activity function and had been processing at least one every minute since the durable function workload started.
Successful orchestration instance Id: b5f5b77c5a33419285379af85bb8adf0