azure-webjobs-sdk-extensions: Retry policies for isolated (out-of-process) AF Cosmos DB triggers/bindings soon deprecated?

@kshyju's answer below works at the moment, but we recently started to see the following trace in AppInsights for our AF Cosmos DB triggers: “Soon retries will not be supported for function ‘[Function Name]’. For more information, please visit http://aka.ms/func-retry-policies.”

The Retry examples section also states “Retry policies aren’t yet supported when running in an isolated process”, and the Retries section shows no support for the Cosmos DB trigger/binding.

What’s the path forward for AF Cosmos DB triggers running out-of-process?

Yes, retry policies are supported in isolated (out-of-process) function apps. You can enable them by adding the retry section to your host config. Here is an example host.json:

{
  "version": "2.0",
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": 2,
    "delayInterval": "00:00:03"
  }
}
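
For context, here is roughly what a function that this host-level retry applies to looks like in the isolated worker model. This is a minimal sketch assuming the 4.x Microsoft.Azure.Functions.Worker.Extensions.CosmosDB package (parameter names may differ between versions); the database, container, connection, and type names are hypothetical, and the per-function [FixedDelayRetry]/[ExponentialBackoffRetry] attributes mentioned later in this thread are an alternative to the host.json section.

using System.Collections.Generic;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Logging;

public class OrdersChangeFeed
{
    private readonly ILogger<OrdersChangeFeed> _logger;

    public OrdersChangeFeed(ILogger<OrdersChangeFeed> logger)
    {
        _logger = logger;
    }

    // The host.json "retry" section applies here; alternatively a per-function
    // policy such as [FixedDelayRetry(2, "00:00:03")] can be declared.
    [Function("OrdersChangeFeed")]
    public void Run(
        [CosmosDBTrigger(
            databaseName: "MyDatabase",              // hypothetical names
            containerName: "Orders",
            Connection = "CosmosDbConnection",
            LeaseContainerName = "leases",
            CreateLeaseContainerIfNotExists = true)]
        IReadOnlyList<OrderDocument> changes)
    {
        // Throwing from here triggers the configured retry policy (in memory).
        _logger.LogInformation("Processing {Count} changes", changes.Count);
    }
}

public record OrderDocument(string id);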

The reason I’m asking is that the documentation mentions retries require the NuGet package Microsoft.Azure.WebJobs >= 3.0.23.

That documentation, which refers to the usage of the ExponentialBackoffRetry attribute, is for in-proc function apps.
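
For reference, the in-proc pattern that documentation covers looks roughly like the following. This is a minimal sketch assuming the 3.x in-proc Cosmos DB extension and Microsoft.Azure.WebJobs >= 3.0.23; the database, collection, and connection names are hypothetical.

using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class OrdersChangeFeedInProc
{
    // In-proc retry attribute (maxRetryCount, minimumInterval, maximumInterval).
    [FunctionName("OrdersChangeFeedInProc")]
    [ExponentialBackoffRetry(5, "00:00:04", "00:15:00")]
    public static void Run(
        [CosmosDBTrigger(
            databaseName: "MyDatabase",              // hypothetical names
            collectionName: "Orders",
            ConnectionStringSetting = "CosmosDbConnection",
            LeaseCollectionName = "leases",
            CreateLeaseCollectionIfNotExists = true)]
        IReadOnlyList<Document> changes,
        ILogger log)
    {
        // Throwing here triggers the retry policy; retries happen in process.
        log.LogInformation("Processing {Count} changes", changes.Count);
    }
}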

Please give it a try and let us know if you run into any problems.

_Originally posted by @kshyju in https://github.com/Azure/azure-functions-dotnet-worker/issues/832#issuecomment-1072545934_

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 7
  • Comments: 54 (19 by maintainers)

Most upvoted comments

@sayyaari The change I added is for the 4.x extension that will release sometime next month; the documentation would need to be updated, I guess.

It’s finally on GA! Thank you very much @ealsur!

We can finally update now!

I get that, but just like staging slots, which were in preview for a very long time (1+ years), some adopt things sooner out of necessity. What about guidance then? Will the host just ignore the attribute or stop the app from starting? I’m sure there were many who were excited to see this as it relates to the Cosmos trigger and are now going to be facing the same fate.

@kshyju, thanks for your response. Is the retry behavior you describe above permanent in the runtime for out-of-proc functions, and will it remain so?

I opened this issue because, as stated above, in early June we started to see the following trace in AppInsights for our AF Cosmos DB triggers: “Soon retries will not be supported for function ‘[Function Name]’. For more information, please visit http://aka.ms/func-retry-policies.”

This is the current behavior as of today. Let me circle back with the team to understand what the migration path is and update you.

Overview

Hello, after some extensive testing with the latest versions of the NuGet packages for an out-of-process dotnet Cosmos change feed function, here’s what I discovered the behavior to be. PLEASE correct me where I missed something 😄

Goal: My goal is to be able to use Azure Functions with Cosmos Change Feed as a way to reliably process changes coming from Cosmos DB. (Think replication to another collection, notifying consumers, etc.)

Problem: I found that code written the way one might expect to write it (i.e. throw an exception in the change feed processor function) still advances the change feed in edge cases, making the provided tooling for the Azure Function Change Feed Trigger a “best effort” process instead of reliable processing (see scenarios below). This means that changes in the Change Feed can be lost unexpectedly (from the perspective of the developer writing the code) in edge cases.

Environment

func --version
4.0.5198

dotnet --version
7.0.304

*.csproj
<Project Sdk="Microsoft.NET.Sdk">
    <PropertyGroup>
        <TargetFramework>net7.0</TargetFramework>
        <AzureFunctionsVersion>V4</AzureFunctionsVersion>
        <OutputType>Exe</OutputType>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
    </PropertyGroup>
    <ItemGroup>
        <PackageReference Include="Azure.Storage.Queues" Version="12.14.0" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.14.1" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.CosmosDB" Version="4.3.0" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.0.13" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Storage.Queues" Version="5.1.2" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.10.0" />
    </ItemGroup>
...
</Project>

OS
Edition	Windows 11 Pro
Version	22H2
Installed on	‎10/‎15/‎2022
OS build	22621.1848
Experience	Windows Feature Experience Pack 1000.22642.1000.0

The failure conditions / edge cases I tried are meant to simulate:

  • The function runtime process crashing
  • The isolated worker process crashing
  • A dependency of the user function code being down for an extended period of time (i.e. a service outage)
  • The user function code having a bug requiring a code deployment (wherein the function runtime is gracefully shut down)

NOTE: I did these tests with the [FixedDelayRetry] attribute on my function. It’s my understanding (and assumption) that this behavior is similar, if not worse, without the retry attribute.

Findings

NOTE: When I say “runtime process” I’m referring to func.exe from the Azure Function Core Tools locally. When I say “worker process” or “isolated process” I’m referring to dotnet.exe running my user function code locally.

The change feed will still advance in all of these cases:

  • Unhandled exceptions in user function code will still advance the change feed
  • Adding the [ExponentialBackoffRetry] or [FixedDelayRetry] attribute will retry in memory, but if the runtime process is gracefully stopped (i.e. code deployment, environment restart, etc.) it will still advance the change feed
  • Repeatedly throwing an exception in the function will still advance the change feed
  • Repeatedly crashing the worker process (i.e. with Environment.FailFast() in the user function code) will still advance the change feed
  • Exiting the worker process with a non-zero status code (i.e. Environment.Exit(-1)) seems to hang indefinitely, but when you go to stop the runtime process, it will advance the change feed.

The change feed will not advance in any of these cases:

  • The runtime process is killed or terminates unexpectedly (i.e. host VM crashed, Process is terminated via task manager, etc.).
  • The runtime process starts successfully, but the worker process does not start successfully. (i.e. The worker process crashes before host.Run())
  • The runtime and worker processes start and begin to process a message, but then the worker (isolated) process crashes and the auto-restart of the worker process (which is controlled by the runtime process) exceeds the number of retries (i.e. the worker process crashed and never successfully started), with this error message: “Exceeded language worker restart retry count for runtime:dotnet-isolated. Shutting down and proactively recycling the Functions Host to recover.” This crash of the worker process during startup must take place before the host.Run() in Program.cs. NOTE: If the worker process successfully restarts (i.e. host.Run() succeeds), it doesn’t matter that you continue to crash the process in the function code; the feed will still advance. You must crash the worker process before host.Run().

NOTE: I did find it unusual that the runtime Cosmos trigger was so lenient about when it advances the change feed lease/tracker; I expected it to be a lot more conservative. For example, even when I see the following in the logs, which clearly indicates a scenario where the worker is not happy during processing, the change feed tracking is still advancing.

Language Worker Process exited. Pid=43372.
Exceeded language worker restart retry count for runtime:dotnet-isolated. Shutting down and proactively recycling the Functions Host to recover
Lease 0 encountered an unhandled user exception during processing.

How to get reliable processing (workaround)

The only thing I was able to get to work was to use Environment.FailFast() in the user function code, when encountering an issue that is not recoverable, to kill the isolated worker process. I must crash the worker process; simply throwing exceptions in user code or even Environment.Exit(1) doesn’t work here.

Then, when the runtime process restarts the worker process, I need to detect that failure condition in the bootstrap / startup code (inside Program.cs) of the worker process and crash the process before host.Run() is called. Crashing here can be throwing an unhandled exception, Environment.FailFast(), or even Environment.Exit([non-zero value]) with a non-zero exit code; it doesn’t matter, as long as host.Run() isn’t called and the process exits in a failure condition.

In this scenario, the runtime process will restart the worker process a few times, then give up with the error message “Exceeded language worker restart retry count for runtime:dotnet-isolated. Shutting down and proactively recycling the Functions Host to recover”, and the runtime process will exit. The Change Feed is not advanced, and the next time the runtime process is started it will attempt to process the change in the feed again.
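
To make the workaround concrete, here is a minimal sketch of the idea. The sentinel file (poison.marker) and its handling are hypothetical illustrations, not an official API; the essential points from the description above are crashing the worker with Environment.FailFast() on an unrecoverable error, and crashing again in Program.cs before host.Run() so the runtime exhausts its worker restarts without advancing the change feed.

// Program.cs (isolated worker bootstrap) -- sketch only
using System;
using System.IO;
using Microsoft.Extensions.Hosting;

// Hypothetical sentinel written by the function code when it hits an
// unrecoverable error; its presence means "do not start cleanly".
const string PoisonMarker = "poison.marker";

if (File.Exists(PoisonMarker))
{
    // Crash BEFORE host.Run() so the runtime's worker-restart retries are
    // exhausted, the host recycles, and the change feed lease is not advanced.
    Environment.FailFast($"Refusing to start: {PoisonMarker} present.");
}

var host = new HostBuilder()
    .ConfigureFunctionsWorkerDefaults()
    .Build();

host.Run();

// Inside the Cosmos DB triggered function, the unrecoverable-error path would
// look roughly like this (how to classify "unrecoverable" is up to you):
//
//   catch (Exception ex)
//   {
//       File.WriteAllText("poison.marker", ex.ToString());
//       // Throwing or Environment.Exit(1) is not enough; the worker must crash.
//       Environment.FailFast("Unrecoverable change feed failure.", ex);
//   }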

NOTE: “Just putting the message on a queue from code in your change feed function” (i.e. Service Bus or Storage) won’t work here, because that operation can also fail, and if the current in-progress change is not properly retried it will be lost. Even in the “put it on a queue” scenario, we need to reliably get the value from the change feed to the queue without dropping changes. Due to the above edge cases, that isn’t possible without this workaround, from what I can see.

Is there anything that can be put in the catch block that will achieve the suppression of checkpointing?

Not at this point. As I mentioned in https://github.com/Azure/azure-webjobs-sdk-extensions/issues/783#issuecomment-1216727630, aligning with the Change Feed Processor, the only potential alternatives we could offer for the Cosmos DB Trigger are either an infinite retry (retry on any error) or no retry. There is no exponential backoff retry support on the Change Feed Processor. When checkpointing is aborted due to a processing error, the lease is released and any instance (the same or another) could claim it and retry, so storing the number of previous attempts on an instance would not apply.

That is, unless the Retry Policies are exposed to extension authors in a way that we could invoke/call/signal them; but the model, from what I understand in this thread, is that Retry Policies exist and apply outside of the extension scope.
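
For readers less familiar with the Change Feed Processor behavior referenced above: in the Cosmos DB v3 .NET SDK, a batch is checkpointed only when the changes handler completes successfully; if the handler throws, the checkpoint is skipped and the lease is released for some instance to retry. A rough sketch at the SDK level (the container names, document type, and HandleAsync are hypothetical):

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class ChangeFeedProcessorSketch
{
    // 'monitored' and 'leases' are Container instances obtained elsewhere
    // from a CosmosClient.
    public static ChangeFeedProcessor Build(Container monitored, Container leases) =>
        monitored
            .GetChangeFeedProcessorBuilder<OrderDocument>(
                "orders-processor",
                async (IReadOnlyCollection<OrderDocument> changes, CancellationToken ct) =>
                {
                    foreach (var doc in changes)
                    {
                        // If this throws, the batch is NOT checkpointed; the lease
                        // is released and some instance will retry the batch later.
                        await HandleAsync(doc, ct);
                    }
                })
            .WithInstanceName("instance-1")
            .WithLeaseContainer(leases)
            .Build();

    private static Task HandleAsync(OrderDocument doc, CancellationToken ct) => Task.CompletedTask;

    public record OrderDocument(string id);
}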

Because this is not exclusive to the isolated model, we should move this item to the extension repo. My understanding is that this would be a feature request for the Cosmos DB extension.

Thanks for the feedback. I opened this issue specifically for out-of-proc functions, so will await the response of @shibayan or @Ved2806 before closing.

There are two separate issues here.

The first one is that I have certainly observed that deploying changes to the function app means that batches end up being aborted (maybe due to a thread abort) while the lease is still updated, so the net effect is that the batch is just missed. After using this preview feature I did not observe any instances of this. When it is removed, I will no longer have confidence in any method of deployment that avoids this (we never got to the bottom of this issue in the linked thread).

The second is that I don’t want to provision a whole load of largely redundant resources for the 0.001% case (and write code to then merge from the fallback storage to the main storage). I just want the checkpoint not to be written and for it to be retried later (as happens with the exponential retry). Is there anything that can be put in the catch block that will achieve the suppression of checkpointing?

(A typical use case for me is to use the Change Feed to upsert into a different Cosmos DB, e.g. to maintain a materialized view of a collection, so pretty much all errors will be transient, such as throttling, and will succeed on retry.)

Just to be clear, the retry policy/attribute does indeed work today, i.e. with lease checkpointing working correctly. It appears to use an in-memory retry policy similar to Polly.

The fundamental issue is not that CosmosDBTrigger cannot be configured with a retry policy, but that CosmosDBTrigger advances the lease when a Function fails to execute.

I believe this could be solved if, instead of a retry policy, CosmosDBTrigger could use exponential backoff to extend the Change Feed polling time without advancing the lease on a function failure.

@ealsur Yes, same list. The logic is somewhere in the Functions host, I believe.

Will there be any guidance on retries for Cosmos DB triggered functions before the October 2022 deadline when retry policy support will be removed?

Cosmos DB is currently the outlier on the table of retry support for triggers/bindings, with “n/a” and “Not configurable” listed. https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages#retries

I agree; we built our change feed processing with the retries in mind, and we were just about to roll out our solution. While moving to WebJobs with our own change feed processor is a viable option, we would lose other benefits that Azure Functions can provide. Any guidance for Cosmos DB change feed retries would be appreciated.

Hi @ealsur Could you please help with this issue?