azure-webjobs-sdk-extensions: Retry policies for isolated (out-of-process) AF Cosmos DB triggers/bindings soon deprecated?
@kshyju’s answer below works at the moment, but we recently started to see the following trace in AppInsights for our AF Cosmos DB triggers: “Soon retries will not be supported for function ‘[Function Name]’. For more information, please visit http://aka.ms/func-retry-policies.”
The Retry examples section also shows “Retry policies aren’t yet supported when running in an isolated process.” and the Retries section reflects no support for the Cosmos DB trigger/binding.
What’s the path forward for AF Cosmos DB triggers running out-of-process?
Yes, retry policies are supported in isolated (out-of-process) function apps. You can enable them by adding the retry section to your host config. Here is an example host.json:
    {
      "version": "2.0",
      "retry": {
        "strategy": "fixedDelay",
        "maxRetryCount": 2,
        "delayInterval": "00:00:03"
      }
    }
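For comparison, an exponential backoff variant of the same host.json section might look like the following. This is only a sketch: it assumes the host-level `retry` section accepts an `exponentialBackoff` strategy with `minimumInterval` and `maximumInterval`, mirroring the options of the in-proc `ExponentialBackoffRetry` attribute. The values are illustrative and not taken from this thread.

```json
{
  "version": "2.0",
  "retry": {
    "strategy": "exponentialBackoff",
    "maxRetryCount": 5,
    "minimumInterval": "00:00:05",
    "maximumInterval": "00:05:00"
  }
}
```

Either way the retries happen in memory on the host, so they are subject to the same deprecation warning quoted at the top of this issue.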
The reason I’m asking is that the documentation mentions that retries require the NuGet package Microsoft.Azure.WebJobs >= 3.0.23
That documentation, which refers to the usage of the `ExponentialBackoffRetry` attribute, is for in-proc function apps.
Please give it a try and let us know if you run into any problems.
_Originally posted by @kshyju in https://github.com/Azure/azure-functions-dotnet-worker/issues/832#issuecomment-1072545934_
About this issue
- State: open
- Created 2 years ago
- Reactions: 7
- Comments: 54 (19 by maintainers)
@sayyaari The change I added is for the 4.x extension that will release sometime next month; the documentation would need to be updated, I guess.
Will there be any guidance on retries for Cosmos DB triggered functions before the October 2022 deadline when retry policy support will be removed?
Cosmos DB is currently the outlier on the table of retry support for triggers/bindings, with “n/a” and “Not configurable” listed. https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages#retries
It’s finally on GA! Thank you very much @ealsur!
We can finally update now!
I get that, but just like stage slots, which were in preview for a very long time (1+ years), some adopt things sooner out of necessity. What about guidance then? Will the host just ignore the attribute or stop the app from starting? I’m sure there were many excited to see this as it relates to the Cosmos trigger who are now going to be facing the same fate.
This is the current behavior as of today. Let me circle back with the team to understand what is the migration path & update you.
Overview
Hello, after some extensive testing with the latest version of the NuGet packages for an out-of-process dotnet Cosmos change feed function, here’s what I discovered the behavior to be. PLEASE correct me where I missed something 😄
Goal: My goal is to be able to use Azure Functions with Cosmos Change Feed as a way to reliably process changes coming from Cosmos DB. (Think replication to another collection, notifying consumers, etc.)
Problem: I found that the behavior of code written in the way one might expect to write it (i.e. throw exception in change feed processor function) still advances the change feed in edge cases making the provided tooling for Azure Function Change Feed Trigger into a “best effort” process instead of reliable processing (see scenarios below). This behavior means that changes in the Change Feed can be lost unexpectedly (to the developer writing code) in edge cases.
Environment
The failure conditions / edge cases I tried are meant to simulate:
NOTE: I did these tests with the `[FixedDelayRetry]` attribute on my function. It’s my understanding (and assumption) that this behavior is similar, if not worse, without the retry attribute.
Findings
NOTE: When I say “runtime process” I’m referring to `func.exe` from the Azure Functions Core Tools locally. When I say “worker process” or “isolated process” I’m referring to `dotnet.exe` running my user function code locally.
The change feed will still advance in all of these cases:
- `ExponentialBackoffRetry[]` or `FixedDelayRetry[]` attributes will retry in-memory, but if the runtime process is gracefully stopped (i.e. code deployment, environment restart, etc.) it will still advance the change feed
- […] `Environment.FailFast()` in the User Function Code) will still advance the change feed
- […] `Environment.Exit(-1)`) seems that it will hang indefinitely, but when you go to stop the runtime process, it will advance the change feed.
The change feed will not advance in any of these cases:
- […] `host.Run()`)
- […] `Exceeded language worker restart retry count for runtime:dotnet-isolated. Shutting down and proactively recycling the Functions Host to recover`. This crash of the worker process during startup must take place before the `host.Run()` in `Program.cs`. NOTE: If the worker process successfully restarts (i.e. `host.Run()` succeeds) it doesn’t matter that you continue to crash the process in the function code, the feed will still advance; you must crash the worker process before `host.Run()`.
NOTE: I did find it unusual that the runtime Cosmos trigger was so lenient about when it advances the change feed lease/tracker; I expected it to be a lot more conservative. For example, even when I see this in the logs, which clearly indicates a scenario where the worker is not happy during processing, the change feed tracking is still advancing.
How to get reliable processing (workaround)
The only thing I was able to get to work was to use `Environment.FailFast()` in the User Function Code, when encountering an issue that is not recoverable, to kill the isolated worker process. I must crash the worker process; simply throwing exceptions in user code or even `Environment.Exit(1)` doesn’t work here.
Then, when the runtime process restarts the worker process, I need to detect that failure condition in the bootstrap / startup code (inside the `Program.cs`) of the worker process and crash the process before `host.Run()` is called. Crashing here can be throwing an unhandled exception, `Environment.FailFast()`, or even `Environment.Exit([non-zero value])`; it doesn’t matter as long as `host.Run()` isn’t called and the process exits in a failure condition.
In this scenario, the runtime process will restart the worker process a few times, then give up with the error message `Exceeded language worker restart retry count for runtime:dotnet-isolated. Shutting down and proactively recycling the Functions Host to recover` and the runtime process will exit. The Change Feed is not advanced, and the next time the runtime process is started it will attempt to process the change in the feed again.
NOTE: “Just putting the message on a queue from code in your change feed function” (i.e. ServiceBus or Storage) won’t work here because that operation can also fail, and if the current in-progress change is not properly retried it will be lost. Even in the “put it on a queue” scenario, we need to reliably get the value from the change feed to the queue without dropping changes. Due to the above edge cases, that isn’t possible without this workaround from what I can see.
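The two-step workaround described above (crash the worker from the function, then refuse to reach `host.Run()` on restart) could be sketched roughly as follows. This is a minimal illustration, not a tested or recommended implementation. `CrashGuard`, its method names, and the marker file are all hypothetical, and the sketch assumes a file in the local temp directory survives a worker restart, which the thread does not confirm for hosted environments.

```csharp
using System;
using System.IO;

// Hypothetical helper: none of these names come from the Functions SDK.
public static class CrashGuard
{
    // Assumption: a marker file on local disk survives a worker process restart.
    private static readonly string MarkerPath =
        Path.Combine(Path.GetTempPath(), "changefeed-poison.marker");

    // Call from the function's catch block when an error is not recoverable.
    public static void Trip(string reason)
    {
        // Record the failure first so the next worker start can detect it.
        File.WriteAllText(MarkerPath, reason);

        // Kill the worker process immediately. Per the findings above,
        // throwing or calling Environment.Exit(1) here was not sufficient.
        Environment.FailFast("Unrecoverable change feed error: " + reason);
    }

    // Call as the first statement in Program.cs, before host.Run().
    public static void ExitIfTripped()
    {
        if (File.Exists(MarkerPath))
        {
            Console.Error.WriteLine(
                "Refusing to start, marker found: " + File.ReadAllText(MarkerPath));

            // Any failure exit before host.Run() is reported to work.
            Environment.Exit(1);
        }
    }
}
```

In `Program.cs`, `CrashGuard.ExitIfTripped();` would go before the host is built and `host.Run()` is called. The marker has to be deleted by whoever resolves the underlying problem, otherwise the worker will keep exiting on every start.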
Not at this point. As I mentioned in https://github.com/Azure/azure-webjobs-sdk-extensions/issues/783#issuecomment-1216727630, aligning with the Change Feed Processor, the only potential alternatives we could offer for the Cosmos DB Trigger are either an infinite retry (retry on any error) or no retry. There is no exponential backoff retry support on the Change Feed Processor. When checkpointing is aborted due to a processing error, the lease is released, and any instance (same or other) could claim it and retry, so storing the number of previous attempts on an instance would not apply.
Unless the Retry Policies are exposed to Extension authors in a way that we could invoke/call/signal them in some way but the model from what I understand in this thread is that Retry Policies exist and apply outside of the Extension scope.
Because this is not exclusive to the isolated model, we should move this item to the extension repo. My understanding is that this would be a feature request for the Cosmos DB extension.
Thanks for the feedback. I opened this issue specifically for out-of-proc functions, so will await the response of @shibayan or @Ved2806 before closing.
There are two separate issues here.
The first one is that I have certainly observed that deploying changes to the function app means that batches end up being aborted (maybe due to thread abort) while the lease is still updated, so the net effect is that the batch is just missed. After using this preview feature I did not observe any instances of this. When it is removed I will no longer have confidence in any method of deployment that avoids this (the linked thread never got to the bottom of this issue).
The second is that I don’t want to provision a whole load of largely redundant resources for the 0.001% case (and write code to then merge from the fallback storage to the main storage). I just want the checkpoint not to be written and for it to be retried later (as happens with the exponential retry). Is there anything that can be put in the catch block that will achieve the suppression of checkpointing?
(Typical use case for me is to use the Change feed to upsert to a different Cosmos - e.g. to maintain a materialized view of a collection so pretty much all errors will be transient such as throttling and succeed on retry)
Just to be clear, the retry policy/attribute does indeed work today, i.e. with lease checkpointing working correctly. It appears to be using an in-memory retry policy similar to Polly.
The fundamental issue is not that `CosmosDBTrigger` cannot be configured with a retry policy, but that `CosmosDBTrigger` advances the lease when a Function fails to execute.
I believe this could be solved if `CosmosDBTrigger`, instead of a retry policy, could use exponential backoff to extend polling time for Change Feed without advancing the lease on Functions failure.
@ealsur Yes, Same list. The logic is somewhere in the functions host I believe.
I agree, we built our change feed processing with the retries in mind and we were just about to roll-out our solution. While moving to webjobs with our own change feed processor is a viable option, we would be losing other benefits that azure functions can provide. Any guidance for cosmos db change feed retries would be appreciated.
Hi @ealsur Could you please help with this issue?