azure-functions-durable-extension: Orchestration getting stuck while getting the lock
Description
The full description, with code snippets, screenshots, issue samples, etc., is here: https://github.com/Azure/azure-functions-durable-extension/discussions/2530
Expected behavior
Acquire the lock in seconds at most, not minutes or hours.
Actual behavior
The orchestration appears to be getting stuck while acquiring the locks of the entities involved in the orchestration.
Known workarounds
Reset the durable storage account and the func app storage account
App Details
- Durable Functions extension version (e.g. v1.8.3): 2.10.0
- Azure Functions runtime version (1.0 or 2.0): 2
- Programming language used: C#
If deployed to Azure
- Timeframe issue observed:
- Function App name: orders-saga
- Azure region: West US
- Azure storage account name: ordersagav2
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 128 (61 by maintainers)
I have a longer update to provide, but I wanted to very quickly let you know that the new logs tell me with high precision what went wrong and I’m working to release another private package.
Once I have it, I’ll post it with my analysis of the most recent incident.
@vany0114: That makes sense. I don’t believe at this time that this is an issue caused by EH throughput.
I’ll keep you posted on this thread regarding a new mitigation package but I don’t think I’ll be able to provide it today: I’m trying to first look at the logs to understand what’s happening here. I want to avoid giving you a package that makes things worse. At this point, I can already confirm again, from telemetry, that the mitigation is working as expected, and I can also see the error you report about client operations getting cancelled due to shutdowns, but I have yet to determine the end-to-end root cause of the shutdown.
I see there was a spike of these issues starting on 9/26; I’m diving deeper into that next.
@vany0114:
The Netherite SME and I have identified an area of the code that we think explains the behavior we’re seeing with your app. To put it succinctly, it’s possible for the “FASTER” key-value store (one of our dependencies) to silently refuse to start the checkpointing process, and Netherite currently does not react to this. We’re prototyping a patch to account for this possibility and eventually terminate the partition if FASTER has continuously refused for over a minute to checkpoint the partition event log (which, as you know, otherwise grows too large).
We don’t have proof that this is exactly what is happening with your app, but it’s the most realistic theory that is consistent with the logs we have. If we had a repro with the new package, we’d be able to confirm for certain that this is the problem, but until then, this mitigation is our best bet. I’m prototyping this now and trying to provide it to you as fast as possible.
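To illustrate the shape of that mitigation, here is a minimal sketch of such a watchdog (not the actual Netherite patch; all type and member names are hypothetical): if no checkpoint completes within the time limit, the partition is terminated so it can restart cleanly.

```csharp
// Hypothetical sketch of a checkpoint watchdog; not the actual Netherite implementation.
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class CheckpointWatchdog : IDisposable
{
    private readonly TimeSpan limit = TimeSpan.FromMinutes(1);
    private readonly Func<Task> terminatePartitionAsync; // hypothetical callback that restarts the partition
    private readonly Timer timer;
    private long lastCheckpointTicks = DateTime.UtcNow.Ticks;

    public CheckpointWatchdog(Func<Task> terminatePartitionAsync)
    {
        this.terminatePartitionAsync = terminatePartitionAsync;
        this.timer = new Timer(_ => this.Check(), null, TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(10));
    }

    // Called whenever a full checkpoint completes successfully.
    public void OnCheckpointCompleted()
        => Interlocked.Exchange(ref this.lastCheckpointTicks, DateTime.UtcNow.Ticks);

    private void Check()
    {
        var last = new DateTime(Interlocked.Read(ref this.lastCheckpointTicks), DateTimeKind.Utc);
        if (DateTime.UtcNow - last > this.limit)
        {
            // No checkpoint has completed for too long: restart the partition.
            _ = this.terminatePartitionAsync();
        }
    }

    public void Dispose() => this.timer.Dispose();
}
```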
@vany0114: that test took longer than expected as I needed to remove a rather noisy log from my first private package and then re-test the new payload, but I’m done validating it.
Please use this new private package (make sure the suffix is -checkpoint-logs-2 instead of -checkpoint-logs-1): https://durabletaskframework.visualstudio.com/Durable Task Framework CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite.AzureFunctions/overview/1.4.0-checkpoint-logs-2

To be able to download this package, you will probably need to add the corresponding feed as a source in your nuget.config (a sketch of what that entry looks like follows below).

Please let me know once you deploy a new version of your app with this dependency change, and please let me know once you identify a repro with it (and provide us with the usual identifying information) so that we can take a look at our new telemetry. Thank you so much for your assistance.
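For reference, a nuget.config with such a source added generally looks like the sketch below; the feed URL shown is a placeholder rather than the actual URL from the comment above, so substitute the real one:

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" />
    <!-- Placeholder URL: replace with the durabletask Azure DevOps feed from the comment above -->
    <add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/.../nuget/v3/index.json" />
  </packageSources>
</configuration>
```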
Note that I’m not the Netherite expert on our team, but I still wanted to share my observations so far; note that they are subject to correction:
I’m looking at the logs for
{"storeId":"7794ca1e-7e8c-49ef-8e2a-763fc114a99c","courseUid":"cb03b4f3-a26a-4465-84d1-9e44a7ae65e2","checkNumber":"28584"}
and I’m seeing something that seems rather Netherite-specific: the partition hosting EntityId “@ticketnumberkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1” (partition 7) appears to be assigned to workers in quick succession from ~8:30 UTC to ~9:22 UTC.

My theory is that since this partition is constantly getting re-assigned to new owner workers, we are not able to effectively process its messages, and therefore your locking request is taking a long while to process. At ~9:29, partition 7 finally gets a stable worker assignment and it processes the lock request for “@originorderkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1”, which in turn requests the lock for the next entity “@scheduledkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1”, and so on until all entities are locked and the original orchestrator is unblocked. This process occurs quickly once we’re able to obtain the lock for “@originorderkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1” at ~9:22, so I’m currently thinking that the frequent re-assignment of partition 7 across workers was to blame for the delay.
I do not know why partition 7 is being re-assigned to workers so often. I will need assistance from the Netherite SME to figure that out.
I’m going to analyze your other Netherite instanceIDs to see if the same behavior is to blame in those ones as well. From there, the next steps should be clearer.
@vany0114: I responded in our new thread: https://github.com/Azure/azure-functions-durable-extension/issues/2603#issuecomment-1751435249
@vany0114: Yes, I’ve been back since this morning; apologies for the lack of follow-up here. I’m going to quickly look at making the mitigation more aggressive (less time to fire) and provide you an update on the restarts after reviewing the telemetry around them. My hunch, based on previous observations, is that the restarts are occurring due to OOMs/CPU consumption in the app, but I have yet to confirm that. If that hunch is correct, upgrading to a larger VM size may be the right solution here; but that’s just my unsubstantiated first impression.
Hey @vany0114: Just wanted to provide an update here.
I just took a look at the behavior of your app, in relation to this issue, over the last few days. I notice the mitigation firing somewhat regularly (which isn’t ideal; it tells me your application may be unstable for other reasons), but the mitigation does appear to be doing its job and keeping your partition log length in check.
I’ll keep an eye out on its behavior for a few more days. If the issue has not reproduced by Friday EOD, I was hoping to close this issue as well as the support ticket associated with this thread. In the meantime, the team will be working to merge the mitigation we shared with you into the mainline Netherite package. If the issue re-occurs, you can re-open this thread and we’ll continue the conversation, though hopefully that will not be necessary 😃.
The latest private package is now out under version 1.4.0-checkpoint-logs-5: https://durabletaskframework.visualstudio.com/Durable Task Framework CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite.AzureFunctions/overview/1.4.0-checkpoint-logs-5

As for the analysis:
I visualized your partition log lengths over the past 2 days and encountered 2 peaks. The first peak is for partition 10, which has a log length of ~1.4GB at 9/9 9:44, and then we have partition 9 which has a log length of ~4GB at 9/10 11:59. This is broadly visualized in the graph below.
I decided to focus on the second peak, for partition 9.
Narrowing in on that, I decided to check how many times per hour the checkpoint state machine returns to its initial state, which means a full checkpoint occurred. It’s clear from this that checkpointing stopped occurring some time after 9/9 16:00 until 9/10 12:00. Below is a table showing the number of times checkpointing occurred per one-hour bin from 9/9 13:00 to 9/10 13:00. From there, you can see there are several hours without any checkpoints between 9/9 16:00 and 9/10 12:00.
[Table of checkpoint counts per one-hour bin from 9/9 13:00 to 9/10 13:00; not reproduced here.]
Now this is where the custom logs come in to help us dig deeper, but first I need to clarify that this checkpointing logic has 3 steps: (1) the “compaction” step, (2) the index checkpoint, and (3) the store checkpoint.

When this investigation began, we knew we were getting stuck in the checkpoint codepath, but were unable to determine which step was to blame. Our first fix targeted steps (2) and (3) because they seemed the most suspect, but after this message it became clear to us that it was actually the “compaction” step that was getting stuck. So the package with version 1.4.0-checkpoint-logs-4 contained detection logic to identify this case, and in response it would try to terminate all active processing for this partition so that it could restart.

In the logs I have for this latest reproduction, I do see the detection logic firing, but I made the mistake of failing to guarantee that all resources held by that partition were released / discarded / freed, which nullified the effect of the mitigation. That was my mistake. In my latest release, I forced a local reproduction of the error and checked to make sure all the expected resources got freed up.
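As a rough sketch of the fix described above (hypothetical names, not the actual Netherite code): the termination path has to release everything the partition holds, otherwise the restart never fully takes effect.

```csharp
// Hypothetical sketch; not the actual Netherite code.
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PartitionTermination
{
    // Cancels all in-flight work for a stuck partition and guarantees its resources are released,
    // so that a fresh partition instance can actually take over.
    public static async Task TerminateAsync(CancellationTokenSource partitionCts, params IDisposable[] heldResources)
    {
        try
        {
            partitionCts.Cancel();
            await Task.Yield(); // give cancellation a chance to propagate
        }
        finally
        {
            foreach (var resource in heldResources)
            {
                resource.Dispose(); // e.g. the log, store, and session objects the partition holds
            }
        }
    }
}
```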
All that being said: I realize this issue has been going on for a bit now so I appreciate your cooperation with us. Please do try out the latest package for stability and telemetry’s sake. We’ll continue monitoring the situation from there on.
One final concern I have is why your application is able to trigger this issue so frequently, whereas others aren’t. I haven’t been able to figure that out, but my first hunch (which I’d need to confirm with the Netherite SME) is that it may be related to very low CPU availability for checkpointing, as I would expect compaction to be a CPU heavy operation; but that is to be confirmed.
@davidmrdavid the entity-locking latency issue showed up again 😢 These are some of the affected orchestrations:
{"storeId":"7794ca1e-7e8c-49ef-8e2a-763fc114a99c","courseUid":"d56a1e68-232d-4d78-aa4d-9b60ac020e9b","checkNumber":"18559"}
{"storeId":"3a7aa510-dd2d-11ed-afa1-0242ac120002","courseUid":"471055da-e3a0-40d6-99e4-aecf79b07b0f","checkNumber":"90089"}
{"siteUID":"ab39d38b-41ec-475a-ada1-f87eb6d51d0e","checkNumber":"15569","openedAt":"2023-09-09","courseNumber":"1"}
I found 95 orchestrations in the last 48 hours where the latency is greater than 60 seconds; for the ones referenced above the latency is ~30 minutes, and I see 56 instances where the latency was greater than 1000 seconds. Can you please look into it ASAP? This is happening in production.
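For context, latency figures like these can be captured from inside the orchestrator itself by timestamping around the entity-lock acquisition. A minimal sketch (the entity keys and the 60-second threshold are illustrative, not the app’s actual code):

```csharp
// Sketch only: measuring entity-lock acquisition latency inside a Durable Functions orchestrator.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class OrderSagaOrchestrator
{
    [FunctionName("OrderSagaOrchestrator")]
    public static async Task Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context,
        ILogger log)
    {
        ILogger safeLog = context.CreateReplaySafeLogger(log);

        // Hypothetical entity keys; the entity names follow the ones seen in this thread.
        var originOrder = new EntityId("originorderkey", "some-key");
        var scheduled = new EntityId("scheduledkey", "some-key");

        DateTime before = context.CurrentUtcDateTime;

        // LockAsync acquires all entity locks before the critical section runs.
        using (await context.LockAsync(originOrder, scheduled))
        {
            TimeSpan lockLatency = context.CurrentUtcDateTime - before;
            if (lockLatency > TimeSpan.FromSeconds(60))
            {
                safeLog.LogWarning("Entity locks took {LockLatency} to acquire", lockLatency);
            }

            // ... critical section: interact with the locked entities here ...
        }
    }
}
```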
@vany0114: Btw the new package is out here: https://durabletaskframework.visualstudio.com/Durable Task Framework CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.4.0-checkpoint-logs-4
Please make sure you’re getting version 1.4.0-checkpoint-logs-4.

As for your question on the new partition manager, it should be under storageProvider; thanks for catching that bug in our release notes, I’ll fix it now. The new partition manager is language agnostic, so it is available in all languages so long as the app is not on the Consumption plan 😃
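For clarity, that host.json placement looks roughly like the fragment below; the setting name shown (useTablePartitionManagement) is my assumption about the flag in question, so double-check it against the corrected release notes:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "useTablePartitionManagement": true
      }
    }
  }
}
```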
Interesting. I don’t immediately see the logs I would expect if the auto-mitigation had triggered, so I’m not convinced the graph is showing the auto-mitigation working, but I agree that having a spike in outgoing messages seemed to be a correlated factor as per previous graphs.
Looking at spikes in “partition event log” lengths, I don’t see a spike at around ~11:00 UTC, which would correspond to your EH activity spikes.
However, I do see that at 7:48 UTC partition 8 has an event log of ~500MB, which is moderately large. I’ll be investigating that; perhaps it’s a red herring.
Thanks @vany0114. Let’s monitor the situation for a few days, maybe a bit over a week. If the issue does not repro, then please close this issue. I’ll keep an eye on it. In the meantime, I’ll be working to incorporate the fixes we identified in this thread into the mainline package.
Just to be clear, we do not know for certain if this fix truly addressed the root cause of your problem, as we did not obtain a repro of the issue with the new logs. So we should not let our guard down just yet, but hopefully after a week of monitoring we can be more certain that the issue has been addressed. Thanks!
I’m re-using the same PR I used to generate the private package to prototype this fix: https://github.com/microsoft/durabletask-netherite/pull/301
It’ll need a quick review and then I’ll try to generate a private package
@vany0114:
We already have sufficient evidence to know that the log length size is a problem. What we don’t know is how we get into this situation in the first place: there’s log checkpointing code that should run frequently to prevent this from happening. We are actively going through the code to try and find what conditions could delay or break the checkpointing logic, but they are not obvious to us yet. I agree that ideally we’ll spot the root problem (or come up with an in-code detect-and-react auto-mitigation) without the need for a repro. We’re definitely not waiting for this to simply reproduce before trying to take action.
However, if the issue reproduces, the new logs we’ve added should help us understand what was happening in the checkpointing codepath and why it’s not working as expected. That should give us some critical visibility into how we get into this bad state.
@vany0114:
Changing the taskhub is the right idea, but ideally we should not create multiple taskhubs in the same EventHubs namespace. The easiest way to get this right is to change the EventHubs connection string (that’s what connects to your namespace). If we do this right, it will prevent false positives.
To be clear, having two taskhubs in the same EventHubs namespace is not necessarily a critical error (the application knows to simply ignore / discard messages from the other taskhub) but it can have a temporary effect on performance. By pointing this out, I’m just trying to reduce noise in our logs.
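To make that concrete: with Netherite, the task hub name lives in host.json, while the Event Hubs namespace is selected by the connection string that the storage provider points at, so a clean switch changes that connection (and ideally the hub name) together. A rough host.json sketch follows; the property names reflect my reading of the Netherite docs and the values are placeholders, so treat them as assumptions:

```json
{
  "extensions": {
    "durableTask": {
      "hubName": "ordersaganew",
      "storageProvider": {
        "type": "Netherite",
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      }
    }
  }
}
```

Here, the app setting named by EventHubsConnectionName is what would be repointed at the new Event Hubs namespace.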
Agreed, I don’t think the documentation makes this clear enough.
By far, the clearest error signal from yesterday’s incident, from 3:00 pm to 3:40 pm UTC, was that the application was receiving and having to discard lots of unexpected messages from old taskhubs. This makes sense since there were up to 3 taskhubs at play. So I spent time trying to understand that, until I realized your taskhubs were probably inhabiting the same EventHubs namespace.
I also searched for evidence of a very large “partition event log” but did not immediately find it from 3:00 to 3:40 pm UTC. I’ll do a bit more searching though.
Today I’ll be looking at older repros to see if they also show an exceedingly large partition event log. I’ll also look specifically for “entity locking” logs to see if I can find any operations that seem delayed during yesterday’s incident, but that is difficult without instanceIDs, so I’ll be timeboxing that particular search.
@davidmrdavid That’s correct. The reason you see 3 is that when the func app was deployed it was using the default one, which is ordersagav2. Then, when the problem showed up again, I realized it was using ordersagav2, which is the one that was being used when this problem appeared initially, so I put it back to ordersagaworkaroundv1. I then waited for a while to see if the func app recovered, but it didn’t, so I had to use a new one, ordersagaworkaroundv2, and that’s where it went back to “normal” again.

@vany0114: yes, that’s correct. On the Netherite front, we have as a key symptom that the “event log” is extremely large, and that points to something going wrong in this “checkpointing state machine code path”. So we’ve added a few extra logs there to try and trace through the code to learn why we’re not checkpointing (and therefore resetting the size of) the event log.
That information should help us design a mitigation. We did discuss the possibility of a mitigation attempt with the information we already have, but it’s not sufficient at this time. We’re trying to move this as fast as we can; let me ping you as soon as I’m done with my smoke test.
@vany0114:
I’d say let’s not do that just yet, there’s still some more digging we can do with our telemetry. We’ll probably be slower at understanding your codebase than you and your team, so it may not be the most effective path forward. Let me first narrow down on this log growth issue as per our telemetry.
@vany0114:
I’ll need to think about this; I might push back in a future response, but I understand what you’re saying.
I wanted to provide another update on the Netherite findings. This “event log” keeps track of the entire state of a partition (all instances, outboxes, queues, etc.). We’re supposed to take periodic checkpoints of it so that its size is reset at every checkpoint. We take checkpoints whenever the log exceeds 20MB or 10k events, or whenever 60s have passed since the last checkpoint.
The fact that it grew so large means either (a) that the checkpointing was delayed (due to a bug or some downstream dependency failing), or (b) that the log grew extremely quickly.
For (b), one theory is if you have very large inputs or outputs to your entities or DF APIs. Do you know if that could apply to you?
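Restating that checkpoint policy as code, purely as a paraphrase of the thresholds described above (not the actual Netherite implementation):

```csharp
// Paraphrase of the checkpoint policy described above; not the actual Netherite code.
using System;

public static class CheckpointPolicy
{
    public static bool ShouldCheckpoint(long logSizeBytes, long eventsSinceLastCheckpoint, TimeSpan timeSinceLastCheckpoint)
    {
        return logSizeBytes > 20 * 1024 * 1024                          // log larger than ~20MB
            || eventsSinceLastCheckpoint > 10_000                        // more than 10k events appended
            || timeSinceLastCheckpoint >= TimeSpan.FromSeconds(60);      // or 60s since the last checkpoint
    }
}
```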
@vany0114: I just heard back from the Netherite SME.
For the incident from 8/22 23:28 -> 8/22 00:30, we had discussed that partition 4 was being re-assigned to different workers rather often and that this can cause a slowdown in processing. The question we had was: why is partition 4 moving between workers so much?
I received an internal query showing that the length of the partition event log replay is massive (1.7GB), which is highly anomalous and can cause OOMs. Those OOMs can explain the partition being re-assigned. The question we’re investigating now is how this could get so big. I’m asking what the normal size of this log is under regular operation, but it should definitely not be in the GB range.
That’s all I have on that right now. We’re investigating it further.
Only when switched to Netherite.
Yep, that’s correct. When there’s low traffic, or low activity of Azure Functions themselves, the Scale Controller will start removing instances.
I would expect that your portal or Application Insights would have a visualization of this, but don’t worry if you don’t have immediate access to this information; for the cases I’ve checked so far, I can already see internally that there are scale-in operations during the time of this incident. I just wanted to triple-check whether you had observed the same.
Anyways, I’m in the process of gathering up my internal queries to share them with the team for further validation. Still, my current theory is that you’re hitting different problems per backend, but it’s just that they have a similar user-facing consequence. I think this is consistent with the fact that the issue auto-heals in Netherite but it does not in Azure Storage. I’ll provide another update tomorrow.
In the meantime, what’s the frequency of this issue with Netherite? Are you seeing it at random, daily, or rarely? That will help me determine next steps.
@nytian unfortunately I don’t have that information since we’re not using the Azure Storage backend anymore, and my understanding is that the V3 partition manager is only for the storage account backend.
Hi, sorry for the late response. We are still working to identify the root cause.
First, for the Azure Storage backend with the old partition manager: based on the orchestration instance ID you provided, it looks like the partition for control queue 01 couldn’t be handed over to another worker for several hours. @davidmrdavid is working on creating a private package to temporarily mitigate this case, and we will hopefully give the private package to you tomorrow.
Since the issue hits on both versions of the partition manager, we don’t know whether the above root cause is the same in the new partition manager. So, could you provide us any orchestration instance ID / TaskHub / timestamp that hit this issue with partition manager V3? It’s fine even if the storage account has been deleted, since we keep the Kusto logs in a separate storage account. That would be helpful for me to identify the cause in the partition manager V3 scenario.
It’s not intermittent; it was happening with basically every request processed by the orchestrator. Here are two more examples where those instances were stuck for hours.
{"id":"3107849517381639","businessDate":"2023-08-07T00:00:00","locationToken":"5zAQ1KZzYkqmea9XbBkLbA=="}
This one was stuck for more than 10 hours.
{"id":"3107849515304964","businessDate":"2023-08-07T00:00:00","locationToken":"5zAQ1KZzYkqmea9XbBkLbA=="}
And this one for more than 6 hours.

As a result, our processed-orders rate was affected; you can see here how dramatically it dropped because the orchestrator was holding almost all of them.
@davidmrdavid please let me know if the information provided helps
I’m not able to draw a correlation from this observation to the symptoms you experienced, so let’s ignore this for now.
Also, I’ve been focusing on this set of data, from the other thread:
So that orchestrator is stuck from about ~11:33 to ~11:54, and I see in my telemetry that EntityId @scheduledkey@615c29df-5e7c-458c-9486-ef9480d2bf3b_2023-08-07_90007-1 is the last entity to respond to that lock request, precisely at about ~11:54, whereas the others responded around ~11:33. So the question became: why did it take us so long to acquire the lock on this final entity?

Well, I see that entity is managed by control queue 01. According to our telemetry, control queue 01 was being listened to by “VM 19” from ~13:15 to ~13:31. Then, no VM is actively listening to new messages from that queue until ~13:54, when “VM 20” takes over.
To summarize, it appears that no new messages are being processed from that queue from ~13:31 to ~13:54. So now what I need to answer is the following: (1) Why does VM 19 stop listening to this queue in the first place? Normally this happens because the VM is in the process of shutting down: another worker may steal ownership of its assigned queues, and the VM will therefore stop listening for new messages. (2) Why does it take so long for VM 20 to start listening to queue 01? My guess is that VM 19 took too long to restart, which prevented VM 20 from being able to listen to that queue (only one VM can listen to a single queue at a time, otherwise we risk “split brain” issues).
That’s what I’ve been able to find so far. Assuming I understand the problem correctly, this means, to me, that simply restarting your app should suffice to mitigate this situation when using the Azure Storage backend.
I’ll continue looking into the logs to answer the two open questions above; I’ll keep you posted.
Actually, I’m seeing abnormally high message ages in “control queue 3”. Let me drill into that. If that queue is hosting any of the entities you’re trying to lock, maybe that’s to blame.
@vany0114: that shouldn’t be an issue on our end as our historical telemetry is stored independently of our app’s live status, so you’re good to go ahead 😃.
I’m not certain at this point in time that this is an issue with the storage provider itself, so I can’t comment on whether changing to Netherite will help, but it probably won’t hurt to try it out and may tell us more about the problem. Just an FYI.