azure-functions-durable-extension: Orchestration getting stuck while getting the lock
Description
The full description, with code snippets, screenshots, issue samples, etc., is here: https://github.com/Azure/azure-functions-durable-extension/discussions/2530
Expected behavior
Acquire the lock in seconds at most, not minutes or hours.
Actual behavior
The orchestration appears to be getting stuck while acquiring the locks of the entities involved in the orchestration.
Known workarounds
Reset the durable storage account and the func app storage account
App Details
- Durable Functions extension version (e.g. v1.8.3): 2.10.0
- Azure Functions runtime version (1.0 or 2.0): 2
- Programming language used: C#
If deployed to Azure
- Timeframe issue observed:
- Function App name: orders-saga
- Azure region: West US
- Azure storage account name: ordersagav2
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 128 (61 by maintainers)
I have a longer update to provide, but I wanted to very quickly let you know that the new logs tell me with high precision what went wrong and I’m working to release another private package.
Once I have it, I’ll post it with my analysis of the most recent incident.
@vany0114: That makes sense. I don’t believe at this time that this is an issue caused by EH throughput.
I’ll keep you posted on this thread regarding a new mitigation package but I don’t think I’ll be able to provide it today: I’m trying to first look at the logs to understand what’s happening here. I want to avoid giving you a package that makes things worse. At this point, I can already confirm again, from telemetry, that the mitigation is working as expected, and I can also see the error you report about client operations getting cancelled due to shutdowns, but I have yet to determine the end-to-end root cause of the shutdown.
I see there was a spike of these issues starting on 9/26; I’m diving deeper into that next.
@vany0114:
The Netherite SME and I have identified an area of the code that we think explains the behavior we’re seeing with your app. To put it succinctly, it’s possible for the “FASTER” key-value store (one of our dependencies) to silently refuse to start the checkpointing process, and Netherite currently does not react to this. We’re prototyping a patch to account for this possibility and eventually terminate the partition if FASTER has continuously refused for over a minute to checkpoint the partition event log (which, as you know, otherwise grows too large).
We don’t have proof that this is exactly what is happening with your app, but it’s the most realistic theory that is consistent with the logs we have. If we had a repro with the new package, we’d be able to confirm for certain that this is the problem, but until then, this mitigation is our best bet. I’m prototyping this now and trying to provide it to you as fast as possible.
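To illustrate the shape of that mitigation, here is a minimal sketch of such a watchdog (not the actual Netherite patch; all type and member names are hypothetical): if no checkpoint completes within the time limit, the partition is terminated so it can restart cleanly.

```csharp
// Hypothetical sketch of a checkpoint watchdog; not the actual Netherite implementation.
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class CheckpointWatchdog : IDisposable
{
    private readonly TimeSpan limit = TimeSpan.FromMinutes(1);
    private readonly Func<Task> terminatePartitionAsync; // hypothetical callback that restarts the partition
    private readonly Timer timer;
    private long lastCheckpointTicks = DateTime.UtcNow.Ticks;

    public CheckpointWatchdog(Func<Task> terminatePartitionAsync)
    {
        this.terminatePartitionAsync = terminatePartitionAsync;
        this.timer = new Timer(_ => this.Check(), null, TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(10));
    }

    // Called whenever a full checkpoint completes successfully.
    public void OnCheckpointCompleted()
        => Interlocked.Exchange(ref this.lastCheckpointTicks, DateTime.UtcNow.Ticks);

    private void Check()
    {
        var last = new DateTime(Interlocked.Read(ref this.lastCheckpointTicks), DateTimeKind.Utc);
        if (DateTime.UtcNow - last > this.limit)
        {
            // No checkpoint has completed for too long: restart the partition.
            _ = this.terminatePartitionAsync();
        }
    }

    public void Dispose() => this.timer.Dispose();
}
```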
@vany0114: that test took longer than expected as I needed to remove a rather noisy log from my first private package and then re-test the new payload, but I’m done validating it.
Please use this new private package (make sure the suffix is -checkpoint-logs-2 instead of -checkpoint-logs-1): https://durabletaskframework.visualstudio.com/Durable Task Framework CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite.AzureFunctions/overview/1.4.0-checkpoint-logs-2

To be able to download this package, you will probably need to add the corresponding feed as a source in your nuget.config (a sketch of what that entry looks like follows below).

Please let me know once you deploy a new version of your app with this dependency change, and please let me know once you identify a repro with it (and provide us with the usual identifying information) so that we can take a look at our new telemetry. Thank you so much for your assistance.
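For reference, a nuget.config with such a source added generally looks like the sketch below; the feed URL shown is a placeholder rather than the actual URL from the comment above, so substitute the real one:

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" />
    <!-- Placeholder URL: replace with the durabletask Azure DevOps feed from the comment above -->
    <add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/.../nuget/v3/index.json" />
  </packageSources>
</configuration>
```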
Note that I’m not the Netherite expert on our team, but I still wanted to share my observations so far; note that they are subject to correction:
I’m looking at the logs for
{"storeId":"7794ca1e-7e8c-49ef-8e2a-763fc114a99c","courseUid":"cb03b4f3-a26a-4465-84d1-9e44a7ae65e2","checkNumber":"28584"}
and I’m seeing something that seems rather Netherite-specific: the partition hosting EntityId “@ticketnumberkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1” (partition 7) appears to be assigned to workers in quick succession from ~8:30 UTC to ~9:22 UTC.

My theory is that since this partition is constantly getting re-assigned to new owner workers, we are not able to effectively process its messages, and therefore your locking request is taking a long while to process. At ~9:29, partition 7 finally gets a stable worker assignment and it processes the lock request for “@originorderkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1”, which in turn requests the lock for the next entity “@scheduledkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1”, and so on until all entities are locked and the original orchestrator is unblocked. This process occurs quickly once we’re able to obtain the lock for “@originorderkey@7794ca1e-7e8c-49ef-8e2a-763fc114a99c_2023-08-20_28584-1” at ~9:22, so I’m currently thinking that the frequent re-assignment of partition 7 across workers was to blame for the delay.
I do not know why partition 7 is being re-assigned to workers so often. I will need assistance from the Netherite SME to figure that out.
I’m going to analyze your other Netherite instanceIDs to see if the same behavior is to blame in those ones as well. From there, the next steps should be clearer.
@vany0114: I responded in our new thread: https://github.com/Azure/azure-functions-durable-extension/issues/2603#issuecomment-1751435249
@vany0114: Yes, I’ve been back since this morning; apologies for the lack of follow-up here. I’m going to quickly look at making the mitigation more aggressive (less time to fire) and provide you an update on the restarts after reviewing the telemetry around them. My hunch, based on previous observations, is that the restarts are occurring due to OOMs/CPU consumption in the app, but I have yet to confirm that. If that hunch is correct, upgrading to a larger VM size may be the right solution here; but that’s just my unsubstantiated first impression.
Hey @vany0114: Just wanted to provide an update here.
I just took a look at the behavior of your app, in relation to this issue, over the last few days. I notice the mitigation firing somewhat regularly (which isn’t ideal; it tells me your application may be unstable for other reasons), but the mitigation does appear to be doing its job and keeping your partition log length in check.
I’ll keep an eye out on its behavior for a few more days. If the issue has not reproduced by Friday EOD, I was hoping to close this issue as well as the support ticket associated with this thread. In the meantime, the team will be working to merge the mitigation we shared with you into the mainline Netherite package. If the issue re-occurs, you can re-open this thread and we’ll continue the conversation, though hopefully that will not be necessary 😃.
The latest private package is now out under version 1.4.0-checkpoint-logs-5: https://durabletaskframework.visualstudio.com/Durable Task Framework CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite.AzureFunctions/overview/1.4.0-checkpoint-logs-5

As for the analysis:
I visualized your partition log lengths over the past 2 days and encountered 2 peaks. The first peak is for partition 10, which has a log length of ~1.4GB at 9/9 9:44, and then we have partition 9 which has a log length of ~4GB at 9/10 11:59. This is broadly visualized in the graph below.
I decided to focus on the second peak, for partition 9.
Narrowing in on that, I decided to check how many times per hour the checkpoint state machine returns to its initial state, which means a full checkpoint occurred. It’s clear from this that checkpointing stopped occurring some time after 9/9 16:00 until 9/10 12:00. Below is a table showing the number of times checkpointing occurred per one-hour bin from 9/9 13:00 to 9/10 13:00. From there, you can see there are several hours without any checkpoints between 9/9 16:00 and 9/10 12:00.
[Table of checkpoint counts per one-hour bin from 9/9 13:00 to 9/10 13:00; not reproduced here.]
Now this is where the custom logs come in to help us dig deeper, but first I need to clarify that this checkpointing logic has 3 steps: (1) the “compaction” step, (2) the index checkpoint, and (3) the store checkpoint.

When this investigation began, we knew we were getting stuck in the checkpoint codepath, but were unable to determine which step was to blame. Our first fix targeted steps (2) and (3) because they seemed the most suspect, but after this message it became clear to us that it was actually the “compaction” step that was getting stuck. So the package with version 1.4.0-checkpoint-logs-4 contained detection logic to identify this case, and in response it would try to terminate all active processing for this partition so that it could restart.

In the logs I have for this latest reproduction, I do see the detection logic firing, but I made the mistake of failing to guarantee that all resources held by that partition were released / discarded / freed, which nullified the effect of the mitigation. That was my mistake. In my latest release, I forced a local reproduction of the error and checked to make sure all the expected resources got freed up.
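As a rough sketch of the fix described above (hypothetical names, not the actual Netherite code): the termination path has to release everything the partition holds, otherwise the restart never fully takes effect.

```csharp
// Hypothetical sketch; not the actual Netherite code.
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PartitionTermination
{
    // Cancels all in-flight work for a stuck partition and guarantees its resources are released,
    // so that a fresh partition instance can actually take over.
    public static async Task TerminateAsync(CancellationTokenSource partitionCts, params IDisposable[] heldResources)
    {
        try
        {
            partitionCts.Cancel();
            await Task.Yield(); // give cancellation a chance to propagate
        }
        finally
        {
            foreach (var resource in heldResources)
            {
                resource.Dispose(); // e.g. the log, store, and session objects the partition holds
            }
        }
    }
}
```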
All that being said: I realize this issue has been going on for a bit now so I appreciate your cooperation with us. Please do try out the latest package for stability and telemetry’s sake. We’ll continue monitoring the situation from there on.
One final concern I have is why your application is able to trigger this issue so frequently, whereas others aren’t. I haven’t been able to figure that out, but my first hunch (which I’d need to confirm with the Netherite SME) is that it may be related to very low CPU availability for checkpointing, as I would expect compaction to be a CPU heavy operation; but that is to be confirmed.
@davidmrdavid the entity-locking latency issue showed up again 😢 These are some of the affected orchestrations:
{"storeId":"7794ca1e-7e8c-49ef-8e2a-763fc114a99c","courseUid":"d56a1e68-232d-4d78-aa4d-9b60ac020e9b","checkNumber":"18559"}
{"storeId":"3a7aa510-dd2d-11ed-afa1-0242ac120002","courseUid":"471055da-e3a0-40d6-99e4-aecf79b07b0f","checkNumber":"90089"}
{"siteUID":"ab39d38b-41ec-475a-ada1-f87eb6d51d0e","checkNumber":"15569","openedAt":"2023-09-09","courseNumber":"1"}
I found 95 orchestrations in the last 48 hours where the latency is greater than 60 seconds; for the ones referenced above the latency is ~30 minutes, and I see 56 instances where the latency was greater than 1000 seconds. Can you please look into it ASAP? This is happening in production.
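For context, latency figures like these can be captured from inside the orchestrator itself by timestamping around the entity-lock acquisition. A minimal sketch (the entity keys and the 60-second threshold are illustrative, not the app’s actual code):

```csharp
// Sketch only: measuring entity-lock acquisition latency inside a Durable Functions orchestrator.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class OrderSagaOrchestrator
{
    [FunctionName("OrderSagaOrchestrator")]
    public static async Task Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context,
        ILogger log)
    {
        ILogger safeLog = context.CreateReplaySafeLogger(log);

        // Hypothetical entity keys; the entity names follow the ones seen in this thread.
        var originOrder = new EntityId("originorderkey", "some-key");
        var scheduled = new EntityId("scheduledkey", "some-key");

        DateTime before = context.CurrentUtcDateTime;

        // LockAsync acquires all entity locks before the critical section runs.
        using (await context.LockAsync(originOrder, scheduled))
        {
            TimeSpan lockLatency = context.CurrentUtcDateTime - before;
            if (lockLatency > TimeSpan.FromSeconds(60))
            {
                safeLog.LogWarning("Entity locks took {LockLatency} to acquire", lockLatency);
            }

            // ... critical section: interact with the locked entities here ...
        }
    }
}
```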
@vany0114: Btw the new package is out here: https://durabletaskframework.visualstudio.com/Durable Task Framework CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.4.0-checkpoint-logs-4
Please make sure you’re getting version 1.4.0-checkpoint-logs-4.

As for your question on the new partition manager, it should be under storageProvider; thanks for catching that bug in our release notes, I’ll fix it now. The new partition manager is language agnostic, so it is available in all languages so long as the app is not on the Consumption plan 😃
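For clarity, that host.json placement looks roughly like the fragment below; the setting name shown (useTablePartitionManagement) is my assumption about the flag in question, so double-check it against the corrected release notes:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "useTablePartitionManagement": true
      }
    }
  }
}
```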
Interesting. I don’t immediately see the logs I would expect if the auto-mitigation had triggered, so I’m not convinced the graph is showing the auto-mitigation working, but I agree that having a spike in outgoing messages seemed to be a correlated factor as per previous graphs.
Looking at spikes in “partition event log” lengths, I don’t see a spike at around ~11:00 UTC, which would correspond to your EH activity spikes.
However, I do see that at 7:48 UTC partition 8 has an event log of ~500MB, which is moderately large. I’ll be investigating that; perhaps it’s a red herring.
Thanks @vany0114. Let’s monitor the situation for a few days, maybe a bit over a week. If the issue does not repro, then please close this issue. I’ll keep an eye on it. In the meantime, I’ll be working to incorporate the fixes we identified in this thread into the mainline package.
Just to be clear, we do not know for certain if this fix truly addressed the root cause of your problem, as we did not obtain a repro of the issue with the new logs. So we should not let our guard down just yet, but hopefully after a week of monitoring we can be more certain that the issue has been addressed. Thanks!
I’m re-using the same PR I used to generate the private package to prototype this fix: https://github.com/microsoft/durabletask-netherite/pull/301
It’ll need a quick review and then I’ll try to generate a private package
@vany0114:
We already have sufficient evidence to know that the log length size is a problem. What we don’t know is how we get into this situation in the first place: there’s log checkpointing code that should run frequently to prevent this from happening. We are actively going through the code to try and find what conditions could delay or break the checkpointing logic, but they are not obvious to us yet. I agree that ideally we’ll spot the root problem (or come up with an in-code detect-and-react auto-mitigation) without the need for a repro. We’re definitely not waiting for this to simply reproduce before trying to take action.
However, if the issue reproduces, the new logs we’ve added should help us understand what was happening in the checkpointing codepath and why it’s not working as expected. That should give us some critical visibility into how we get into this bad state.
@vany0114:
Changing the taskhub is the right idea, but ideally we should not create multiple taskhubs in the same EventHubs namespace. The easiest way to get this right is to change the EventHubs connection string (that’s what connects to your namespace). If we do this right, it will prevent false positives.
To be clear, having two taskhubs in the same EventHubs namespace is not necessarily a critical error (the application knows to simply ignore / discard messages from the other taskhub) but it can have a temporary effect on performance. By pointing this out, I’m just trying to reduce noise in our logs.
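To make that concrete: with Netherite, the task hub name lives in host.json, while the Event Hubs namespace is selected by the connection string that the storage provider points at, so a clean switch changes that connection (and ideally the hub name) together. A rough host.json sketch follows; the property names reflect my reading of the Netherite docs and the values are placeholders, so treat them as assumptions:

```json
{
  "extensions": {
    "durableTask": {
      "hubName": "ordersaganew",
      "storageProvider": {
        "type": "Netherite",
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      }
    }
  }
}
```

Here, the app setting named by EventHubsConnectionName is what would be repointed at the new Event Hubs namespace.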
Agreed, I don’t think the documentation makes this clear enough.
By far, the clearest error signal from yesterday’s incident, from 3:00 pm to 3:40 pm UTC, was that the application was receiving and having to discard lots of unexpected messages from old taskhubs. This makes sense since there were up to 3 taskhubs at play. So I spent time trying to understand that, until I realized your taskhubs were probably inhabiting the same EventHubs namespace.
I also searched for evidence of a very large “partition event log” but did not immediately find it from 3:00 to 3:40 pm UTC. I’ll do a bit more searching though.
Today I’ll be looking at older repros to see if they also show an exceedingly large partition event log. I’ll also look specifically for “entity locking” logs to see if I can find any operations that seem delayed during yesterday’s incident, but that is difficult without instanceIDs, so I’ll be timeboxing that particular search.
@davidmrdavid That’s correct. The reason you see 3 is that when the func app was deployed it was using the default one, which is ordersagav2. Then, when the problem showed up again, I realized it was using ordersagav2, which is the one that was being used when this problem appeared initially, so I put it back to ordersagaworkaroundv1. I then waited for a while to see if the func app recovered, but it didn’t, so I had to use a new one, ordersagaworkaroundv2, and that’s where it went back to “normal” again.

@vany0114: yes, that’s correct. On the Netherite front, we have as a key symptom that the “event log” is extremely large, and that points to something going wrong in this “checkpointing state machine code path”. So we’ve added a few extra logs there to try and trace through the code to learn why we’re not checkpointing (and therefore resetting the size of) the event log.
That information should help us design a mitigation. We did discuss the possibility of a mitigation attempt with the information we already have, but it’s not sufficient at this time. We’re trying to move this as fast as we can; let me ping you as soon as I’m done with my smoke test.
@vany0114:
I’d say let’s not do that just yet, there’s still some more digging we can do with our telemetry. We’ll probably be slower at understanding your codebase than you and your team, so it may not be the most effective path forward. Let me first narrow down on this log growth issue as per our telemetry.
@vany0114:
I’ll need to think about this; I might push back in a future response, but I understand what you’re saying.
I wanted to provide another update on the Netherite findings. This “event log” keeps track of the entire state of a partition (all instances, outboxes, queues, etc.). We’re supposed to take periodic checkpoints of it so that its size is reset at every checkpoint. We take checkpoints whenever the log exceeds 20MB or 10k events, or whenever 60s have passed since the last checkpoint.
The fact that it grew so large means either (a) that the checkpointing was delayed (due to a bug or some downstream dependency failing), or (b) that the log grew extremely quickly.
For (b), one theory is if you have very large inputs or outputs to your entities or DF APIs. Do you know if that could apply to you?
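Restating that checkpoint policy as code, purely as a paraphrase of the thresholds described above (not the actual Netherite implementation):

```csharp
// Paraphrase of the checkpoint policy described above; not the actual Netherite code.
using System;

public static class CheckpointPolicy
{
    public static bool ShouldCheckpoint(long logSizeBytes, long eventsSinceLastCheckpoint, TimeSpan timeSinceLastCheckpoint)
    {
        return logSizeBytes > 20 * 1024 * 1024                          // log larger than ~20MB
            || eventsSinceLastCheckpoint > 10_000                        // more than 10k events appended
            || timeSinceLastCheckpoint >= TimeSpan.FromSeconds(60);      // or 60s since the last checkpoint
    }
}
```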
@vany0114: I just heard back from the Netherite SME.
For the incident from 8/22 23:28 -> 8/22 00:30, we had discussed that partition 4 was being re-assigned to different workers rather often and that this can cause a slowdown in processing. The question we had was: why is partition 4 moving between workers so much?
I received an internal query showing that the length of the partition event log replay is massive (1.7GB), which is highly anomalous and can cause OOMs. Those OOMs can explain the partition being re-assigned. The question we’re investigating now is how this could get so big. I’m asking what the normal size of this log is under regular operation, but it should definitely not be in the GB range.
That’s all I have on that right now. We’re investigating it further.
Only when switched to Netherite.
Yep, that’s correct. When there’s low traffic, or low activity of Azure Functions themselves, the Scale Controller will start removing instances.
I would expect that your portal or Application Insights would have a visualization of this, but don’t worry if you don’t have immediate access to this information; for the cases I’ve checked so far, I can already see internally that there are scale-in operations during the time of this incident. I just wanted to triple-check whether you had observed the same.
Anyways, I’m in the process of gathering up my internal queries to share them with the team for further validation. Still, my current theory is that you’re hitting different problems per backend, but it’s just that they have a similar user-facing consequence. I think this is consistent with the fact that the issue auto-heals in Netherite but it does not in Azure Storage. I’ll provide another update tomorrow.
In the meantime, what’s the frequency of this issue with Netherite? Are you seeing it at random, daily, or rarely? That will help me determine next steps.
@nytian unfortunately I don’t have that information since we’re not using the Azure Storage backend anymore, and my understanding is that the V3 partition manager is only for the storage account backend.
Hi, sorry for the late response. We are still working to identify the root cause.
First, for the Azure Storage backend with the old partition manager: based on the orchestration instance ID you provided, it looks like the partition for control queue 01 couldn’t be handed over to another worker for several hours. @davidmrdavid is working on creating a private package to temporarily mitigate this case, and we will hopefully give the private package to you tomorrow.
Since the issue hits on both versions of the partition manager, we don’t know whether the above root cause is the same in the new partition manager. So, could you provide us any orchestration instance ID / TaskHub / timestamp that hit this issue with partition manager V3? It’s fine even if the storage account has been deleted, since we keep the Kusto logs in a separate storage account. That would be helpful for me to identify the cause in the partition manager V3 scenario.
It’s not intermittent; it was happening with basically every request processed by the orchestrator. Here are two more examples where those instances were stuck for hours.
{"id":"3107849517381639","businessDate":"2023-08-07T00:00:00","locationToken":"5zAQ1KZzYkqmea9XbBkLbA=="}
This one was stuck for more than 10 hours.
{"id":"3107849515304964","businessDate":"2023-08-07T00:00:00","locationToken":"5zAQ1KZzYkqmea9XbBkLbA=="}
And this one for more than 6 hours.

As a result, our processed-orders rate was affected; you can see here how dramatically it dropped because the orchestrator was holding almost all of them.
@davidmrdavid please let me know if the information provided helps
I’m not able to draw a correlation from this observation to the symptoms you experienced, so let’s ignore this for now.
Also, I’ve been focusing on this set of data, from the other thread:
So that orchestrator is stuck from about ~11:33 to ~11:54, and I see in my telemetry that EntityId @scheduledkey@615c29df-5e7c-458c-9486-ef9480d2bf3b_2023-08-07_90007-1 is the last entity to respond to that lock request, precisely at about ~11:54, whereas the others responded around ~11:33. So the question became: why did it take us so long to acquire the lock on this final entity?

Well, I see that entity is managed by control queue 01. According to our telemetry, control queue 01 was being listened to by “VM 19” from ~13:15 to ~13:31. Then, no VM is actively listening to new messages from that queue until ~13:54, when “VM 20” takes over.
To summarize, it appears that no new messages are being processed from that queue from ~13:31 to ~13:54. So now what I need to answer is the following: (1) Why does VM 19 stop listening to this queue in the first place? Normally this happens because the VM is in the process of shutting down: another worker may steal ownership of its assigned queues, and the VM will therefore stop listening for new messages. (2) Why does it take so long for VM 20 to start listening to queue 01? My guess is that VM 19 took too long to restart, which prevented VM 20 from being able to listen to that queue (only one VM can listen to a single queue at a time, otherwise we risk “split brain” issues).
That’s what I’ve been able to find so far. Assuming I understand the problem correctly, this means, to me, that simply restarting your app should suffice to mitigate this situation when using the Azure Storage backend.
I’ll continue looking into the logs to answer the two open questions above; I’ll keep you posted.
Actually, I’m seeing abnormally high message ages in “control queue 3”. Let me drill into that. If that queue is hosting any of the entities you’re trying to lock, maybe that’s to blame.
@vany0114: that shouldn’t be an issue on our end as our historical telemetry is stored independently of our app’s live status, so you’re good to go ahead 😃.
I’m not certain at this point in time that this is an issue with the storage provider itself, so I can’t comment on whether changing to Netherite will help, but it probably won’t hurt to try it out and may tell us more about the problem. Just an FYI.