azure-functions-durable-extension: Memory leak when debugger attached
Test Setup
I created a function app that fans out to 1,000 sub-orchestrations (each of which runs a simple hello-world sequence) and ran it locally for several hours using ContinueAsNew.
Top-Level Orchestration
The top-most orchestration calls a single sub-orchestration in a ContinueAsNew loop:
[FunctionName(nameof(RunContinuousSubOrchestratorFanOut))]
public static async Task RunContinuousSubOrchestratorFanOut(
    [OrchestrationTrigger] DurableOrchestrationContext ctx,
    TraceWriter log)
{
    int parallelSubOrchestrations = ctx.GetInput<int>();
    await ctx.CallSubOrchestratorAsync(nameof(SubOrchestratorFanOut), parallelSubOrchestrations);

    // Wait 2 minutes before starting the next run.
    if (!ctx.IsReplaying) log.Info("Sleeping for 2 minutes.");
    await ctx.CreateTimer(ctx.CurrentUtcDateTime.AddMinutes(2), CancellationToken.None);

    // Restart the orchestration with the same fan-out count.
    ctx.ContinueAsNew(parallelSubOrchestrations);
}
Fan-Out Orchestration
This orchestration fans out to 1,000 sub-orchestrations.
[FunctionName(nameof(SubOrchestratorFanOut))]
public static async Task SubOrchestratorFanOut([OrchestrationTrigger] DurableOrchestrationContext ctx)
{
    int parallelSubOrchestrations = ctx.GetInput<int>();

    IEnumerable<Task> subOrchestrationTasks = Enumerable.Range(0, parallelSubOrchestrations).Select(i =>
        ctx.CallSubOrchestratorAsync(
            functionName: nameof(HelloSequenceOrchestration.HelloSequence),
            input: i));

    Task allSubOrchestrationsCompletedTask = Task.WhenAll(subOrchestrationTasks);

    using (var cts = new CancellationTokenSource())
    {
        Task timeoutTask = ctx.CreateTimer(ctx.CurrentUtcDateTime.AddHours(1), cts.Token);
        if (timeoutTask == await Task.WhenAny(allSubOrchestrationsCompletedTask, timeoutTask))
        {
            throw new TimeoutException("One or more sub-orchestrations failed to complete within an hour!");
        }
        else
        {
            // Cancel the durable timer so the orchestration can complete.
            cts.Cancel();
        }
    }
}
Child Orchestration (simple sequence)
This is the innermost orchestration: a simple sequence of activity functions.
[FunctionName(nameof(HelloSequence))]
public static async Task<List<string>> HelloSequence(
    [OrchestrationTrigger] DurableOrchestrationContextBase context)
{
    var outputs = new List<string>();

    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "Tokyo"));
    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "Seattle"));
    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "London"));

    // returns ["Hello Tokyo!", "Hello Seattle!", "Hello London!"]
    return outputs;
}
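The E1_SayHello activity itself isn't included in the report; based on the standard Durable Functions samples, it is presumably equivalent to this minimal sketch:

[FunctionName("E1_SayHello")]
public static string SayHello([ActivityTrigger] string name)
{
    // Echo back a greeting for the supplied city name.
    return $"Hello {name}!";
}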
Results
After a few hours, I observed several OutOfMemoryException errors and noticed that the process working set of func.exe grew continuously throughout the run. Here is the call stack:
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Collections.Generic.Dictionary`2.Resize(Int32 newSize, Boolean forceNewHashCodes)
at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
at System.Threading.Tasks.Task.AddToActiveTasks(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.OutputWaitEtwEvents(Task task, Action continuation)
at System.Runtime.CompilerServices.TaskAwaiter.OnCompletedInternal(Task task, Action continuation, Boolean continueOnCapturedContext, Boolean flowExecutionContext)
at System.Runtime.CompilerServices.TaskAwaiter`1.UnsafeOnCompleted(Action continuation)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AwaitUnsafeOnCompleted[TAwaiter,TStateMachine](TAwaiter& awaiter, TStateMachine& stateMachine)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.<>c.<ThrowAsync>b__6_1(Object state)
at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()
This call stack seems to point to Task.s_currentActiveTasks being unable to expand. This dictionary tracks all currently running tasks in the app domain.
Investigation
Looking at allocations in the Visual Studio debugger, I noticed a large number of tasks that were not getting garbage collected. My understanding is that task-related memory leaks are possible when tasks are created but never completed. I believe that is exactly what TaskOrchestrationExecutor does: it starts tasks, saves them in the TaskOrchestrationExecutor.result field, and never completes them. The executor objects themselves can obviously be GC'd, but I don't think that is sufficient to ensure that the Task objects get GC'd.
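To illustrate the suspected mechanism with a minimal standalone sketch (this is not the actual TaskOrchestrationExecutor code, and the class and method names are mine): when async debugging is enabled by an attached debugger, awaiting an incomplete task registers it in the debug-only Task.s_currentActiveTasks table, and the entry is only removed when the task completes. A task that is never completed therefore stays rooted for the life of the process.

using System;
using System.Threading;
using System.Threading.Tasks;

class NeverCompletedTaskDemo
{
    static async Task Main()
    {
        // Run with a debugger attached and watch the working set grow;
        // without one, these tasks become unreachable and can be collected.
        for (int i = 0; i < 100_000; i++)
        {
            var tcs = new TaskCompletionSource<object>();

            // Awaiting the incomplete task adds it to the debug-only
            // active-task table. SetResult is never called, so the entry
            // is never removed and the Task cannot be garbage collected.
            _ = AwaitForever(tcs.Task);
        }

        Console.WriteLine("Tasks created; inspect memory now.");
        await Task.Delay(Timeout.Infinite);
    }

    static async Task AwaitForever(Task task) => await task;
}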
Workaround
The root cause appears to be tasks that are created but never completed, which only happens during orchestration replay. For that reason, the leak can be partially mitigated by enabling extended sessions, as described in the Performance & Scale documentation: https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-perf-and-scale#orchestrator-function-replay
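For reference, extended sessions are enabled in host.json. A minimal sketch, assuming Durable Functions 2.x (where the settings live under extensions.durableTask; in 1.x they sit under a top-level durableTask section):

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "extendedSessionsEnabled": true,
      "extendedSessionIdleTimeoutInSeconds": 30
    }
  }
}

Extended sessions keep orchestrations in memory between messages instead of replaying them from history on every event, which reduces how many of these never-completed replay tasks get created.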
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 2
- Comments: 21 (4 by maintainers)
Commits related to this issue
- Address OutOfMemory exception on durable tasks Durable function apps can throw out of memory exceptions if there's a lot of data being passed from an orchestrator to an activity. This is one recommen... — committed to indiana-university/itpeople-functions-v2 by jhoerr 3 years ago
- Address OutOfMemory exception on durable tasks (#17) Durable function apps can throw out of memory exceptions if there's a lot of data being passed from an orchestrator to an activity. This is one re... — committed to indiana-university/itpeople-functions-v2 by jhoerr 3 years ago
Here was my app without a debugger attached, running for two hours:

[memory usage screenshot omitted]

Here was my app with a debugger attached:

[memory usage screenshot omitted]
Note that attaching the debugger substantially slowed down the processing of sub-orchestrations, so the steady memory increase we observed occurred over far fewer ContinueAsNew iterations. I think this very clearly shows that the memory leak in this sample only really happens with the debugger attached, as expected with s_currentActiveTasks being the culprit. I will rename the issue accordingly to accurately reflect this.

I think that in order to fix this, we would need to work with the Roslyn team. In the GitHub issue associated with my previously linked Stack Overflow question, Stephen Toub commented: “Regardless of whether the implementation or debugger is doing the wrong thing here around explicitly removing the task from the debug-only s_currentActiveTasks table, we should change s_currentActiveTasks to be a ConditionalWeakTable or similar rather than a Dictionary, so that it’s not rooting tasks unnecessarily.” When I looked at the current source code, this field was still a Dictionary, so we can follow up with that team to see whether making that change would fix this issue.
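To illustrate the difference Toub is describing, here is a minimal sketch (my own demo code, not the runtime's): a Dictionary keyed on Task keeps every entry strongly reachable, while a ConditionalWeakTable holds its keys weakly, so an otherwise-unreferenced task can still be collected.

using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

class RootingDemo
{
    static readonly Dictionary<Task, int> strongTable = new Dictionary<Task, int>();
    static readonly ConditionalWeakTable<Task, object> weakTable = new ConditionalWeakTable<Task, object>();

    static void Main()
    {
        WeakReference strongTracked = TrackStrong();
        WeakReference weakTracked = TrackWeak();

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        // The Dictionary roots its key, so the first task survives collection;
        // the ConditionalWeakTable does not, so the second one can be reclaimed.
        Console.WriteLine($"Dictionary-tracked task alive: {strongTracked.IsAlive}"); // True
        Console.WriteLine($"WeakTable-tracked task alive: {weakTracked.IsAlive}");    // False (release build)
    }

    static WeakReference TrackStrong()
    {
        Task task = new TaskCompletionSource<object>().Task; // never completed
        strongTable[task] = 0;
        return new WeakReference(task);
    }

    static WeakReference TrackWeak()
    {
        Task task = new TaskCompletionSource<object>().Task; // never completed
        weakTable.Add(task, new object());
        return new WeakReference(task);
    }
}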
In the meantime, I’m going to remove this from the high priority milestone, because a memory leak in debugger scenarios is far less concerning than a more general memory leak.
For anyone experiencing memory issues who finds this issue, I highly recommend reading this section of our performance docs. If you are seeing evidence of a memory leak when a debugger is not attached, please file a new issue and we will investigate there!
Ok, so I ran this with the VS 2019 Performance Profiler (memory only), using a debug build of almost exactly the code in Chris’s sample, on Functions V3 and v2.5.1 of the extension. The only caveat is that I had to translate the Durable 1.x code to Durable 2.x and remove the parallelSubOrchestrations parameter from ContinueAsNew(), because I couldn’t identify where it was being created or consumed. After 2 hours, the app has essentially oscillated between ~160 MB and ~210 MB, mostly hovering around 175 MB during the two-minute sleep window.
This seems to confirm that without a debugger firmly attached, we don’t seem to experience a memory leak (or if we do, it is much more minor than the one originally reported).
I will try attaching a debugger to see if memory usage starts to grow.
Ok, so I have taken a closer look at Task.s_currentActiveTasks to see how we could resolve this issue. Looking into the Stack Overflow comments for this question, it looks like s_currentActiveTasks is only used by the debugger, which means this wouldn’t impact production applications.

I am going to try reproing this both locally and in production to see in which scenarios a memory leak is present.
Hi @cgillum, I was able to reproduce the memory leak issue even when fanning out to activity functions. Please see the code sample below:
host.json file:
It seems that the list of HistoryEvent objects is retained even after the orchestration completes.

Just in case, I’ve attached two memory snapshots so a diff can be done: https://www.dropbox.com/s/p1idbb7lqw54062/Snapshots.zip?dl=0