azure-functions-durable-extension: Memory leak when debugger attached

Test Setup

I created a function app that fans out to 1,000 sub-orchestrations (each of which runs a simple hello-world sequence) and ran it locally for several hours using ContinueAsNew.

Top-Level Orchestration

The top-most orchestration calls a single sub-orchestration in a ContinueAsNew loop:

  [FunctionName(nameof(RunContinuousSubOrchestratorFanOut))]
  public static async Task RunContinuousSubOrchestratorFanOut(
      [OrchestrationTrigger] DurableOrchestrationContext ctx,
      TraceWriter log)
  {
      // Number of sub-orchestrations to fan out to on each iteration
      const int parallelSubOrchestrations = 1000;

      await ctx.CallSubOrchestratorAsync(nameof(SubOrchestratorFanOut), parallelSubOrchestrations);

      // Wait 2 minutes before starting the next run
      if (!ctx.IsReplaying) log.Info("Sleeping for 2 minutes.");
      await ctx.CreateTimer(ctx.CurrentUtcDateTime.AddMinutes(2), CancellationToken.None);
      ctx.ContinueAsNew(parallelSubOrchestrations);
  }

Fan-Out Orchestration

This orchestration fans out to 1,000 sub-orchestrations.

  [FunctionName(nameof(SubOrchestratorFanOut))]
  public static async Task SubOrchestratorFanOut([OrchestrationTrigger] DurableOrchestrationContext ctx)
  {
      int parallelSubOrchestrations = ctx.GetInput<int>();

      IEnumerable<Task> subOrchestrationTasks = Enumerable.Range(0, parallelSubOrchestrations).Select(i => 
          ctx.CallSubOrchestratorAsync(
              functionName: nameof(HelloSequenceOrchestration.HelloSequence),
              input: i));

      Task allSubOrchestrationsCompletedTask = Task.WhenAll(subOrchestrationTasks);

      using (var cts = new CancellationTokenSource())
      {
          Task timeoutTask = ctx.CreateTimer(ctx.CurrentUtcDateTime.AddHours(1), cts.Token);
          if (timeoutTask == await Task.WhenAny(allSubOrchestrationsCompletedTask, timeoutTask))
          {
              throw new TimeoutException("One or more sub-orchestrations failed to complete within an hour!");
          }
          else
          {
              cts.Cancel();
          }
      }
  }

Child Orchestration (simple sequence)

This is the innermost orchestration: a simple sequence of activity function calls.

[FunctionName(nameof(HelloSequence))]
public static async Task<List<string>> HelloSequence(
    [OrchestrationTrigger] DurableOrchestrationContextBase context)
{
    var outputs = new List<string>();

    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "Tokyo"));
    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "Seattle"));
    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "London"));

    // returns ["Hello Tokyo!", "Hello Seattle!", "Hello London!"]
    return outputs;
}

Results

After a few hours, I observed several OutOfMemoryException errors, and noticed that the process working set of func.exe was growing continuously throughout the run.

Here is the call stack:

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Collections.Generic.Dictionary`2.Resize(Int32 newSize, Boolean forceNewHashCodes)
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at System.Threading.Tasks.Task.AddToActiveTasks(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.OutputWaitEtwEvents(Task task, Action continuation)
   at System.Runtime.CompilerServices.TaskAwaiter.OnCompletedInternal(Task task, Action continuation, Boolean continueOnCapturedContext, Boolean flowExecutionContext)
   at System.Runtime.CompilerServices.TaskAwaiter`1.UnsafeOnCompleted(Action continuation)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AwaitUnsafeOnCompleted[TAwaiter,TStateMachine](TAwaiter& awaiter, TStateMachine& stateMachine)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.<>c.<ThrowAsync>b__6_1(Object state)
   at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()

This call stack seems to point to Task.s_currentActiveTasks being unable to expand. This dictionary is responsible for tracking all currently active tasks in the app domain.

Investigation

Looking at allocations in the Visual Studio debugger, I noticed a large number of Task objects that were not getting garbage collected. My understanding is that task-related memory leaks are possible when tasks are created but never completed, and I believe that is exactly what TaskOrchestrationExecutor does: it starts tasks, saves them in the TaskOrchestrationExecutor.result field, and never completes them. The executor objects themselves can be GC'd, but that alone is not sufficient to ensure the Task objects get GC'd.
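As a rough illustration (this is a hypothetical sketch, not code from the extension), simply awaiting tasks that are never completed is enough to keep them registered in the debugger-only active-task table, which matches the call stack above:

  using System.Threading.Tasks;

  static class ReplayLeakSketch
  {
      // Awaiting an incomplete task goes through TaskAwaiter.OnCompletedInternal,
      // which adds the task to Task.s_currentActiveTasks when async debugging is
      // enabled (e.g. a debugger is attached). The entry is only removed when the
      // task completes, which these tasks never do.
      static async Task AwaitForever(TaskCompletionSource<object> tcs)
      {
          await tcs.Task;
      }

      // Creates many tasks and abandons them without ever completing them,
      // mimicking the in-flight tasks dropped during orchestration replay.
      static void SimulateAbandonedTasks()
      {
          for (int i = 0; i < 100_000; i++)
          {
              _ = AwaitForever(new TaskCompletionSource<object>());
          }
      }
  }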

Workaround

The root cause appears to be tasks that are created but never completed, which only happens during orchestration replay. For that reason, the problem can be partially mitigated by enabling extended sessions, as described in the Performance & Scale documentation: https://docs.microsoft.com/en-us/azure/azure-functions/durable-functions-perf-and-scale#orchestrator-function-replay
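For reference, a minimal host.json fragment that enables extended sessions (shown here using the Functions 2.x+ schema; the idle timeout shown is the default and can be tuned):

"extensions": {
    "durableTask": {
        "extendedSessionsEnabled": true,
        "extendedSessionIdleTimeoutInSeconds": 30
    }
}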

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 2
  • Comments: 21 (4 by maintainers)

Most upvoted comments

Here was my app without a debugger attached running for two hours:

[screenshot: memory usage over two hours without a debugger attached]

Here was my app with a debugger attached:

[screenshot: memory usage with a debugger attached]

Note that attaching the debugger substantially slowed down the processing of sub-orchestrations, so the steady memory increase occurred over far fewer ContinueAsNew iterations.

I think this clearly shows that the memory leak in this sample only happens with a debugger attached, which is what we would expect if s_currentActiveTasks is the culprit. I will rename the issue accordingly.

I think that in order to fix this, we would need to work with the .NET runtime team. In the GitHub issue associated with my previously linked Stack Overflow question, Stephen Toub commented: “Regardless of whether the implementation or debugger is doing the wrong thing here around explicitly removing the task from the debug-only s_currentActiveTasks table, we should change s_currentActiveTasks to be a ConditionalWeakTable or similar rather than a Dictionary, so that it’s not rooting tasks unnecessarily.” That field still appears to be a Dictionary in the current source code, so we can follow up with that team to see whether making that change would fix this issue.
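To illustrate the suggestion (a simplified sketch, not the actual runtime code), a plain Dictionary keeps every tracked Task rooted until it is explicitly removed, whereas a ConditionalWeakTable lets the entry be collected as soon as nothing else references the Task:

  using System.Collections.Generic;
  using System.Runtime.CompilerServices;
  using System.Threading.Tasks;

  static class ActiveTaskTrackingSketch
  {
      // Current shape (simplified): strongly roots every tracked Task until
      // something explicitly removes it from the table.
      static readonly Dictionary<int, Task> s_strongTable = new Dictionary<int, Task>();

      // Suggested shape: the entry disappears automatically once the Task
      // becomes unreachable, so abandoned tasks are not kept alive.
      static readonly ConditionalWeakTable<Task, object> s_weakTable =
          new ConditionalWeakTable<Task, object>();

      public static void Track(Task task)
      {
          lock (s_strongTable) { s_strongTable[task.Id] = task; }
          s_weakTable.Add(task, new object());
      }
  }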

In the meantime, I’m going to remove this from the high priority milestone, because a memory leak in debugger scenarios is far less concerning than a more general memory leak.

For anyone experiencing memory issues who finds this issue, I highly recommend reading this section of our performance docs. If you find other evidence of a memory leak when a debugger is not attached, please file a new issue and we will investigate there!

OK, so I ran this with the VS 2019 Performance Profiler (memory only) against a debug build of almost exactly the code in Chris's sample, on Functions V3 and v2.5.1 of the extension. The only caveat is that I had to translate the Durable 1.x code to Durable 2.x and remove the parallelSubOrchestrations parameter from ContinueAsNew().

After 2 hours, the app has essentially oscillated between ~160 MB and ~210 MB, mostly hovering around 175 MB during the two-minute sleep window.

This seems to confirm that without a debugger attached we don't experience a memory leak (or, if we do, it is much less severe than the one originally reported).

I will try attaching a debugger to see if memory usage starts to grow.

OK, so I have taken a closer look at Task.s_currentActiveTasks to see how we could resolve this issue. Looking at the Stack Overflow comments for this question, it appears that s_currentActiveTasks is only used by the debugger, which means this wouldn't impact production applications.

I am going to try reproducing this both locally and in production to see in which scenarios a memory leak is present.

Hi @cgillum, I was able to reproduce the memory leak issue even when fanning out to activity functions. Please see the code sample below:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using MoreLinq;

namespace Company.FunctionApp
{
    public class DurableFunctionsOrchestration
    {
        [FunctionName(nameof(StartHttp))]
        public async Task<HttpResponseMessage> StartHttp(
            [HttpTrigger(AuthorizationLevel.Anonymous, "get", "post")]
            HttpRequestMessage request,
            [OrchestrationClient] DurableOrchestrationClient client)
        {
            await client.StartNewAsync(nameof(MigrateToSdmOrchestrator), null);

            return new HttpResponseMessage(HttpStatusCode.OK);
        }

        [FunctionName(nameof(MigrateToSdmOrchestrator))]
        public async Task MigrateToSdmOrchestrator([OrchestrationTrigger] DurableOrchestrationContext context)
        {
            var batches = await context.CallActivityAsync<IReadOnlyCollection<VolumesBatch>>(
                nameof(GetUploadedBatchesActivity),
                null);

            var uniqueBatchIds = batches
                .Select(m => m.OriginalBatch)
                .GroupBy(b => (b.Country, b.VeType, b.VoType))
                .Select(g => g.OrderBy(b => b.BatchId).First().BatchId)
                .ToList();

            var subBatches = batches
                .Where(m => uniqueBatchIds.Contains(m.OriginalBatch.BatchId))
                .OrderBy(b => b.OriginalBatch.BatchId)
                .ThenBy(b => b.Year)
                .ThenBy(b => b.Month);

            var routines = subBatches
                .Select(batch => context.CallActivityAsync(
                    nameof(WriteSubBatchToSdmActivity),
                    batch));

            await Task.WhenAll(routines);
        }

        [FunctionName(nameof(WriteSubBatchToSdmActivity))]
        public async Task WriteSubBatchToSdmActivity([ActivityTrigger] VolumesBatch batch)
        {
            await Console.Out.WriteLineAsync($"Processing batch: {batch.Id}");
            await Task.Delay(3000);
        }

        [FunctionName(nameof(GetUploadedBatchesActivity))]
        public Task<IReadOnlyCollection<VolumesBatch>> GetUploadedBatchesActivity(
            [ActivityTrigger] DurableActivityContext context)
        {
            return Task.FromResult<IReadOnlyCollection<VolumesBatch>>(Enumerable.Range(0, 1000)
                .Select(CreateBatch)
                .ToList()
                .AsReadOnly());
        }

        private VolumesBatch CreateBatch(int index)
        {
            return new VolumesBatch(
                Guid.NewGuid(),
                index,
                index,
                $"Region {index}",
                String.Empty,
                DateTime.UtcNow,
                DateTime.UtcNow,
                DateTime.UtcNow,
                new HeritageBatch
                {
                    Country = $"Country {index}",
                    Created = DateTime.UtcNow,
                    BatchId = index / 2,
                    Ready = DateTime.UtcNow,
                    EndDate = DateTime.UtcNow,
                    StartDate = DateTime.UtcNow,
                    VeType = $"Ve Type {index}",
                    VoType = $"Vo Type {index}"
                });
        }
    }
}

host.json file:

"extensions": {
        "durableTask": {
            "logReplayEvents": false,
            "controlQueueBatchSize": 4,
            "traceInputsAndOutputs": true,
            "maxConcurrentActivityFunctions": 2,
            "maxConcurrentOrchestratorFunctions": 2,
            "maxQueuePollingInterval": "00:00:10"
        }
    }

It seems that the list of HistoryEvent objects is retained even after the orchestration has completed.

[screenshot: memory snapshot showing retained HistoryEvent objects]

Just in case, I've attached two memory snapshots so that a diff can be taken: https://www.dropbox.com/s/p1idbb7lqw54062/Snapshots.zip?dl=0