dd-trace-dotnet: Memory leak when using MongoDb integration

Describe the bug

When the MongoDb integrations are present in the dotnet-integrations.json, our memory usage slowly increases at an average rate of roughly 5 MB per hour. This continues until our tasks in ECS/Fargate run out of memory and the containers are killed and restarted.

To Reproduce

Steps to reproduce the behavior:

  1. Create a task in AWS ECS/Fargate with a .NET 5 service auto-instrumented with dd-trace-dotnet and standard dotnet-integrations.json
  2. Hook up the periodic health check to run the Ping command on the Mongo server (a sketch of such a health check follows this list)
  3. Let the service sit idle for many days
  4. See the memory usage of the task increase over time from within CloudWatch (or other monitoring tool)
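
For illustration, step 2 might look like the following ASP.NET Core health check. This is a sketch: the class name MongoPingHealthCheck, the "admin" database, and the { ping: 1 } command are illustrative assumptions, not details from the original setup.

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using MongoDB.Bson;
using MongoDB.Driver;

public class MongoPingHealthCheck : IHealthCheck
{
    private readonly IMongoClient _client;

    public MongoPingHealthCheck(IMongoClient client) => _client = client;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Runs the { ping: 1 } command against the server; each run is picked
        // up by dd-trace-dotnet's MongoDb integration when it is enabled.
        await _client.GetDatabase("admin").RunCommandAsync<BsonDocument>(
            new BsonDocument("ping", 1), cancellationToken: cancellationToken);
        return HealthCheckResult.Healthy();
    }
}

It would be registered with something like services.AddHealthChecks().AddCheck<MongoPingHealthCheck>("mongo") in ConfigureServices.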

Screenshots

[Memory-usage graph: DdTraceDotnet-MongoMemoryLeak-1; points A through F below refer to this graph]

  • Start to A = services sitting idle with version 1.26.1 and the full contents of datadog-integrations.json
  • A = ECS/Fargate killed and restarted the tasks due to out of memory
  • A to B = no change, services eating memory and sitting idle after the restart
  • B to C = irrelevant tests
  • C = manual restart with a completely empty datadog-integrations.json (its only content is an empty JSON array, [])
  • C to D = sitting idle for a day with very little increase in memory
  • D = manual restart with all contents of datadog-integrations.json restored, but with the MongoDb sections removed
  • D’ = enabled a Hangfire job that runs periodically to execute MongoDb queries
  • D to E = mostly sitting idle with very little increase in memory
  • E = manual restart with all contents of datadog-integrations.json restored (including MongoDb sections)
  • E to F = sitting idle with memory increase
  • F = manual restart after deploy with upgrade to 1.28.0 and updated datadog-integrations.json at that tag (includes MongoDb sections)
  • F to now = sitting idle; the memory increase appears to continue with the latest version

Runtime environment (please complete the following information):

  • Instrumentation mode: Automatic, with the Debian APM package installed in the container
  • Tracer version: 1.26.1 and 1.28.0
  • OS: Container based on the image mcr.microsoft.com/dotnet/aspnet:5.0, which is a Debian Buster (slim) distribution
  • CLR: .NET 5

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (7 by maintainers)

Most upvoted comments

Thanks @hiiru for providing all the extra details; this is really useful.

The source of the problem

We think we’ve tracked down the source of the issue; unfortunately, there appear to be multiple facets to it.

  1. The MongoClient (incorrectly) flows the execution context from the current thread when it is being created. That means if there’s an active trace at the time the MongoClient is created, the background thread that checks the status of the cluster inherits that active trace. Even if the scope is closed on another thread, the background thread will keep adding spans to the trace (a minimal demonstration of this mechanism follows this list).
  2. If you register your MongoClient as a singleton service in ASP.NET Core’s DI container, it is “lazily” created when it is first needed. That’s very application-dependent, but it’s likely to be inside a request trace or something similar, which means it’s likely to capture a trace.
  3. If the parent trace is “long-lived”, as opposed to a typical “short-lived” request trace, then the trace will not be flushed, and the background thread will continue adding spans to it, causing the memory leak.
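
A minimal, runnable sketch of the mechanism in point 1 (not the driver’s code: AsyncLocal<string> stands in for the tracer’s ambient “active scope” state, and the manually started thread stands in for MongoClient’s cluster-monitoring thread):

using System;
using System.Threading;

class ExecutionContextFlowDemo
{
    // Stands in for the tracer's ambient "active scope" state.
    private static readonly AsyncLocal<string> ActiveTrace = new AsyncLocal<string>();

    static void Main()
    {
        ActiveTrace.Value = "request-trace"; // a trace is active on this thread

        // Analogous to MongoClient starting its cluster monitor: by default a
        // new thread captures the current execution context at Start().
        var monitor = new Thread(() =>
            Console.WriteLine($"Background thread sees: {ActiveTrace.Value ?? "(nothing)"}"));
        monitor.Start(); // prints "Background thread sees: request-trace"
        monitor.Join();

        // Wrapping the creation/Start() in using (ExecutionContext.SuppressFlow())
        // would make the background thread print "(nothing)" instead.
    }
}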

Solutions

Depending on the exact behaviour you’re seeing and the nature of the issue, there are various ways to mitigate or fix it.

1. Stop flowing execution context in MongoClient

This is the real solution to the problem, which would involve an update to the MongoDB C# driver to prevent it from flowing the execution context. I’d suggest filing a support case with MongoDB; as a customer, you will hopefully see this prioritised!

2. Create the MongoClient ahead of time

Depending on your application, you may find that you can solve the problem by registering your singleton MongoClient instance in DI directly. For example, if you were previously doing this:

public void ConfigureServices(IServiceCollection services)
{
    services.AddSingleton<IMongoClient>(_ => new MongoClient(Configuration.GetConnectionString("mongo")));
}

Do this instead:

public void ConfigureServices(IServiceCollection services)
{
    var mongoClient = new MongoClient(Configuration.GetConnectionString("mongo"));
    services.AddSingleton<IMongoClient>(mongoClient);
}

This ensures that the context is captured when building Startup, instead of when the client is first used. As long as you aren’t creating long-lived scopes that encompass this part of the app lifecycle (which we would not recommend anyway), this may solve your issue.

More generally, if you’re not using this DI pattern, ensure that you’re not creating the MongoClient inside a “long-lived” trace/scope. If you are creating the client inside an existing scope, you can ensure you don’t capture the context by calling ExecutionContext.SuppressFlow(), for example:

public void ConfigureServices(IServiceCollection services)
{
    MongoClient mongoClient;
    // Suppress execution-context flow so the threads MongoClient starts in
    // its constructor do not capture the ambient trace scope.
    using (System.Threading.ExecutionContext.SuppressFlow())
    {
        mongoClient = new MongoClient(Configuration.GetConnectionString("mongo"));
    }
    services.AddSingleton<IMongoClient>(mongoClient);
}

3. Enable partial flush

We generally advise against creating long-lived traces. By design, traces remain in memory until the last span is closed, so if you have a long-lived trace with many child spans, these will use an increasing amount of memory as the app lifetime increases.

In some cases, you can mitigate this issue by enabling partial flush for the .NET Tracer, as described in the documentation. If the above solution isn’t possible or doesn’t resolve the issue, it may be worth trying DD_TRACE_PARTIAL_FLUSH_ENABLED=true.
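
For example, in ECS/Fargate this can be set via the container definition of the task (a fragment; the surrounding fields of the task definition are omitted):

{
  "environment": [
    { "name": "DD_TRACE_PARTIAL_FLUSH_ENABLED", "value": "true" }
  ]
}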

Other possible enhancements

There seems to be little utility in seeing these infrastructural isMaster spans in Datadog, as they mostly clutter up traces (quite aside from the memory issue described). Would you be interested in an option in the tracer that excludes these spans (and other “infrastructural” commands) from automatic instrumentation by default, while allowing them to be enabled with an environment variable?

After 2 days with the SuppressFlow fix, the memory still looks good. This definitely fixed the issue. 👍

@andrewlock Thank you for the solution and your MongoDb PR to fix it.

I went with both suppressing execution-context flow and DD_TRACE_PARTIAL_FLUSH_ENABLED=true, and I re-enabled the MongoDB integration in the integrations.json with version 1.28.2 of the library. With both in place, I do not see the memory leak!

Also, we use SignalR. By default, each SignalR connection (per user) has one long-running trace in which each call to the SignalR hub is a span, and the trace doesn’t complete until the user disconnects (which could be a long time). Turning on DD_TRACE_PARTIAL_FLUSH_ENABLED seems to have dramatically reduced our memory overhead in this SignalR service. So there’s a bonus!

I have traces/spans that run inside Configure and ConfigureServices so I can monitor how long those take, which means workaround 2 won’t help me directly. But I also have a clear spot where I create the MongoClient for the first time. If I add a call to ExecutionContext.SuppressFlow() before I create the first MongoClient, would that clear the trace?

Good point, yes, that should work 👍 I’ll update the above comment to mention that as an option for others in the thread.

And is SuppressFlow() what the MongoDB driver team would need to add to fix it?

Yes, I believe so (testing indicates that it solves the issue too). I’ve created a PR for the fix, but if you could raise a support issue as a customer, that will very likely increase its priority there 🙂

Also, I see in the documentation that DD_TRACE_PARTIAL_FLUSH_ENABLED should only be used “when you have long-lived traces or traces with many spans”. Why? What artifacts would be introduced if we turned this on?

Partial flush is only supported in recent versions of the agent, so you’ll need to make sure you’re using version 7.26.0+. The main artifact you’ll see is that child spans will appear in Datadog with the parent span only arriving later. If you had a span that lasts the lifetime of the application, for example, it wouldn’t be flushed until after the app shuts down.

Excellent! I’ll get that deployed to a test environment tomorrow. I should have some results by tomorrow afternoon (7/15).