ApplicationInsights-dotnet: Memory leak with Application Insights + ASP.NET Core 3.1 + Linux containers

FYI, this is also being tracked by Azure support as case 119121121001290

Repro Steps

Please see my personal repo for both the sample project and the pre-built containers. Also note that the restricted way Azure App Service spawns containers prevents you from running dotnet-dump on Azure (an issue with the diagnostics team), so this was reproduced outside Azure with the standard VS2019 ASP.NET Core 3.1 template.

Actual Behavior

The user-visible behavior is that Azure App Service recycles the container with the log entry:

Stoping site <sitename> because it exceeded swap usage limits.

What happens under the hood is that the memory leak takes between 1 and 4 days to exhaust about 1.5 GB of memory. This shows up as high used and low available memory, as seen by the free -w -h Linux tool (which reads /proc/meminfo from the container host).

              total        used        free      shared     buffers       cache   available
Mem:           1789        1258          81           5           8         441         378
Swap:             0           0           0

Eventually the container is killed (exit code 137, i.e. SIGKILL due to low memory).

Disabling Application Insights by not setting the APPINSIGHTS_INSTRUMENTATIONKEY environment variable prevents this leak.
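For context, here is a minimal sketch of how that on/off switch maps to code, assuming the stock VS2019 ASP.NET Core 3.1 Startup class. The 2.12 SDK also picks the key up from this environment variable on its own, so the explicit check below is only an illustration of what enabling versus disabling looks like, not the repro's actual code.

using System;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Same variable as above: when it is absent, Application Insights is
        // never registered and the leak does not occur.
        var instrumentationKey =
            Environment.GetEnvironmentVariable("APPINSIGHTS_INSTRUMENTATIONKEY");

        if (!string.IsNullOrEmpty(instrumentationKey))
        {
            // Registers the full Application Insights telemetry pipeline.
            services.AddApplicationInsightsTelemetry(instrumentationKey);
        }

        services.AddControllers();
    }
}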

Expected Behavior

No memory leak

Version Info

SDK Version:

  <ItemGroup>
    <PackageReference Include="Microsoft.ApplicationInsights.AspNetCore" Version="2.12.1" />
    <PackageReference Include="Microsoft.ApplicationInsights.Profiler.AspNetCore" Version="2.1.0-beta1" />
    <PackageReference Include="Microsoft.Extensions.Logging.ApplicationInsights" Version="2.12.1" />
  </ItemGroup>

.NET Version: .NET Core 3.1
How Application was onboarded with SDK: csproj integration
Hosting Info (IIS/Azure WebApps/etc.): Azure App Service Linux containers, as well as locally as plain Linux containers

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 26 (10 by maintainers)

Most upvoted comments

Hi @SidShetye, just wanted to give another update on progress. Due to the tireless efforts of my coworker @sywhang, we found a few issues with varying contributions, but the one that appears most problematic plays out like this (see the attached memory-growth chart):

The yellow line is a baseline scenario, modified slightly from the one you provided so that the Application Insights service profiler captures traces much more frequently. The blue line is after we applied a tentative fix in the coreclr runtime. The x-axis is seconds, the y-axis is committed memory in MB, and we ran both test cases long enough that we thought the trend was apparent.

Interestingly, we thought the native CRT heap was growing because something was calling malloc() and forgetting to call free(), but that wasn't the culprit. Instead, the malloc/free calls were correctly balanced, but there was a bad interaction between the runtime's allocation pattern and the glibc heap policies that decide when unused CRT heap virtual memory should be unmapped and where new allocations should be placed. We didn't determine exactly why our pattern was causing so much trouble for glibc, but the tentative fix is a new custom allocator that gets a pool of memory directly from the OS with mmap(), as sketched below.
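To make the shape of that fix concrete, here is a rough, hypothetical C# sketch of the idea only; the real change lives in native coreclr code, so the P/Invoke below (which assumes a glibc-based Linux image such as the default Debian .NET images) is purely illustrative: grab a large anonymous mapping straight from the kernel with mmap(), carve buffers out of it, and give the whole region back with munmap(), so the allocations never touch the glibc arenas or their trim/unmap heuristics.

using System;
using System.Runtime.InteropServices;

// Illustrative only: Linux constants and "libc.so.6" assume a glibc-based
// image; this is not the actual coreclr fix.
public static class MmapPoolSketch
{
    private const int PROT_READ = 0x1;
    private const int PROT_WRITE = 0x2;
    private const int MAP_PRIVATE = 0x02;
    private const int MAP_ANONYMOUS = 0x20;

    [DllImport("libc.so.6", SetLastError = true)]
    private static extern IntPtr mmap(IntPtr addr, UIntPtr length, int prot,
                                      int flags, int fd, IntPtr offset);

    [DllImport("libc.so.6", SetLastError = true)]
    private static extern int munmap(IntPtr addr, UIntPtr length);

    public static void Main()
    {
        var poolSize = (UIntPtr)(16UL * 1024 * 1024); // one 16 MB pool

        // Anonymous private mapping: memory comes straight from the kernel,
        // bypassing malloc/free and the glibc arena heuristics entirely.
        IntPtr pool = mmap(IntPtr.Zero, poolSize, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, IntPtr.Zero);
        if (pool == new IntPtr(-1)) // MAP_FAILED
            throw new InvalidOperationException("mmap failed");

        // ... a real allocator would hand out chunks of `pool` to the trace
        // buffers here instead of calling malloc() for each one ...

        // Unmapping returns the memory to the OS immediately; there is no
        // separate heap-trim policy deciding when (or whether) that happens.
        munmap(pool, poolSize);
    }
}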

This particular issue is in the coreclr runtime (the Service Profiler uses a runtime feature to create and buffer the trace), so we'd need to address it via a runtime fix. Our current tentative fix still has worse throughput than the original, so we are profiling and refining it in the hope that we don't fix one problem only to create a new one. We'll also need to meet with our servicing team to figure out what options we have for distributing a fix once we have it.

I'll continue to follow up here as we get more info. In the meantime, we found that setting the environment variable MALLOC_ARENA_MAX to a low value appeared to substantially mitigate the memory usage, so that is a potential option to experiment with if you are trying to get this working as soon as possible; since glibc reads it at process start, in a container it would typically go in the Dockerfile as an ENV entry or in an App Service application setting. Here are a few links to other software projects that dealt with similar issues, for a little extra context:

  • https://github.com/prestodb/presto/issues/8993
  • https://github.com/cloudfoundry/java-buildpack/issues/320

Still happening in .NET 6.0.

(screenshot attached)

The fix for this issue has been merged to both the master branch and the .NET Core 3.1 servicing branch (targeting the 3.1.5 servicing release) of the runtime repo. I don't have write access to this repository, but I believe the issue can be closed now.

Thanks @SidShetye for the update! I’ll try to get the backport approved regardless so that others won’t hit the same issue.

Regarding a writeup: I don't have anything written up on this yet because we're so close to the .NET 5 release snap date and I have a huge stack of work items to go through, but I am definitely planning to share some of the tools I wrote as part of this investigation, as well as a writeup of the investigation itself, as soon as I have some extra cycles :)

Hi @SidShetye, a quick update, and thanks for your continued patience on this long bug. The custom tooling work we did is paying dividends: we have isolated two call stacks with high suspicion of leaking memory. As of a few days ago we switched modes to root-causing, creating tentative fixes, and measuring to assess the impact of the changes. Those tasks are still in progress, but it is very satisfying to have some leads now.

This looks related to a similar issue I'm having with a .NET Core 3.1 web API running in Azure App Service. The memory slowly builds up over the course of a day, maxing out the instances and causing the pools to be recycled. Attached is the memory dump we got from production, where we can see that the issue comes from the AddApplicationInsightsTelemetry() method.