ApplicationInsights-dotnet: Memory leak with Application Insights + ASP.NET Core 3.1 + Linux containers
FYI, this is also being tracked by Azure support as case 119121121001290
Repro Steps
Please see my personal repo for both the sample code and the pre-built containers. Also note that the restricted way Azure App Service spawns containers prevents you from running dotnet dump
on Azure (an issue with the diagnostics team), so this is reproduced outside Azure with the standard VS2019 ASP.NET Core 3.1 template.
Actual Behavior
The user visible behavior is that Azure App Service recycles the container with the log entry
Stoping site <sitename> because it exceeded swap usage limits.
What happens under the hood is that the memory leak takes between 1 and 4 days to exhaust about 1.5 GB of memory. This shows up as high used and low available memory as seen by the free -w -h
Linux tool (which reads /proc/meminfo from the container host).
        total   used   free   shared   buffers   cache   available
Mem:     1789   1258     81        5         8     441         378
Swap:       0      0      0
Eventually the container is killed (exit code 137, low memory).
Disabling Application Insights by not setting the APPINSIGHTS_INSTRUMENTATIONKEY
environment variable prevents this leak.
Expected Behavior
No memory leak
Version Info
SDK Version :
<ItemGroup>
<PackageReference Include="Microsoft.ApplicationInsights.AspNetCore" Version="2.12.1" />
<PackageReference Include="Microsoft.ApplicationInsights.Profiler.AspNetCore" Version="2.1.0-beta1" />
<PackageReference Include="Microsoft.Extensions.Logging.ApplicationInsights" Version="2.12.1" />
</ItemGroup>
.NET Version: .NET Core 3.1
How Application was onboarded with SDK: csproj integration
Hosting Info (IIS/Azure WebApps/ etc): Azure App Service Linux containers, as well as locally as plain Linux containers
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 26 (10 by maintainers)
Hi @SidShetye, just wanted to give another update on progress. Due to the tireless efforts of my coworker @sywhang we found a few issues that had varying contributions, but the one that appears most problematic plays out like this:
The yellow line is a baseline scenario, modified slightly from the one you provided so that the Application Insights service profiler captures traces much more frequently. The blue line is after we applied a tentative fix in the coreclr runtime. The x-axis is seconds, the y-axis is committed memory in MB, and we ran both test cases long enough that we thought the trend was apparent.
Interestingly, we initially thought the native CRT heap was growing because something was calling malloc() and forgetting to call free(), but that wasn’t the culprit. The malloc/free calls were correctly balanced; instead, there was a bad interaction between the runtime’s allocation pattern and the glibc heap policies that decide when unused CRT heap virtual memory should be unmapped and where new allocations should be placed. We didn’t determine why specifically our pattern was causing so much trouble for glibc, but the tentative fix uses a new custom allocator that obtains a pool of memory directly from the OS with mmap().
This particular issue is in the coreclr runtime (Service profiler is using a runtime feature to create and buffer the trace) so we’d need to address it via a runtime fix. Our current tentative fix still has worse throughput than the original so we are profiling it and refining it in hopes that we don’t fix one problem only to create a new one. We’ll also need to meet with our servicing team to figure out what options we have for distributing a fix once we have it.
I’ll continue to follow up here as we get more info. In the meantime we found that setting the environment variable MALLOC_ARENA_MAX to a low value appeared to substantially mitigate the memory usage, so that is a potential option to experiment with if you are trying to get this working as soon as possible. Here are a few links to other software projects that dealt with similar issues, to add a little context:
https://github.com/prestodb/presto/issues/8993
https://github.com/cloudfoundry/java-buildpack/issues/320
Still happening in Net6.0
The fix for this issue has been merged to both the master branch and the .NET Core 3.1 servicing branch (targeting the 3.1.5 servicing release) of the runtime repo. I don’t have write access to this repository, but I believe the issue can be closed now.
Thanks @SidShetye for the update! I’ll try to get the backport approved regardless so that others won’t hit the same issue.
Regarding writeup, I don’t have anything written up on this yet because we’re so close to .NET 5 release snap date and I have a huge stack of work items to go through, but I am definitely planning to share some of the tools I wrote as part of this investigation, as well as a writeup on the investigation as soon as I have some extra cycles : )
Hi @SidShetye, a quick update, and thanks for your continued patience on this long bug. The custom tooling work we did is paying dividends: we isolated two call stacks highly suspected of leaking memory. As of a few days ago we switched modes to root-causing, creating tentative fixes, and doing measurements to assess the impact of the changes. Those tasks are still in progress, but it is very satisfying to have some leads now.
This looks related to a similar issue I’m having with a .NET Core 3.1 web API running in Azure App Service. The memory slowly builds up through the day, maxing out the instances and causing the pools to be recycled. Attached is the memory dump we got from production, where we can see that the issue comes from the AddApplicationInsightsTelemetry() method.