runtime: OutOfMemoryException at Monitor.ReliableEnterTimeout with plenty of free memory available

Description

A process running under .NET Core 3.1 threw an OutOfMemoryException (OOM) and was then stopped by a call to FailFast.

There are 2 threads:

  1. The first thread catches the OOM from Monitor.ReliableEnterTimeout and places the exception into the queue of an Application.Logging.Logger instance for processing. Its call stack is the one mentioned in the issue description:
   at System.Threading.Monitor.ReliableEnterTimeout(Object obj, Int32 timeout, Boolean& lockTaken)
   at Application.Profiling.SessionTimer.Session.DurationInstance.TryGetExpiredActionDurationInstance(ActionDurationInstance& instance)
   at Application.Profiling.SessionTimer.Session.ProcessExpiredDurationInstances()
   at Application.Profiling.SessionTimer.Session.RunFireElapsedEvent(Object state)
  2. The other thread reads the exception from the queue and calls FailFast with the exception information, because an OOM is assumed to be a non-recoverable, fatal state for the application:
   at System.Environment.FailFast(System.String)
   at Application.Logging.Logger.ExceptionLoggerAsync(System.Exception)
   at Application.Logging.LogProcessor`1+<<RunAsync>b__15_0>d
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean)
   at System.Threading.Tasks.Task.RunContinuations(System.Object)
   at System.Threading.Tasks.Task.TrySetResult()
   at System.Threading.Tasks.Task+DelayPromise.CompleteTimedOut()
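
The hand-off between the two threads can be sketched outside .NET as well. The following is only a rough C++ analogue under stated assumptions: `ErrorQueue`, `RunAndCapture`, `IsOutOfMemory`, and `ProcessErrors` are hypothetical names, `std::bad_alloc` stands in for OutOfMemoryException, and `std::abort` stands in for FailFast. It is not the application's actual code.

```cpp
#include <cstdlib>    // std::abort
#include <exception>  // std::exception_ptr, std::current_exception
#include <mutex>
#include <new>        // std::bad_alloc
#include <queue>
#include <utility>    // std::move

// Hypothetical queue of captured errors shared by the worker and the logger.
class ErrorQueue {
public:
    void Push(std::exception_ptr error) {
        std::lock_guard<std::mutex> guard(mutex_);
        errors_.push(std::move(error));
    }

    // Moves the oldest error into `out`; returns false if the queue is empty.
    bool TryPop(std::exception_ptr& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (errors_.empty()) return false;
        out = errors_.front();
        errors_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<std::exception_ptr> errors_;
};

// Worker side: run the work item and queue any exception instead of rethrowing.
template <typename Work>
void RunAndCapture(Work&& work, ErrorQueue& queue) {
    try {
        work();
    } catch (...) {
        queue.Push(std::current_exception());
    }
}

// Returns true when the captured (non-null) error is an allocation failure.
bool IsOutOfMemory(const std::exception_ptr& error) {
    try {
        std::rethrow_exception(error);
    } catch (const std::bad_alloc&) {
        return true;
    } catch (...) {
        return false;
    }
}

// Logger side: drain the queue and terminate the process on an OOM.
void ProcessErrors(ErrorQueue& queue) {
    std::exception_ptr error;
    while (queue.TryPop(error)) {
        if (IsOutOfMemory(error)) std::abort();  // FailFast analogue
    }
}
```

In this shape the process dies on the logger thread, not on the thread that hit the failure, which is why the report shows two separate call stacks.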

The OOM came out of the blue: the software was running on a system with 200 GB of RAM still available.

The memory graph with the OOM moment: image

To monitor system performance, we run PerfView continuously on the box with the following command:

C:\PerfView\PerfView.exe collect -CollectMultiple=1000000 -MaxCollectSec=10 -AcceptEULA -NoView -NoGui -CircularMB=1024 -BufferSize=1024 -CPUSampleMSec:10 -ClrEvents=JITSymbols+GC+GCHeapSurvivalAndMovement+Stack -KernelEvents=process+thread+ImageLoad+Profile+ThreadTime -DotNetAllocSampled -NoNGenRundown -NoV2Rundown -LogFile:log.txt

The PerfView collection closest to the moment of the crash showed about 1% of CPU activity in Monitor.Enter/TryEnter, most likely appearing as MonReliableEnter_Portable:

image

In turn, SyncBlockCache::GetNextFreeSyncBlock showed up in the trace (3.1 SyncBlock.cpp, 5.0 SyncBlock.cpp):

image

There are a few possibilities for an OOM there, for instance in SyncBlockCache::GetNextFreeSyncBlock:

            SyncBlockArray* newsyncblocks = new(SyncBlockArray);
            if (!newsyncblocks)
                COMPlusThrowOM ();
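
The excerpt follows a "check the allocation, then throw OOM" pattern. As a rough standalone illustration only, here is the same shape in standard C++. `BlockArray` and `AllocateBlockArrayOrThrow` are hypothetical names, `std::bad_alloc` stands in for the runtime's OOM, and `std::nothrow` is used because plain `new` in standard C++ throws instead of returning null. How the runtime's own allocator behaves is not shown in the excerpt.

```cpp
#include <new>  // std::nothrow, std::bad_alloc

// Hypothetical stand-in for a fixed-size block of entries; not the runtime's type.
struct BlockArray {
    BlockArray* next = nullptr;
    unsigned char storage[4096];
};

// Allocates one BlockArray and converts a failed allocation into an exception,
// mirroring the null check followed by a throw in the excerpt.
BlockArray* AllocateBlockArrayOrThrow() {
    BlockArray* blocks = new (std::nothrow) BlockArray;
    if (!blocks) {
        throw std::bad_alloc();  // stand-in for COMPlusThrowOM()
    }
    return blocks;
}
```

An exception raised this way only reports that this particular allocation request failed. It carries no information about how much physical memory the machine still had free.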

Configuration

  • Windows Server 2016
  • .NET Core 3.1.12
  • CoreCLR Version: 4.700.21.6504
  • 64 vCPUs VM in GCE
  • 600 GB RAM
  • Auto-dump disabled to speed up server restart after a crash (considering the time it takes to collect a 400 GB process dump)

Questions

  • Where else could an OOM be triggered on this particular call path, if not in syncblock.cpp?
  • Under what conditions can an OOM happen on a system with that much free RAM available?
  • Could this somehow be related to the concurrent PerfView activity?

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 43 (21 by maintainers)

Most upvoted comments

I do not think I can get it to .NET 5 (it is going out of support soon). I will try to get it to .NET 6.