runtime: AccessViolationException 0xc0000005 in CopyWatsonBucketsBetweenThrowables / coreclr.dll

Description

A Windows service running in a production environment crashes periodically (once or twice a week) with access violations (AVs), as seen below:

Faulting application name: XXXXX.exe, version: 0.0.0.0, time stamp: 0x5f6b3998
Faulting module name: coreclr.dll, version: 4.700.20.47201, time stamp: 0x5f6a7a28
Exception code: 0xc0000005
Fault offset: 0x0000000000232e2b

According to WinDbg, the point of failure in .NET Core 3.1 is this line:

PTR_VOID pRawSourceWatsonBucketArray = dac_cast<PTR_VOID>(refSourceWatsonBucketArray->GetDataPtr());

Configuration

  • .NET Core 3.1 (coreclr 4.700.20.47201 and 4.700.20.36602)
  • Windows Server 2016 Datacenter x64
  • 64 vCPUs, 400 GB of RAM

Regression?

At first sight, this started immediately after the upgrade to v3.1.7 (dotnet-sdk-3.1.401-win-x64, coreclr.dll 4.700.20.36602) on August 30 (fault offset 0x0000000000232e1b).

Update 1

We found 5 more crashes in our logs going back to the beginning of 2020 (although they were not as frequent as they are now):

  • 3 crashes on coreclr 4.700.20.6602
  • 2 crashes on coreclr 4.700.20.20201

Update 2

Jan 23 2019: a similar exception handling failure:

Faulting application name: XXXXX.exe, version: 0.0.0.0, time stamp: 0x5c007e95
Faulting module name: coreclr.dll, version: 4.6.27129.4, time stamp: 0x5c00327e
Exception code: 0xc0000005
Fault offset: 0x00000000001a3b8d

WinDbg > [e:\a_work\104\s\src\vm\exceptionhandling.cpp @ 1029] (00000001800379e0) coreclr!ProcessCLRException+0x16c1ad | (0000000180037e00) coreclr!ExceptionTracker::ProcessOSExceptionNotification

Other information

The crash dump attached in Visual Studio:

image

A normal .NET exception ends up in the crash at excep.cpp#L10309.

Outer exception:

  • {“Exception has been thrown by the target of an invocation.”} System.Reflection.TargetInvocationException
  • SerializationWatsonBuckets: null

InnerException

  • {“InitializingException: Initializing”} System.Exception {InitializingException}
  • SerializationWatsonBuckets {byte[5616]}

Note that the application is in startup mode and there are quite a few of these Initializing exceptions flying around, which increases the likelihood that the bug (if there is one) manifests itself. Indeed, the crash is more likely to happen during takeoff than during the cruising phase of the process lifetime.

Looking around a bit, I noticed that the method that hits the null reference is not supposed to run unless AreWatsonBucketsPresent returns true. Could it be a race condition?

Please advise whether upgrading to .NET 5.0 or moving to another OS could help with the issue in the short term.

I will provide more details as necessary.

Thank you.

FYI @karelz

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 30 (20 by maintainers)

Most upvoted comments

That error is an internal .NET runtime error. It represents an unexpected failure that should not happen under normal circumstances, after which we cannot continue without risking further corruption of the application state.

The failure we are seeing sometimes surfaces as the internal error and sometimes as an AV because the outcome depends on when the racing thread overwrites the Watson buckets in the exception object. The native runtime code that processes the exception has multiple places where the code path depends on whether the Watson buckets are null or not. For example, in the AV case, there is an if that checks whether the buckets are present and, if they are, calls another method that expects them to be present. But the other thread overwrites the reference with null right after the check, so we enter a method that tries to use it as a valid reference and crashes. AVs in the native runtime are not translated to null reference exceptions except in a few well-defined places.
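The check-then-use pattern described above can be sketched in C++ as follows. This is only an illustrative model under stated assumptions, not the coreclr code and not the actual fix: ThrowableModel, snapshot, copy_buckets and demo are hypothetical names, and a mutex plus std::shared_ptr stand in for the GC-managed reference. It shows the racy shape in a comment and a safer shape that reads the reference once and then checks and uses that same local copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Watson bucket byte array.
using Buckets = std::vector<std::uint8_t>;

// Hypothetical model of an exception object whose bucket reference may be
// replaced or cleared by another thread at any time.
class ThrowableModel {
public:
    void set_buckets(std::shared_ptr<const Buckets> b) {
        std::lock_guard<std::mutex> lock(m_);
        buckets_ = std::move(b);
    }

    // Returns one consistent snapshot of the reference (possibly null).
    std::shared_ptr<const Buckets> snapshot() const {
        std::lock_guard<std::mutex> lock(m_);
        return buckets_;
    }

private:
    mutable std::mutex m_;
    std::shared_ptr<const Buckets> buckets_;
};

// Racy shape (do not use): the field is read twice, so it can become null
// between the check and the use.
//   if (src.snapshot() != nullptr) {
//       return *src.snapshot();   // second read; may dereference null
//   }

// Safer shape: read the reference once, then check and use that same local.
Buckets copy_buckets(const ThrowableModel& src) {
    std::shared_ptr<const Buckets> local = src.snapshot();
    if (!local) {
        return {};      // no buckets at the moment of the read
    }
    return *local;      // the local keeps the array alive while it is copied
}

// Usage sketch: clearing the source after the snapshot cannot invalidate it.
void demo() {
    ThrowableModel t;
    t.set_buckets(std::make_shared<Buckets>(std::size_t{5616}, std::uint8_t{0xAB}));
    std::shared_ptr<const Buckets> held = t.snapshot();
    t.set_buckets(nullptr);                 // simulates the racing thread
    assert(held && held->size() == 5616);   // the snapshot is still valid
    assert(copy_buckets(t).empty());        // a later read sees no buckets
}
```

The design choice here is that the decision (present or not) and the dereference both operate on the same local value, so a concurrent write can change which value is observed but cannot turn a successful check into a null dereference.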

Reusing a single exception instance from multiple threads at the same time doesn’t sound like good practice. I wonder what your reason for doing that is. If your reason was to save allocations of the exception objects (reducing the amount of garbage), reusing a single one actually causes more allocations. ExceptionDispatchInfo.Capture allocates the ExceptionDispatchInfo, and RestoreDispatchState makes, for example, a deep copy of the stack trace (https://github.com/dotnet/runtime/blob/f8f63b1fde85119c925313caa475d9936297b463/src/coreclr/System.Private.CoreLib/src/System/Exception.CoreCLR.cs#L202-L203).

Having said that, the way you reuse exception objects should not result in a runtime crash or internal error, so we should fix that race. However, I think it is unlikely that such a fix would be backported to 5.0 or even 3.1, as this is something you can work around by not reusing the exception from multiple threads at the same time.