runtime: Possible JIT optimization bug on .NET 5 preview

Description

On a managed .NET 5 C# project, I get intermittent errors, sometimes an AccessViolationException, other times ExecutionEngineException, and even NullReferenceException (?) with no managed code in sight on the debugger.

The code does mathematical optimization and is fully deterministic, however the exceptions are not always consistent even though they are frequent.

On the following scenarios these exceptions have never appeared so far (I tried a few times):

  • Debug mode.
  • Release mode with Optimize=False,
  • Release mode with Optimize=True and COMPlus_JITMinOpts=1.

Unfortunately, I don’t know how to create a small program to reproduce the problem, but there seemed to be a particular point where the code breaks the most:

image

For context, _complement is declared as: private readonly ImmutableArray<ulong> _complement An extra detail: in the constructor I initialize a regular ulong[] and use Unsafe.As<ulong[], ImmutableArray<ulong>> to set _complement.

Here is a misterious NullReferenceException (on a value type?!) image

JIT and compiler optimizations have a lot of impact on the performance in this project so I don’t think disabling these optimizations is a long term solution.

Configuration

.NET 5.0.0-preview.6.20305.6

Also tried preview 4 before and was getting similar exceptions.

category:correctness theme:testing skill-level:expert cost:large

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 19 (14 by maintainers)

Commits related to this issue

Most upvoted comments

+1 thanks @dellamonica for helping track this down and fix it before final release. Much appreciated.

Think I’ve finally figured this one out.

Recall the release jit only sometimes generates bad GC info, and the checked jit never does.

The release jit can sometimes end up in a situation where fgFirstBB does not have the BBF_HAS_LABEL or BBF_JMP_TARGET flags set. If this happens and there are GC references live into the method body, they won’t be reported properly as the jit won’t create a label for the first BB.

A checked jit will typically always set BBF_HAS_LABEL flag on fgBirstBB because of this bit of code:

https://github.com/dotnet/runtime/blob/50a999dd3597a33b62a7a1a1712c849e764d50e9/src/coreclr/src/jit/codegenlinear.cpp#L99-L102

It turns out fgHasSwitch is never explicitly initialized, so it tends to be set to true in checked builds, given the default fill pattern (the only fill pattern that would have exposed this error in a checked jit is 00 which currently can’t be used).

In release builds, fgHasSwitch will have a somewhat random value in methods that don’t have switches. Apparently a random 0 tends to be rare, so most of the time the release jit will also set the BBF_HAS_LABEL flag on fgBirstBB. But occasionally the value is 0 and the release jit fails to create a label for the first BB, and thus leaves the jit in position to generate bad GC info.

The above explains why checked jits always produce good GC info, and release jits mostly always do.

This is a regression; the precipitating change is quite likely #1309. Before that change it wasn’t possible (or at least wasn’t common) for the jit to create a scratch BB after building pred lists. The order of these is significant; fgComputePreds sets both BBF_HAS_LABEL and BBF_JMP_TARGET flags on the first block. Post #1309 the jit might create a scratch BB after fgComputePreds has run, and in that case the new first BB does not have the right flags set.

The simple minimal fix is to initialize fgHasSwitch to false in fgInit, and then update fgEnsureFirstBBisScratch to set the necessary flags on the new first BB.

I have sent an updated jit to @dellamonica to test; hopefully this is indeed the problem.

@dellamonica thanks for reporting this, and for your help and patience in tracking this down.