runtime: Strange behavior between Release and Debug builds [Random access violations and null exceptions]

Description

First, the disclaimer: our code is no saint, and this could well be our fault. I just don't know where to look anymore.

This piece of code works properly when the generated code is in debug mode, but not when it is in release mode. There is another interesting behavior, however. The error started appearing after we implemented new high-performance SIMD sorting code based on the same algorithm used by the Garbage Collector, so the biggest difference is that we are now calling that routine. When either the caller assembly (Corax) or the assembly that hosts that code (Sparrow.Server) is emitting debug code, the error does not show itself.

[image]

As you can see from the image, I test for null before the call to .Fill(), and when I try to do the check again it triggers a null exception. Furthermore, match is a struct, so there is no way for it to become null unless the return pointer is somehow wrong or something overwrites the stack.

That counter measures how reliably the issue is triggered. The failure is not deterministic: sometimes it takes 1150 rounds, other times 1270, and so on.

Reproduction Steps

  1. Clone the repo: https://github.com/redknightlois/ravendb/tree/repro-release-mode-memoryissue
  2. Execute in release mode: Voron.Benchmark
  3. After 1200+ rounds the error happens at: https://github.com/redknightlois/ravendb/blob/repro-release-mode-memoryissue/src/Corax/Queries/SortingMatch/SortingMatch.cs#L244

Expected behavior

No null exception.

Actual behavior

Null exception

Regression?

No response

Known Workarounds

No

Configuration

C# 11, .NET 7.0, SDK 7.0.100, Windows 10, AMD x64

Other information

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 28 (18 by maintainers)

Most upvoted comments

So it has more than enough room for almost any type mostly because I define it in ints instead of bytes.

The code here computes

var tmpStartRight = tmpStartLeft + PARTITION_TMP_SIZE_IN_ELEMENTS;

where tmpStartLeft is a long*. So tmpStartRight ends up pointing outside _temp. If I change _temp to be an array of longs then the problem disappears.
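To make the arithmetic behind that observation concrete, here is a minimal sketch of how a buffer sized in ints can be overrun by offsets computed on a long*. Everything in it is an assumption for illustration: the class name TempBufferMath, the constant value 1024, and the layout (a temp buffer holding 2 * PARTITION_TMP_SIZE_IN_ELEMENTS elements, split into a left and a right half) are hypothetical and not taken from the repository.

```csharp
using System;

public static class TempBufferMath
{
    // Hypothetical value; the thread does not state the real constant.
    public const int PartitionTmpSizeInElements = 1024;

    // Assumed layout: the temp buffer holds 2 * PartitionTmpSizeInElements
    // elements of `bufferElementSize` bytes each, while the sorter addresses
    // it through a pointer whose pointee is `pointeeSize` bytes wide.
    public static bool RightHalfFits(int bufferElementSize, int pointeeSize)
    {
        long bufferBytes = 2L * PartitionTmpSizeInElements * bufferElementSize;

        // Pointer arithmetic scales by the pointee size, so
        // `ptr + PartitionTmpSizeInElements` advances this many bytes.
        long rightStart = (long)PartitionTmpSizeInElements * pointeeSize;
        long rightEnd = rightStart + (long)PartitionTmpSizeInElements * pointeeSize;

        return rightEnd <= bufferBytes;
    }

    public static void Main()
    {
        // Buffer declared in ints, accessed through long*: right half is out of bounds.
        Console.WriteLine(RightHalfFits(sizeof(int), sizeof(long)));

        // Buffer declared in longs, accessed through long*: right half fits.
        Console.WriteLine(RightHalfFits(sizeof(long), sizeof(long)));
    }
}
```

Under these assumptions the first call prints False and the second prints True, which would be consistent with the problem disappearing once _temp is an array of longs.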

The problem seems to disappear if I change the initialization to a more well-defined form (from the C# side):

-                var sorter = new Avx2VectorizedSort(il, ir);
+                Avx2VectorizedSort.Init(il, ir, out Avx2VectorizedSort sorter);

This is not a duplicate of #78206. If you set DOTNET_TieredCompilation=0, it crashes immediately in the first iteration, and the GC has not run at all at that point.

@dotnet/jit-contrib This looks like a codegen optimization bug. Could you please take a look?

It looks like it's VXSort related. By the way, we recently had to patch our version due to potential buffer overruns: https://github.com/dotnet/runtime/pull/75364/files