runtime: Assert failure: m_alignpad == 0 in libraries tests

Hit in https://github.com/dotnet/runtime/pull/70226

Dump and logs: https://dev.azure.com/dnceng/public/_build/results?buildId=1806089&view=ms.vss-test-web.build-test-results-tab&runId=48090090&resultId=197319&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Stacktrace:

coreclr!DbgAssertDialog+0x1af [D:\a\_work\1\s\src\coreclr\utilcode\debug.cpp @ 594] 
coreclr!ObjHeader::IllegalAlignPad+0x33 [D:\a\_work\1\s\src\coreclr\vm\syncblk.cpp @ 2952] 
coreclr!ObjHeader::GetBits+0x41 [D:\a\_work\1\s\src\coreclr\vm\syncblk.h @ 1545] 
coreclr!ObjHeader::Validate+0x50 [D:\a\_work\1\s\src\coreclr\vm\syncblk.cpp @ 2042] 
coreclr!Object::ValidateInner+0x493 [D:\a\_work\1\s\src\coreclr\vm\object.cpp @ 581] 
coreclr!Object::Validate+0xa1 [D:\a\_work\1\s\src\coreclr\vm\object.cpp @ 498] 
coreclr!OBJECTREF::operator->+0x21 [D:\a\_work\1\s\src\coreclr\vm\object.cpp @ 1242] 
coreclr!MarshalNative::GCHandleInternalAlloc+0x171 [D:\a\_work\1\s\src\coreclr\vm\marshalnative.cpp @ 492] 
System_Private_CoreLib!System.ReadOnlyMemory`1[[System.Byte, System.Private.CoreLib]].Pin()+0xffffffff`a4114a41

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 28 (28 by maintainers)

Most upvoted comments

@Maoni0 has a fix already. I also found a reasonably frequent repro last week, the readytorun\coreroot_determinism\coreroot_determinism coreclr test, and ran it with her fix over the weekend. Before the fix, it reproed about once every 80 iterations. With the fix, 1000 iterations have passed without any repro.
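
As a rough sanity check on those numbers, here is a minimal sketch. It assumes each iteration fails independently with a constant probability of about 1/80, which the thread does not establish, and the helper name is hypothetical. Under that assumption, the chance of 1000 clean iterations without the fix is (1 - 1/80)^1000.

```cpp
#include <cassert>
#include <cmath>

// Probability of seeing zero failures across `iterations` runs when each run
// fails with probability `perRunFailureRate`.
// Assumption (not from the thread): runs are independent and the rate is constant.
double ProbabilityOfCleanRun(double perRunFailureRate, int iterations)
{
    assert(perRunFailureRate >= 0.0 && perRunFailureRate <= 1.0);
    assert(iterations >= 0);
    return std::pow(1.0 - perRunFailureRate, iterations);
}
```

Calling ProbabilityOfCleanRun(1.0 / 80, 1000) gives a value of a few in a million, so 1000 clean iterations would be very unlikely if the original failure rate still applied.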

@Maoni0 @janvorli A sort of “stable” repro on Windows-x64:

build.cmd Clr+Libs -c Release -rc Checked

cd src\libraries\System.Collections\tests

dotnet build -c Release

cd C:\prj\runtime\artifacts\bin\System.Collections.Tests\Release\net7.0

$env:DOTNET_TieredCompilation=1
$env:DOTNET_TieredPGO=1
$env:DOTNET_JitRandomGuardedDevirtualization=1
$env:DOTNET_ReadyToRun=0

for(;;){C:\prj\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\dotnet.exe exec --runtimeconfig System.Collections.Tests.runtimeconfig.json --depsfile System.Collections.Tests.deps.json xunit.console.dll System.Collections.Tests.dll  -xml testResults.xml -nologo -notrait category=OuterLoop -notrait category=failing}

^ eventually fails with:

Assert failure(PID 20008 [0x00004e28], Thread: 32408 [0x7e98]): m_alignpad == 0

CORECLR! ObjHeader::Validate + 0x50 (0x00007ffd`6a8de630)
CORECLR! Object::ValidateInner + 0x493 (0x00007ffd`6a865773)
CORECLR! Object::Validate + 0xA1 (0x00007ffd`6a8652a1)
CORECLR! OBJECTREF::operator-> + 0x21 (0x00007ffd`6a85e951)
CORECLR! JIT_ClassProfile32 + 0xC7 (0x00007ffd`6aa82707)
<no module>! <no symbol> + 0x0 (0x00007ffd`0deade75)
<no module>! <no symbol> + 0x0 (0x0000013c`a74005f0)
<no module>! <no symbol> + 0x0 (0x00000037`cdd7bd30)
<no module>! <no symbol> + 0x0 (0x00007ffd`0d441780)
<no module>! <no symbol> + 0x0 (0x0000013c`00000000)

I also observe the behaviour @jkotas noticed: m_alignpad is quickly cleared. I even put multiple if (m_alignpad != 0) { checks there.

This assert can be hit for many different GC holes, GC heap corruptions and GC bugs. It is important to at least capture the stack trace where the assert is hit, and to track different stack traces in separate issues.

I think it is unlikely that the Linux arm crash you have seen has the same root cause as this issue. I expect that it is going to have a different stack trace. Unfortunately, we cannot tell for sure since no dump was collected.

I am closing this as non-actionable. If you see a test failing with this assert, do not re-activate this issue. Instead, open a new issue and capture the stack trace where the assert is hit in the issue description.

Thank you very much, @EgorBo! Based on your list of instructions, I was able to fairly quickly repro the assertion failure on my Windows desktop and have confirmed that after:

  1. Removing the demotion check:
//  if ((g == 0) && hp->settings.demotion)
//        return NULL;//could be racing with another core allocating
  2. Adding a null check on the GCSafeMethodTable:
   Object * nextObj = GCHeapUtilities::GetGCHeap ()->NextObj (this);
            if ((nextObj != NULL) &&
                (nextObj->GetGCSafeMethodTable() != nullptr) &&
                (nextObj->GetGCSafeMethodTable() != g_pFreeObjectMethodTable))

the assertion failure no longer occurs, based on running the tests overnight.
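
To make the second change easier to reason about in isolation, here is a minimal sketch of the guard's logic. The types below are hypothetical stand-ins, not the real coreclr definitions, and ShouldValidateNextObject is an illustrative name. The sketch also assumes that the condition gates validation of the next object, which the quoted snippet does not show.

```cpp
#include <cassert>

// Hypothetical stand-ins for the runtime types; NOT the real coreclr definitions.
struct MethodTable {};

struct Object
{
    MethodTable* m_pMethTab = nullptr;
    MethodTable* GetGCSafeMethodTable() const { return m_pMethTab; }
};

// Stand-in for g_pFreeObjectMethodTable, the marker for free space on the GC heap.
static MethodTable g_freeObjectMethodTableStorage;
static MethodTable* const g_pFreeObjectMethodTable = &g_freeObjectMethodTableStorage;

// Mirrors the condition quoted above: proceed only when there is a next object,
// its method table has been set, and it is not a free object.
bool ShouldValidateNextObject(const Object* nextObj)
{
    return (nextObj != nullptr) &&
           (nextObj->GetGCSafeMethodTable() != nullptr) &&
           (nextObj->GetGCSafeMethodTable() != g_pFreeObjectMethodTable);
}
```

The added middle clause is what distinguishes the patched condition from the original: an object whose method table is still null (for example, one that is mid-allocation) is skipped rather than inspected.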

The main question I had was: how do I find out which test the assertion failure occurred in?

Right now, I get a pretty general message that there was an assertion failure in System.Collections.Tests, but no indication of which test it failed on:

=== TEST EXECUTION SUMMARY ===
   System.Collections.Tests  Total: 32703, Errors: 0, Failed: 0, Skipped: 0, Time: 26.645s
  Discovering: System.Collections.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Collections.Tests (found 5562 of 7453 test cases)
  Starting:    System.Collections.Tests (parallel test collections = on, max threads = 20)

Assert failure(PID 24604 [0x0000601c], Thread: 18352 [0x47b0]): m_alignpad == 0

CORECLR! ObjHeader::Validate + 0x43 (0x00007ffa`66f8bcc3)
CORECLR! Object::ValidateInner + 0x493 (0x00007ffa`66f12ed3)
CORECLR! Object::Validate + 0xA1 (0x00007ffa`66f12a01)
CORECLR! OBJECTREF::operator-> + 0x21 (0x00007ffa`66f0c0b1)
CORECLR! JIT_ClassProfile32 + 0xC7 (0x00007ffa`671310b7)
<no module>! <no symbol> + 0x0 (0x00007ffa`0a65e527)
<no module>! <no symbol> + 0x0 (0x000000dd`6cafb030)
<no module>! <no symbol> + 0x0 (0x000000dd`6cafac00)
<no module>! <no symbol> + 0x0 (0x00000276`e4fd77f0)
<no module>! <no symbol> + 0x0 (0x00000276`e543c740)
    File: C:\runtime\src\coreclr\vm\syncblk.cpp Line: 2955
    Image: C:\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\dotnet.exe

  Discovering: System.Collections.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Collections.Tests (found 5562 of 7453 test cases)
  Starting:    System.Collections.Tests (parallel test collections = on, max threads = 20)
  Finished:    System.Collections.Tests

If we know the exact test, we’ll have a quicker repro with a more targeted run.

@mrsharm in my repro you can append -verbose after -notrait category=failing and xunit will print test names. I think it was a different test each run, and I wasn’t able to reproduce the issue when I asked xunit to run just one specific test (via -method xx).
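
With verbose output captured to a file, the tests that were still running when the assert fired can be narrowed down mechanically. The sketch below is illustrative only. It assumes each verbose line has the form "<test name> [STARTING]" or "<test name> [FINISHED] ...", which should be checked against the real xunit output, and FindInFlightTests is a hypothetical helper name.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Returns the names of tests that logged a start marker but no finish marker.
// Assumption: the markers below match the runner's verbose output; adjust them
// if the real log uses a different format.
std::set<std::string> FindInFlightTests(const std::string& log)
{
    const std::string startMarker = " [STARTING]";
    const std::string finishMarker = " [FINISHED]";

    std::set<std::string> inFlight;
    std::istringstream lines(log);
    std::string line;
    while (std::getline(lines, line))
    {
        const auto first = line.find_first_not_of(" \t");
        if (first == std::string::npos)
            continue; // blank line

        const auto started = line.find(startMarker, first);
        if (started != std::string::npos)
        {
            inFlight.insert(line.substr(first, started - first));
            continue;
        }

        const auto finished = line.find(finishMarker, first);
        if (finished != std::string::npos)
            inFlight.erase(line.substr(first, finished - first));
    }
    return inFlight;
}
```

Because the run uses parallel test collections, the result is a set of candidates rather than a single culprit. Tests that share a display name would also collapse into one entry in this sketch.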

and thanks @EgorBo for finding another test that repros consistently.

Note: it doesn’t repro when I change the optimization level from -O2 to -O0 for cee_wks_core (Checked config).

We run libraries tests with checked coreclr on Windows only in the windows.10.amd64.open.rt queue. I do not see any other queues running this combination on Windows.

Here is a query you can use to find all libraries tests that failed with runtime asserts in the last 10 days. (Some of these failed with a different assert.)

https://dataexplorer.azure.com/clusters/engsrvprod/databases/engineeringdata

let wi =
WorkItems
| join kind=leftsemi (Jobs | where Queued > ago (10d) ) on $left.JobName == $right.Name
| where ExitCode == -1073740286;
Files
| lookup kind=inner wi on $left.WorkItemName == $right.Name
| where FileName == "how-to-debug-dump.md"
| join WorkItems on $left.WorkItemName == $right.Name
| join Jobs on $left.JobName == $right.Name
| extend PhaseName = tostring(parse_json(Properties)["System.PhaseName"]),
Pipeline = tostring(parse_json(Properties).DefinitionName),
BuildId = tostring(parse_json(Properties).BuildId)
| where Pipeline !contains("jitstress")
| project Timestamp, QueueName, ExitCode, Uri, ConsoleUri, PhaseName, Pipeline, BuildId
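
The ExitCode filter in the query is a signed 32-bit view of a Windows status code. As a small sketch (the helper names are hypothetical), reinterpreting -1073740286 as unsigned gives 0xC0000602, which to my knowledge is STATUS_FAIL_FAST_EXCEPTION. That mapping is not stated in the thread, so treat it as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Reinterprets a signed process exit code as an unsigned 32-bit status value.
// Signed-to-unsigned conversion is well defined (modulo 2^32).
constexpr std::uint32_t ToNtStatus(std::int32_t exitCode)
{
    return static_cast<std::uint32_t>(exitCode);
}

// Checked at compile time: the exit code used in the query above.
static_assert(ToNtStatus(-1073740286) == 0xC0000602u,
              "exit code should map to 0xC0000602");

// Formats the exit code as "0x" followed by eight uppercase hex digits.
std::string FormatNtStatus(std::int32_t exitCode)
{
    char buffer[11]; // "0x" + 8 hex digits + terminating NUL
    std::snprintf(buffer, sizeof(buffer), "0x%08X",
                  static_cast<unsigned int>(ToNtStatus(exitCode)));
    return buffer;
}
```

The same conversion can be used to adapt the query to other exit codes when looking for different failure kinds.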