runtime: Test failure JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd

Run:runtime-coreclr gcstress-extra 20220508.1

Failed test:

coreclr OSX arm64 Checked gcstress0xc_jitstress2 @ OSX.1200.ARM64.Open

- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd

Error message:

cmdLine:/private/tmp/helix/working/B4220957/w/A7D308F6/e/JIT/Methodical/Methodical_r1/../Arrays/lcs/lcs2_r/lcs2_r.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 5/8/2022 5:10:55 PM, end: 5/8/2022 6:40:58 PM)

Return code:      -100
Raw output file:      /tmp/helix/working/B4220957/w/A7D308F6/uploads/Arrays/lcs/lcs2_r/output.txt
Raw output:
SKIPPING EXECUTION BECAUSE COMPlus_GCStress IS SET

cmdLine:/private/tmp/helix/working/B4220957/w/A7D308F6/e/JIT/Methodical/Methodical_r1/../Arrays/lcs/lcs2_r/lcs2_r.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 5/8/2022 5:10:55 PM, end: 5/8/2022 6:40:58 PM)
Test Harness Exitcode is : -100
To run the test:
Set up CORE_ROOT and run.

/private/tmp/helix/working/B4220957/w/A7D308F6/e/JIT/Methodical/Methodical_r1/../Arrays/lcs/lcs2_r/lcs2_r.sh
Expected: True
Actual:   False


Stack trace
   at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
   at Program.<Main>$(String[] args)
Queued OS Arch Pipeline
2022-05-09T00:03:15.936Z osx.1200.arm64.open arm64 runtime-coreclr gcstress-extra Checked-gcstress0xc_jitstress2
2022-04-09T08:49:58.428Z osx.1015.amd64.open x64 runtime-staging Release
2022-03-30T20:41:58.472Z osx.1015.amd64.open x64 runtime-staging Release
2022-03-16T07:23:50.554Z ubuntu.1804.amd64.android.29.open.svc x64 runtime-extra-platforms Release

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 42 (24 by maintainers)

Most upvoted comments

I think I have a fix. I am testing it now to make sure, but it looks promising.

Why does the race only happen on macOS and not on Linux as well?

Because Linux uses signals for hardware exception handling. For signals, there is no race, since we just return from the signal handler with an updated context.

I’ve forgotten to share more details I’ve found. I’ve used dtrace to monitor the delivery of the SIGCHILD. It reports that the signal was delivered to the target thread, but yet the handler was never called. I have thought that we might be doing something wrong in the signal handler or the thread that we use to invoke the handling code, but then I’ve found that when it hangs, the ctrl-c is delivered through the same signal handler and code path just fine. I am starting to suspect some bad interaction between signal delivery and very frequent hardware exceptions on macOS. I keep investigating.

More update. Monitoring the child process in the repro test (bash) using the dtruss tool, I have found that the child is actually running to completion even in the case of a hang. So, the problem is in the delivery of the SIGCHILD. I keep investigating.

I have spent some time investigating this but so far I wasn’t successful in figuring out the culprit. From logging that I have added, it is clear that the target process was forked, the execve was called, but then the target process becomes zombie. I keep trying to figure out means to find out more.

As @jakobbotsch mentioned, the process is in a bad state after execve is called although execve does not return error. https://github.com/dotnet/runtime/blob/cc5ba0994d6e8a6f5e4a63d1c921a68eda4350e8/src/native/libs/System.Native/pal_process.c#L423-L424

Now, that causes that the OnSigChild callback never triggers because native code stops running, for some reason. https://github.com/dotnet/runtime/blob/cc5ba0994d6e8a6f5e4a63d1c921a68eda4350e8/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs#L1090-L1091 I confirmed that by adding File.WriteAllText("/dev/tty", "Some debug message") to OnSigChild and observing that it doesn’t print anything to the terminal when the hang occurs.

I also went to pal_signal.c and added printfs to the code that is meant to call OnSigChild and that also doesn’t print on hangs https://github.com/dotnet/runtime/blob/cc5ba0994d6e8a6f5e4a63d1c921a68eda4350e8/src/native/libs/System.Native/pal_signal.c#L385-L390

I also attempted to log rather than printing i.e: write to a file from native c, and that confirmed that that code is indeed not executing as nothing is written when the test loop hangs.

I repeated printing and logging for SignalHandlerLoop and SystemNative_RegisterForSigChld, those also don’t execute on hangs.

It appears to me that the last thing that executes on native before the hang is SystemNative_ForkAndExecProcess, which is the function calling execve.

This relates with my previous comment as process.WaitForExit(timeout) depends on OnSigChild to be called.


I assume this is some weird interaction between GCStress and execve on macOS arm64. I will stop looking into it and hope that someone more familiar with GCStress can take a look.

cc @janvorli