runtime: Test failure JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
Run:runtime-coreclr gcstress-extra 20220508.1
Failed test:
coreclr OSX arm64 Checked gcstress0xc_jitstress2 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
Error message:
cmdLine:/private/tmp/helix/working/B4220957/w/A7D308F6/e/JIT/Methodical/Methodical_r1/../Arrays/lcs/lcs2_r/lcs2_r.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 5/8/2022 5:10:55 PM, end: 5/8/2022 6:40:58 PM)
Return code: -100
Raw output file: /tmp/helix/working/B4220957/w/A7D308F6/uploads/Arrays/lcs/lcs2_r/output.txt
Raw output:
SKIPPING EXECUTION BECAUSE COMPlus_GCStress IS SET
cmdLine:/private/tmp/helix/working/B4220957/w/A7D308F6/e/JIT/Methodical/Methodical_r1/../Arrays/lcs/lcs2_r/lcs2_r.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 5/8/2022 5:10:55 PM, end: 5/8/2022 6:40:58 PM)
Test Harness Exitcode is : -100
To run the test:
Set up CORE_ROOT and run.
/private/tmp/helix/working/B4220957/w/A7D308F6/e/JIT/Methodical/Methodical_r1/../Arrays/lcs/lcs2_r/lcs2_r.sh
Expected: True
Actual: False
Stack trace
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
at Program.<Main>$(String[] args)
Queued | OS | Arch | Pipeline |
---|---|---|---|
2022-05-09T00:03:15.936Z | osx.1200.arm64.open | arm64 | runtime-coreclr gcstress-extra Checked-gcstress0xc_jitstress2 |
2022-04-09T08:49:58.428Z | osx.1015.amd64.open | x64 | runtime-staging Release |
2022-03-30T20:41:58.472Z | osx.1015.amd64.open | x64 | runtime-staging Release |
2022-03-16T07:23:50.554Z | ubuntu.1804.amd64.android.29.open.svc | x64 | runtime-extra-platforms Release |
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 42 (24 by maintainers)
I think I have a fix. I am testing it now to make sure, but it looks promising.
Because Linux uses signals for hardware exception handling. For signals, there is no race, since we just return from the signal handler with an updated context.
I’ve forgotten to share more details I’ve found. I’ve used dtrace to monitor the delivery of the SIGCHILD. It reports that the signal was delivered to the target thread, but yet the handler was never called. I have thought that we might be doing something wrong in the signal handler or the thread that we use to invoke the handling code, but then I’ve found that when it hangs, the ctrl-c is delivered through the same signal handler and code path just fine. I am starting to suspect some bad interaction between signal delivery and very frequent hardware exceptions on macOS. I keep investigating.
More update. Monitoring the child process in the repro test (bash) using the
dtruss
tool, I have found that the child is actually running to completion even in the case of a hang. So, the problem is in the delivery of the SIGCHILD. I keep investigating.I have spent some time investigating this but so far I wasn’t successful in figuring out the culprit. From logging that I have added, it is clear that the target process was forked, the execve was called, but then the target process becomes zombie. I keep trying to figure out means to find out more.
As @jakobbotsch mentioned, the process is in a bad state after
execve
is called althoughexecve
does not return error. https://github.com/dotnet/runtime/blob/cc5ba0994d6e8a6f5e4a63d1c921a68eda4350e8/src/native/libs/System.Native/pal_process.c#L423-L424Now, that causes that the
OnSigChild
callback never triggers because native code stops running, for some reason. https://github.com/dotnet/runtime/blob/cc5ba0994d6e8a6f5e4a63d1c921a68eda4350e8/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs#L1090-L1091 I confirmed that by addingFile.WriteAllText("/dev/tty", "Some debug message")
toOnSigChild
and observing that it doesn’t print anything to the terminal when the hang occurs.I also went to
pal_signal.c
and addedprintf
s to the code that is meant to callOnSigChild
and that also doesn’t print on hangs https://github.com/dotnet/runtime/blob/cc5ba0994d6e8a6f5e4a63d1c921a68eda4350e8/src/native/libs/System.Native/pal_signal.c#L385-L390I also attempted to log rather than printing i.e: write to a file from native c, and that confirmed that that code is indeed not executing as nothing is written when the test loop hangs.
I repeated printing and logging for
SignalHandlerLoop
andSystemNative_RegisterForSigChld
, those also don’t execute on hangs.It appears to me that the last thing that executes on native before the hang is
SystemNative_ForkAndExecProcess
, which is the function callingexecve
.This relates with my previous comment as
process.WaitForExit(timeout)
depends onOnSigChild
to be called.I assume this is some weird interaction between GCStress and
execve
on macOS arm64. I will stop looking into it and hope that someone more familiar with GCStress can take a look.cc @janvorli